nips nips2000 nips2000-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Dayan, Sham Kakade
Abstract: Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Explaining away has mostly been considered in terms of inference of states in belief networks. [sent-7, score-0.165]
2 We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. [sent-8, score-0.493]
3 1 Introduction The phenomenon of explaining away is commonplace in inference in belief networks. [sent-10, score-0.206]
4 Explaining away is typically realized by recurrent inference procedures, such as mean field inference (see Jordan, 1998). [sent-12, score-0.315]
5 However, explaining away is not only important in the space of on-line explanations for data; it is also important in the space of weights. [sent-13, score-0.23]
6 This is a very general problem that we illustrate using a phenomenon from animal conditioning called backward blocking (Shanks, 1985; Miller & Matute, 1996). [sent-14, score-0.621]
7 Backward blocking poses a very different problem from standard explaining away, and rather complex theories have been advanced to account for it (e.g., Wagner & Brandon, 1989). [sent-16, score-0.398]
8 We treat it as a case for Kalman filtering, and suggest a novel network model for Kalman filtering to solve it. [sent-17, score-0.091]
9 The association between the sound and the reward is weaker in forward and backward blocking, but stronger (R/2) in the sharing paradigm. [sent-20, score-0.327]
10 The effect that concerns this paper occurs during the second set of trials in backward blocking, in which the association between the sound and the reward is weakened (compared with sharing), even though the sound is not presented during these trials. [sent-21, score-1.097]
11 The apparent association between the sound and the reward established in the first set of trials is explained away in the second set of trials. [sent-22, score-0.579]
12 Not only does this suggestion lack a statistical basis, but its network implementation also requires that the activation of the opponent sound units weakens the weights from the standard sound units to the reward. [sent-24, score-0.552]
13 In this paper, we first extend the Kalman filter based conditioning theory of Sutton (1992) to the case of backward blocking. [sent-26, score-0.74]
14 Next, we show the close relationship between the key quantity for a Kalman filter - namely the covariance matrix of uncertainty about the relationship between the stimuli and the reward - and the symmetric whitening matrix for the stimuli. [sent-27, score-1.162]
15 Then we show how the Goodall algorithm for whitening (Goodall 1960; Atick & Redlich, 1993) makes for an appropriate network implementation for weight updates based on the Kalman filter. [sent-28, score-0.412]
16 The final algorithm is a motivated mixture of unsupervised and reinforcement (or, equivalently in this case, supervised) learning. [sent-29, score-0.072]
17 Last, we demonstrate backward blocking in the full model. [sent-30, score-0.512]
18 2 The Kalman filter and classical conditioning Sutton (1992) suggested that one can understand classical conditioning in terms of normative statistical inference. [sent-31, score-0.879]
19 The idea is that on trial n there is a set of true weights w_n mediating the relationship between the presentation of stimuli x_n and the amount of reward r_n that is delivered, where r_n = w_n·x_n + ε_n (1) and ε_n ∼ N[0, τ²] is zero-mean Gaussian noise, independent from one trial to the next. [sent-32, score-0.419]
20 Crucially, to allow for the possibility (realized in most conditioning experiments) that the true weights might change, the model includes a diffusion term w_{n+1} = w_n + η_n (2), where η_n ∼ N[0, σ²𝕀] is also Gaussian. [sent-35, score-0.409]
21 The task for the animal is to take observations of the stimuli {x_n} and rewards {r_n} and infer a distribution over w_n. [sent-36, score-0.128]
22 Provided that the initial uncertainty can be captured as w_0 ∼ N[0, Σ_0] for some covariance matrix Σ_0, inference takes the form of a standard recursive Kalman filter, for which P(w_n | r_1 ... r_{n-1}) ∼ N[ŵ_n, Σ_n] and [sent-37, score-0.364]
23 ŵ_{n+1} = ŵ_n + Σ_n·x_n (r_n − ŵ_n·x_n) / (x_n·Σ_n·x_n + τ²)   (3)
24 Σ_{n+1} = Σ_n − [Σ_n x_n][Σ_n x_n] / (x_n·Σ_n·x_n + τ²) + σ²𝕀   (4)
where, for vectors a, b and a matrix C, a·b = ∑_i a_i b_i, a·C·b = ∑_ij a_i C_ij b_j, and the matrix [ab]_ij = a_i b_j. [sent-40, score-0.101] [sent-45, score-0.101]
25 If Σ_n ∝ 𝕀, then the update for the mean can be seen as a standard delta rule (Widrow & Stearns, 1985; Rescorla & Wagner, 1972), involving the prediction error (or innovation) δ_n = r_n − ŵ_n·x_n. [sent-46, score-0.401]
26 Note the familiar, but at first sight counterintuitive, result that the update for the covariance matrix does not depend on the innovation or the observed r_n. [sent-48, score-0.375]
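To make equations 1–4 concrete, here is a minimal numerical sketch of one observation/conditioning step followed by the drift step (not from the paper; the variable names and the parameter values τ² = 0.1 and σ² = 0.01 are illustrative assumptions):

```python
import numpy as np

def kalman_step(w_hat, Sigma, x, r, tau2=0.1, sigma2=0.01):
    """One step of the conditioning Kalman filter: conditioning on (x, r) via
    equations 3 and 4, followed by the diffusion/drift term from equation 2.
    tau2 and sigma2 are illustrative values, not the paper's settings."""
    delta = r - w_hat @ x                        # innovation: r_n - w_hat . x_n
    denom = x @ Sigma @ x + tau2                 # x_n . Sigma_n . x_n + tau^2
    w_new = w_hat + (Sigma @ x) * delta / denom  # mean update (equation 3)
    Sigma_new = Sigma - np.outer(Sigma @ x, Sigma @ x) / denom  # conditioning (equation 4)
    Sigma_new = Sigma_new + sigma2 * np.eye(len(w_hat))         # drift
    return w_new, Sigma_new
```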
27 In backward blocking, in the first set of trials, the off-diagonal terms of the covariance matrix Σ_n become negative. [sent-49, score-0.494]
28 This can either be seen from the form of the update equation for the covariance matrix (since x_n = (1,1)), or, more intuitively, from the fact that these trials imply a constraint only on w^l + w^s, therefore forcing the estimates of w^l and w^s to be negatively correlated. [sent-50, score-0.624]
29 The consequence of this negative correlation in the second set of trials is that the sound component of Σ_n·x_n is negative, so positive prediction errors for the light alone push the sound weight down. [sent-51, score-0.165]
30 Another way of looking at this is in terms of explaining away in weight space. [sent-55, score-0.298]
31 From the first set of trials, the animal infers that w^l + w^s = R > 0; from the second, that the prediction owes to w^l rather than w^s, and so the old value ŵ^s = R/2 is explained away by w^l. [sent-56, score-0.271]
32 Sutton (1992) actually suggested the approximation of forcing the off-diagonal components of the covariance matrix Σ_n to be 0, which, of course, prevents the system from accounting for backward blocking. [sent-57, score-0.56]
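A short, self-contained simulation of the backward-blocking design just described (parameter values are illustrative assumptions, not the paper's settings) shows both effects: the off-diagonal covariance turns negative during the compound trials, and the sound weight is subsequently pushed back towards zero even though the sound is never presented in the second stage:

```python
import numpy as np

tau2, sigma2 = 0.1, 0.01                      # illustrative noise and drift variances
w_hat, Sigma = np.zeros(2), np.eye(2)         # weights for [light, sound]; prior N[0, I]
trials = [(np.array([1.0, 1.0]), 1.0)] * 20 + [(np.array([1.0, 0.0]), 1.0)] * 20

for x, r in trials:
    delta = r - w_hat @ x                     # prediction error
    denom = x @ Sigma @ x + tau2
    w_hat = w_hat + (Sigma @ x) * delta / denom
    Sigma = Sigma - np.outer(Sigma @ x, Sigma @ x) / denom + sigma2 * np.eye(2)

print(w_hat)        # light weight near 1; sound weight pushed back towards 0
print(Sigma[0, 1])  # off-diagonal covariance is negative after the compound trials
```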
33 We seek a network account of explaining away in the space of weights by implementing an approximate form of Kalman filtering. [sent-58, score-0.403]
34 3 Whitening and the Kalman filter In conventional applications of the Kalman filter, Xn would typically be constant. [sent-59, score-0.245]
35 The plan for this section is to derive an approximate relationship between the average covariance matrix over the weights, Σ̄, and a whitening matrix for the stimulus inputs. [sent-62, score-0.845]
36 In the next section, we consider an implementation of a particular whitening algorithm as an unsupervised way of estimating the covariance matrix for the Kalman filter and show how to use it to learn the weights Wn appropriately. [sent-63, score-0.954]
37 Consider the case that the x_n are random, with correlation matrix ⟨xxᵀ⟩ = Q, and consider the mean covariance matrix Σ̄ for the Kalman filter, averaging across the variation in x. [sent-64, score-0.413]
38 Then, we can solve for the average of the asymptotic value of Σ̄ in the equation for the update of the Kalman filter as Σ̄ Q Σ̄ ∝ 𝕀 (5). Thus Σ̄ is a whitening filter for the correlation matrix Q of the inputs {x}. [sent-71, score-1.003]
39 Symmetric whitening filters (Σ̄ must be symmetric) are generally unique (Atick & Redlich, 1993). [sent-72, score-0.282]
40 This result is very different from the standard relationship between Kalman filtering and whitening. [sent-73, score-0.093]
41 The standard Kalman filter is a whitening filter for the innovations process δ_n = r_n − ŵ_n·x_n, i.e., it can be seen as extracting all the systematic variation into w_n, leaving only random variation due to the observation noise and the diffusion process. [sent-74, score-0.982] [sent-75, score-0.235]
43 Equation 5 is an additional level of whitening, saying that one can look at the long-run average covariance matrix of the uncertainty in w_n as whitening the input process x_n. (Footnote 2: note also the use of the alternative form of the Kalman filter, in which we perform observation/conditioning followed by drift, rather than drift followed by observation/conditioning.) [sent-76, score-0.211]
44 Figure 1: Whitening. [Figure: the plotted curves are labelled 'on diagonal component' and 'off diagonal component'; the network diagram shows output y(t).] [sent-78, score-0.088]
45 The upper curve shows the average maximum diagonal element of the same matrix. [sent-80, score-0.075]
46 The off-diagonal components are around an order of magnitude smaller than the on-diagonal components, even in the difficult regime where v is near 0 and thus the matrix Q is nearly singular. [sent-81, score-0.101]
47 Identity feedforward weights 𝕀 map inputs x to a recurrent network y(t) whose output is used to make predictions. [sent-83, score-0.289]
48 Learning of the recurrent weights B is based on Goodall's (1960) rule; learning of the prediction weights is based on the delta rule, only using y(0) to make the predictions and y(∞) to change the weights. [sent-84, score-0.419]
50 This is inherently unsupervised, in that whitening takes place without any reference to the observed rewards (or even the innovation). [sent-86, score-0.282]
51 Given the approximation, we tested whether Σ̄ really whitens Q by generating x_n from a Gaussian distribution with mean (1,1) and variance v²𝕀, calculating the long-run average value of Σ̄, and assessing whether Σ̄QΣ̄ is white. [sent-87, score-0.074]
52 The figure shows that the off-diagonal components are comparatively very small, even when v is very small, for which Q has an eigenvalue very near to 0 making the whitening matrix nearly undefined. [sent-89, score-0.383]
53 Equally, in this case, Σ_n tends to have very large values, since, looking at equation 4, the growth in uncertainty coming from σ²𝕀 is not balanced by any observation in the direction (1, −1) that is orthogonal to (1,1). [sent-90, score-0.164]
54 Of course, only the long-run average covariance matrix Σ̄ whitens Q. [sent-91, score-0.322]
55 We make the further approximation of using an on-line estimate of the symmetric whitening matrix as the on-line estimate of the covariance of the weights Σ_n. [sent-92, score-0.672]
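The test described above can be reproduced with a short Monte-Carlo sketch (all settings here, including v, the run length, and the noise variances, are illustrative assumptions): draw x_n ∼ N[(1,1), v²𝕀], iterate the covariance update of equation 4, average Σ_n over the run, and inspect Σ̄QΣ̄:

```python
import numpy as np

rng = np.random.default_rng(0)
v, tau2, sigma2, T = 0.3, 0.1, 0.01, 20000    # illustrative parameters
mu = np.array([1.0, 1.0])
Q = np.outer(mu, mu) + v**2 * np.eye(2)       # correlation matrix <x x^T>

Sigma, Sigma_sum = np.eye(2), np.zeros((2, 2))
for _ in range(T):
    x = mu + v * rng.standard_normal(2)
    denom = x @ Sigma @ x + tau2
    # covariance update (equation 4); note it never uses the reward
    Sigma = Sigma - np.outer(Sigma @ x, Sigma @ x) / denom + sigma2 * np.eye(2)
    Sigma_sum += Sigma

Sigma_bar = Sigma_sum / T
print(Sigma_bar @ Q @ Sigma_bar)  # off-diagonal terms come out much smaller than the diagonal (equation 5)
```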
56 4 A network model Figure 1B shows a network model in which the prediction weights ŵ_n adapt in a manner that is appropriately sensitive to a learned, on-line estimate of the whitening matrix. [sent-93, score-0.718]
57 The network has two components, a mapping from input x to output y(t), via recurrent feedback weights B (the Goodall (1960) whitening filter), and a mapping from y, through a set of prediction weights W to an estimate of the reward. [sent-94, score-0.699]
58 The feedforward weights from x to y are just the identity matrix 𝕀. [sent-96, score-0.244]
59 The initial activity is y(0) = x. The first part of the network is a straightforward implementation of Goodall's whitening filter (Goodall, 1960; Atick & Redlich, 1993). [sent-100, score-0.618]
60 The recurrent dynamics in the y-layer are taken as being purely linear. [sent-101, score-0.083]
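For concreteness, one simple choice of linear dynamics consistent with this description (the exact form is not spelled out in the extracted text, so this is an assumption) is y ← x + By, whose fixed point is y(∞) = (𝕀 − B)⁻¹x:

```python
import numpy as np

def settle(x, B, steps=100):
    """Iterate the assumed linear recurrent dynamics y <- x + B y, starting from
    y(0) = x (identity feedforward weights); converges to (I - B)^{-1} x when
    the spectral radius of B is below 1."""
    y = x.copy()
    for _ in range(steps):
        y = x + B @ y
    return y
```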
61 Goodall's algorithm changes the recurrent weights B using local, anti-Hebbian learning, according to ΔB ∝ (𝕀 − B) − x·y(∞)ᵀ. (6) [sent-103, score-0.221]
62 This rule stabilizes on average when 𝕀 = (𝕀 − B)⁻¹ Q [(𝕀 − B)⁻¹]ᵀ, that is, when (𝕀 − B)⁻¹ is a whitening filter for the correlation matrix Q of the inputs. [sent-105, score-0.737]
63 The learning rule gets wrong the absolute magnitude of the weight changes (since it lacks the x_n·Σ_n·x_n + τ² denominator of the Kalman update). [sent-107, score-0.113]
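Putting the pieces of this section together gives a sketch of the full network (learning rates, the precise form of the Goodall update, and the use of a direct linear solve for y(∞) are all assumptions for illustration, not the paper's exact choices); run on the backward-blocking schedule, it reduces the sound weight during the light-alone trials:

```python
import numpy as np

def run_network(stimuli, rewards, eps_B=0.02, eps_w=0.1):
    """Sketch of the network model: y(0) = x makes the prediction, the settled
    activity y(inf) = (I - B)^{-1} x scales the weight update, and B follows an
    assumed Goodall-style anti-Hebbian rule (cf. equation 6)."""
    d = stimuli.shape[1]
    B = np.zeros((d, d))                            # recurrent whitening weights
    w = np.zeros(d)                                 # prediction weights
    for x, r in zip(stimuli, rewards):
        y_inf = np.linalg.solve(np.eye(d) - B, x)   # equilibrium of the linear dynamics
        delta = r - w @ x                           # prediction error, using y(0) = x
        w = w + eps_w * delta * y_inf               # delta rule on the whitened representation
        B = B + eps_B * ((np.eye(d) - B) - np.outer(x, y_inf))  # assumed form of equation 6
    return w, B

# Backward blocking: 20 light+sound trials, then 20 light-alone trials, reward 1 throughout
stimuli = np.array([[1.0, 1.0]] * 20 + [[1.0, 0.0]] * 20)
rewards = np.ones(len(stimuli))
w, B = run_network(stimuli, rewards)
print(w)  # light weight high; sound weight reduced relative to a sharing-only schedule
```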
64 5 Results Figure 2 shows the result of learning in backward blocking. [sent-110, score-0.246]
65 In association with r_n = 1, the stimulus x_n = (1,1) was first presented for 20 trials, then the stimulus x_n = (1,0) was presented for a further 20 trials. [sent-111, score-0.134]
66 Figure 2A shows the development of the weights ŵ^l_n (solid) and ŵ^s_n (dashed). [sent-112, score-0.169]
67 During the first set of trials both weights grow towards 0.5; during the second set, they differentiate sharply, with the weight associated with the light growing towards 1 and that with the sound, which is explained away, growing towards 0. [sent-114, score-0.19]
68 Figure 2B shows the development of two terms in the estimated covariance matrix. [sent-115, score-0.206]
69 The negative covariance between light and sound is evident, and causes the sharp changes in the weights on the 21st trial. [sent-116, score-0.507]
70 The increases in the magnitudes of Σ^LL_n and Σ^LS_n during the first stage of backward blocking come from the lack of information in the input about w^l_n − w^s_n, despite its continual diffusion (from equation 2). [sent-118, score-0.437]
71 Figures 2 E-H show a non-pathological case with observation noise added. [sent-121, score-0.08]
72 6 Discussion We have shown how the standard Kalman filter produces explaining away in the space of weights, and suggested and proved efficacious a natural network model for implementing the Kalman filter. [sent-123, score-0.606]
73 The model mixes unsupervised learning of a whitener for the observation process (i.e., the x_n of equation 1), providing the covariance matrix governing the uncertainty in the weights, with supervised (or, equivalently, reinforcement) learning of the mean values of the weights. [sent-124, score-0.631]
74 Unsupervised learning is reasonable since the evolution of the covariance matrix of the weights is independent of the innovations. [sent-125, score-0.358]
76 Figure 2: Backward blocking in the full model. [sent-171, score-0.406]
77 A) The development of w over 20 trials with Xn = (1,1) and 20 with Xn = (1,0) . [sent-172, score-0.192]
78 B) The development of the estimated covariance of the weight for the light, Σ^LL_n, and the cross-covariance between the light and the sound, Σ^LS_n. [sent-173, score-0.528]
79 C & D) The development of ŵ and Σ from the exact Kalman filter with parameters (σ = . [sent-176, score-0.304]
80 E) The development of w as in A), except with multiplicative Gaussian noise added (i.e., noise with standard deviation 0. [sent-179, score-0.152]
81 F & G) The comparison of w in the model (solid line) and in the exact Kalman filter (dashed line), using the same parameters for the Kalman filter as in C) and D). [sent-181, score-0.49]
82 Further work is needed to understand how to set the parameters of the Goodall learning rule to match σ² and τ² exactly. [sent-184, score-0.073]
83 Hinton (personal communication) has suggested an alternative interpretation of Kalman filtering based on a heteroassociative novelty filter. [sent-185, score-0.108]
84 Here, the idea is to use the recurrent network B only once, rather than to equilibrium, with (as for our model) y_n(0) = x_n and the prediction v_n = ŵ_n·x_n. [sent-186, score-0.346]
85 If we compare the update for B_n to that for Σ_n (equation 4), we can see that it amounts approximately to assuming neither observation noise nor drift. [sent-196, score-0.109]
86 Thus, whereas our network model approximates the long-run covariance matrix, the novelty filter approximates the instantaneous covariance matrix directly, and could clearly be adapted to take account of noise. [sent-197, score-0.744]
87 Unfortunately, there are few quantitatively precise experimental results on backwards blocking, so it is hard to choose between different possible rules. [sent-198, score-0.083]
88 Sutton (1992) suggested an online way of estimating the elements of the covariance matrix, observing that E[δ_n²] = τ² + x_n·Σ_n·x_n (8), and so considered using a standard delta rule to fit the squared innovation using a quadratic input representation ((x^l_n)², (x^s_n)², x^l_n·x^s_n, 1). [sent-200, score-0.186] [sent-202, score-0.206]
90 The weight associated with the last element, i.e., the bias, should come to be the observation noise τ²; the weights associated with the other elements are just the components of Σ_n. (Footnote 3: the x^l_n·x^s_n term was omitted from Sutton's diagonal approximation to Σ_n.) [sent-203, score-0.083] [sent-204, score-0.231]
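A minimal sketch of the scheme just described (a rendering under stated assumptions, with an illustrative learning rate): fit the squared innovation with a delta rule on the quadratic representation; the bias weight then tracks τ² and the other weights track the components of Σ_n (the cross-term weight up to a factor of two):

```python
import numpy as np

def quad_features(x):
    # quadratic input representation ((x^l)^2, (x^s)^2, x^l * x^s, 1)
    return np.array([x[0] ** 2, x[1] ** 2, x[0] * x[1], 1.0])

def fit_covariance(innovations, inputs, lr=0.05):
    """Delta rule fitting E[delta^2] = tau^2 + x . Sigma . x (equation 8).
    Returns estimates of (Sigma_ll, Sigma_ss, ~2*Sigma_ls, tau^2)."""
    u = np.zeros(4)
    for delta, x in zip(innovations, inputs):
        phi = quad_features(x)
        u = u + lr * (delta ** 2 - u @ phi) * phi   # delta rule on the squared innovation
    return u
```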
92 The most critical concern about this is that it is not obvious how to use the resulting covariance matrix to control learning about the mean values of the weights. [sent-205, score-0.275]
93 There is also the more theoretical concern that the covariance matrix should really be independent of the prediction errors, one manifestation of which is that the occurrence of backward blocking in the model of equation 8 is strongly sensitive to initial conditions. [sent-206, score-0.876]
94 Although backward blocking is a robust phenomenon, particularly in human conditioning experiments (Shanks, 1985), it is not observed in all animal conditioning paradigms. [sent-207, score-1.075]
95 One possibility for why not is that the anatomical substrate of the cross-modal recurrent network (the B weights in the model) is not ubiquitously available. [sent-208, score-0.256]
96 In its absence, y(∞) = y(0) = x_n in response to an input x_n, and so the network will perform like the standard delta or Rescorla-Wagner (Rescorla & Wagner, 1972) rule. [sent-209, score-0.157]
97 The Kalman filter is only one part of a more complicated picture for statistically normative models of conditioning. [sent-210, score-0.278]
98 Acknowledgements We are very grateful to David Shanks, Rich Sutton, Read Montague and Terry Sejnowski for discussions of the Kalman filter model and its relationship to backward blocking, and to Sam Roweis for comments on the paper. [sent-213, score-0.527]
99 Biological significance in forward and backward blocking: Resolution of a discrepancy between animal conditioning and human causal judgment. [sent-224, score-0.598]
100 Evolution of a structured connectionist model of Pavlovian conditioning (AESOP). [sent-236, score-0.249]
wordName wordTfidf (topN-words)
[('kalman', 0.403), ('whitening', 0.282), ('blocking', 0.266), ('conditioning', 0.249), ('backward', 0.246), ('filter', 0.245), ('goodall', 0.213), ('xn', 0.209), ('sound', 0.161), ('wn', 0.149), ('covariance', 0.147), ('trials', 0.133), ('away', 0.127), ('wagner', 0.116), ('weights', 0.11), ('explaining', 0.103), ('matrix', 0.101), ('redlich', 0.083), ('backwards', 0.083), ('yn', 0.083), ('recurrent', 0.083), ('sutton', 0.08), ('atick', 0.077), ('shanks', 0.073), ('reward', 0.07), ('trial', 0.07), ('innovation', 0.066), ('delta', 0.065), ('animal', 0.065), ('brandon', 0.064), ('rescorla', 0.064), ('whitener', 0.064), ('network', 0.063), ('stimuli', 0.063), ('bn', 0.062), ('tn', 0.062), ('light', 0.061), ('association', 0.06), ('development', 0.059), ('kakade', 0.055), ('paradigms', 0.055), ('prediction', 0.051), ('diffusion', 0.05), ('xx', 0.05), ('uncertainty', 0.049), ('observation', 0.048), ('rule', 0.046), ('diagonal', 0.044), ('sharing', 0.043), ('matute', 0.043), ('pavlovian', 0.043), ('qf', 0.043), ('stearns', 0.043), ('whitens', 0.043), ('widrow', 0.043), ('xnxn', 0.043), ('ie', 0.041), ('novelty', 0.041), ('phenomenon', 0.041), ('unsupervised', 0.041), ('weight', 0.039), ('dayan', 0.039), ('suggested', 0.039), ('inference', 0.038), ('forward', 0.038), ('equation', 0.038), ('stimulus', 0.037), ('sham', 0.037), ('relationship', 0.036), ('explanation', 0.034), ('feedforward', 0.033), ('governing', 0.033), ('drift', 0.033), ('normative', 0.033), ('classical', 0.032), ('noise', 0.032), ('rn', 0.032), ('variation', 0.032), ('symmetric', 0.032), ('correlation', 0.032), ('lin', 0.031), ('reinforcement', 0.031), ('average', 0.031), ('sb', 0.031), ('growing', 0.031), ('looking', 0.029), ('update', 0.029), ('standard', 0.029), ('realized', 0.029), ('ga', 0.029), ('ii', 0.029), ('explained', 0.028), ('changes', 0.028), ('implementation', 0.028), ('filtering', 0.028), ('denominator', 0.027), ('forcing', 0.027), ('concern', 0.027), ('match', 0.027), ('psychology', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 49 nips-2000-Explaining Away in Weight Space
Author: Peter Dayan, Sham Kakade
Abstract: Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.
2 0.12783356 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System
Author: Odelia Schwartz, Eero P. Simoncelli
Abstract: We explore the statistical properties of natural sound stimuli preprocessed with a bank of linear filters. The responses of such filters exhibit a striking form of statistical dependency, in which the response variance of each filter grows with the response amplitude of filters tuned for nearby frequencies. These dependencies may be substantially reduced using an operation known as divisive normalization, in which the response of each filter is divided by a weighted sum of the rectified responses of other filters. The weights may be chosen to maximize the independence of the normalized responses for an ensemble of natural sounds. We demonstrate that the resulting model accounts for nonlinearities in the response characteristics of the auditory nerve, by comparing model simulations to electrophysiological recordings. In previous work (NIPS, 1998) we demonstrated that an analogous model derived from the statistics of natural images accounts for non-linear properties of neurons in primary visual cortex. Thus, divisive normalization appears to be a generic mechanism for eliminating a type of statistical dependency that is prevalent in natural signals of different modalities. Signals in the real world are highly structured. For example, natural sounds typically contain both harmonic and rythmic structure. It is reasonable to assume that biological auditory systems are designed to represent these structures in an efficient manner [e.g., 1,2]. Specifically, Barlow hypothesized that a role of early sensory processing is to remove redundancy in the sensory input, resulting in a set of neural responses that are statistically independent. Experimentally, one can test this hypothesis by examining the statistical properties of neural responses under natural stimulation conditions [e.g., 3,4], or the statistical dependency of pairs (or groups) of neural responses. Due to their technical difficulty, such multi-cellular experiments are only recently becoming possible, and the earliest reports in vision appear consistent with the hypothesis [e.g., 5]. An alternative approach, which we follow here, is to develop a neural model from the statistics of natural signals and show that response properties of this model are similar to those of biological sensory neurons. A number of researchers have derived linear filter models using statistical criterion. For visual images, this results in linear filters localized in frequency, orientation and phase [6, 7]. Similar work in audition has yielded filters localized in frequency and phase [8]. Although these linear models provide an important starting point for neural modeling, sensory neurons are highly nonlinear. In addition, the statistical properties of natural signals are too complex to expect a linear transformation to result in an independent set of components. Recent results indicate that nonlinear gain control plays an important role in neural processing. Ruderman and Bialek [9] have shown that division by a local estimate of standard deviation can increase the entropy of responses of center-surround filters to natural images. Such a model is consistent with the properties of neurons in the retina and lateral geniculate nucleus. Heeger and colleagues have shown that the nonlinear behaviors of neurons in primary visual cortex may be described using a form of gain control known as divisive normalization [10], in which the response of a linear kernel is rectified and divided by the sum of other rectified kernel responses and a constant. 
We have recently shown that the responses of oriented linear filters exhibit nonlinear statistical dependencies that may be substantially reduced using a variant of this model, in which the normalization signal is computed from a weighted sum of other rectified kernel responses [11, 12]. The resulting model, with weighting parameters determined from image statistics, accounts qualitatively for physiological nonlinearities observed in primary visual cortex. In this paper, we demonstrate that the responses of bandpass linear filters to natural sounds exhibit striking statistical dependencies, analogous to those found in visual images. A divisive normalization procedure can substantially remove these dependencies. We show that this model, with parameters optimized for a collection of natural sounds, can account for nonlinear behaviors of neurons at the level of the auditory nerve. Specifically, we show that: 1) the shape offrequency tuning curves varies with sound pressure level, even though the underlying linear filters are fixed; and 2) superposition of a non-optimal tone suppresses the response of a linear filter in a divisive fashion, and the amount of suppression depends on the distance between the frequency of the tone and the preferred frequency of the filter. 1 Empirical observations of natural sound statistics The basic statistical properties of natural sounds, as observed through a linear filter, have been previously documented by Attias [13]. In particular, he showed that, as with visual images, the spectral energy falls roughly according to a power law, and that the histograms of filter responses are more kurtotic than a Gaussian (i.e., they have a sharp peak at zero, and very long tails). Here we examine the joint statistical properties of a pair of linear filters tuned for nearby temporal frequencies. We choose a fixed set of filters that have been widely used in modeling the peripheral auditory system [14]. Figure 1 shows joint histograms of the instantaneous responses of a particular pair of linear filters to five different types of natural sound, and white noise. First note that the responses are approximately decorrelated: the expected value of the y-axis value is roughly zero for all values of the x-axis variable. The responses are not, however, statistically independent: the width of the distribution of responses of one filter increases with the response amplitude of the other filter. If the two responses were statistically independent, then the response of the first filter should not provide any information about the distribution of responses of the other filter. We have found that this type of variance dependency (sometimes accompanied by linear correlation) occurs in a wide range of natural sounds, ranging from animal sounds to music. We emphasize that this dependency is a property of natural sounds, and is not due purely to our choice of linear filters. For example, no such dependency is observed when the input consists of white noise (see Fig. 1). The strength of this dependency varies for different pairs of linear filters . In addition, we see this type of dependency between instantaneous responses of a single filter at two Speech o -1 Drums • Monkey Cat White noise Nocturnal nature I~ ~; ~ • Figure 1: Joint conditional histogram of instantaneous linear responses of two bandpass filters with center frequencies 2000 and 2840 Hz. 
Pixel intensity corresponds to frequency of occurrence of a given pair of values, except that each column has been independently rescaled to fill the full intensity range. For the natural sounds, responses are not independent: the standard deviation of the ordinate is roughly proportional to the magnitude of the abscissa. Natural sounds were recorded from CDs and converted to sampling frequency of 22050 Hz. nearby time instants. Since the dependency involves the variance of the responses, we can substantially reduce it by dividing. In particular, the response of each filter is divided by a weighted sum of responses of other rectified filters and an additive constant. Specifically: L2 Ri = 2: (1) 12 j WjiLj + 0'2 where Li is the instantaneous linear response of filter i, strength of suppression of filter i by filter j. 0' is a constant and Wji controls the We would like to choose the parameters of the model (the weights Wji, and the constant 0') to optimize the independence of the normalized response to an ensemble of natural sounds. Such an optimization is quite computationally expensive. We instead assume a Gaussian form for the underlying conditional distribution, as described in [15]: P (LiILj,j E Ni ) '
3 0.12606311 43 nips-2000-Dopamine Bonuses
Author: Sham Kakade, Peter Dayan
Abstract: Substantial data support a temporal difference (TO) model of dopamine (OA) neuron activity in which the cells provide a global error signal for reinforcement learning. However, in certain circumstances, OA activity seems anomalous under the TO model, responding to non-rewarding stimuli. We address these anomalies by suggesting that OA cells multiplex information about reward bonuses, including Sutton's exploration bonuses and Ng et al's non-distorting shaping bonuses. We interpret this additional role for OA in terms of the unconditional attentional and psychomotor effects of dopamine, having the computational role of guiding exploration. 1
4 0.11511258 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
Author: Thomas Natschläger, Wolfgang Maass, Eduardo D. Sontag, Anthony M. Zador
Abstract: Experimental data show that biological synapses behave quite differently from the symbolic synapses in common artificial neural network models. Biological synapses are dynamic, i.e., their
5 0.11239223 121 nips-2000-Sparse Kernel Principal Component Analysis
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
6 0.11075812 137 nips-2000-The Unscented Particle Filter
7 0.10431254 102 nips-2000-Position Variance, Recurrence and Perceptual Learning
8 0.09215676 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations
9 0.089499131 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code
10 0.080414303 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
11 0.078706145 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
12 0.076819174 51 nips-2000-Factored Semi-Tied Covariance Matrices
13 0.07557223 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
14 0.07276649 37 nips-2000-Convergence of Large Margin Separable Linear Classification
15 0.071126767 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
16 0.066693462 146 nips-2000-What Can a Single Neuron Compute?
17 0.065770306 122 nips-2000-Sparse Representation for Gaussian Process Models
19 0.054048162 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
20 0.053448875 120 nips-2000-Sparse Greedy Gaussian Process Regression
topicId topicWeight
[(0, 0.212), (1, -0.103), (2, -0.009), (3, -0.024), (4, 0.013), (5, 0.062), (6, -0.058), (7, 0.018), (8, 0.051), (9, -0.153), (10, -0.027), (11, -0.036), (12, 0.002), (13, 0.054), (14, 0.122), (15, -0.242), (16, -0.09), (17, 0.18), (18, 0.01), (19, 0.107), (20, 0.273), (21, 0.092), (22, 0.062), (23, -0.089), (24, 0.139), (25, 0.058), (26, 0.131), (27, 0.095), (28, 0.089), (29, -0.05), (30, -0.003), (31, 0.063), (32, -0.011), (33, 0.018), (34, 0.045), (35, 0.071), (36, 0.141), (37, 0.111), (38, 0.026), (39, -0.054), (40, 0.034), (41, 0.094), (42, -0.016), (43, -0.06), (44, 0.004), (45, 0.225), (46, -0.013), (47, 0.037), (48, 0.24), (49, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.96679419 49 nips-2000-Explaining Away in Weight Space
Author: Peter Dayan, Sham Kakade
Abstract: Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.
2 0.59482545 43 nips-2000-Dopamine Bonuses
Author: Sham Kakade, Peter Dayan
Abstract: Substantial data support a temporal difference (TO) model of dopamine (OA) neuron activity in which the cells provide a global error signal for reinforcement learning. However, in certain circumstances, OA activity seems anomalous under the TO model, responding to non-rewarding stimuli. We address these anomalies by suggesting that OA cells multiplex information about reward bonuses, including Sutton's exploration bonuses and Ng et al's non-distorting shaping bonuses. We interpret this additional role for OA in terms of the unconditional attentional and psychomotor effects of dopamine, having the computational role of guiding exploration. 1
3 0.43434593 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System
Author: Odelia Schwartz, Eero P. Simoncelli
Abstract: We explore the statistical properties of natural sound stimuli preprocessed with a bank of linear filters. The responses of such filters exhibit a striking form of statistical dependency, in which the response variance of each filter grows with the response amplitude of filters tuned for nearby frequencies. These dependencies may be substantially reduced using an operation known as divisive normalization, in which the response of each filter is divided by a weighted sum of the rectified responses of other filters. The weights may be chosen to maximize the independence of the normalized responses for an ensemble of natural sounds. We demonstrate that the resulting model accounts for nonlinearities in the response characteristics of the auditory nerve, by comparing model simulations to electrophysiological recordings. In previous work (NIPS, 1998) we demonstrated that an analogous model derived from the statistics of natural images accounts for non-linear properties of neurons in primary visual cortex. Thus, divisive normalization appears to be a generic mechanism for eliminating a type of statistical dependency that is prevalent in natural signals of different modalities. Signals in the real world are highly structured. For example, natural sounds typically contain both harmonic and rythmic structure. It is reasonable to assume that biological auditory systems are designed to represent these structures in an efficient manner [e.g., 1,2]. Specifically, Barlow hypothesized that a role of early sensory processing is to remove redundancy in the sensory input, resulting in a set of neural responses that are statistically independent. Experimentally, one can test this hypothesis by examining the statistical properties of neural responses under natural stimulation conditions [e.g., 3,4], or the statistical dependency of pairs (or groups) of neural responses. Due to their technical difficulty, such multi-cellular experiments are only recently becoming possible, and the earliest reports in vision appear consistent with the hypothesis [e.g., 5]. An alternative approach, which we follow here, is to develop a neural model from the statistics of natural signals and show that response properties of this model are similar to those of biological sensory neurons. A number of researchers have derived linear filter models using statistical criterion. For visual images, this results in linear filters localized in frequency, orientation and phase [6, 7]. Similar work in audition has yielded filters localized in frequency and phase [8]. Although these linear models provide an important starting point for neural modeling, sensory neurons are highly nonlinear. In addition, the statistical properties of natural signals are too complex to expect a linear transformation to result in an independent set of components. Recent results indicate that nonlinear gain control plays an important role in neural processing. Ruderman and Bialek [9] have shown that division by a local estimate of standard deviation can increase the entropy of responses of center-surround filters to natural images. Such a model is consistent with the properties of neurons in the retina and lateral geniculate nucleus. Heeger and colleagues have shown that the nonlinear behaviors of neurons in primary visual cortex may be described using a form of gain control known as divisive normalization [10], in which the response of a linear kernel is rectified and divided by the sum of other rectified kernel responses and a constant. 
We have recently shown that the responses of oriented linear filters exhibit nonlinear statistical dependencies that may be substantially reduced using a variant of this model, in which the normalization signal is computed from a weighted sum of other rectified kernel responses [11, 12]. The resulting model, with weighting parameters determined from image statistics, accounts qualitatively for physiological nonlinearities observed in primary visual cortex. In this paper, we demonstrate that the responses of bandpass linear filters to natural sounds exhibit striking statistical dependencies, analogous to those found in visual images. A divisive normalization procedure can substantially remove these dependencies. We show that this model, with parameters optimized for a collection of natural sounds, can account for nonlinear behaviors of neurons at the level of the auditory nerve. Specifically, we show that: 1) the shape offrequency tuning curves varies with sound pressure level, even though the underlying linear filters are fixed; and 2) superposition of a non-optimal tone suppresses the response of a linear filter in a divisive fashion, and the amount of suppression depends on the distance between the frequency of the tone and the preferred frequency of the filter. 1 Empirical observations of natural sound statistics The basic statistical properties of natural sounds, as observed through a linear filter, have been previously documented by Attias [13]. In particular, he showed that, as with visual images, the spectral energy falls roughly according to a power law, and that the histograms of filter responses are more kurtotic than a Gaussian (i.e., they have a sharp peak at zero, and very long tails). Here we examine the joint statistical properties of a pair of linear filters tuned for nearby temporal frequencies. We choose a fixed set of filters that have been widely used in modeling the peripheral auditory system [14]. Figure 1 shows joint histograms of the instantaneous responses of a particular pair of linear filters to five different types of natural sound, and white noise. First note that the responses are approximately decorrelated: the expected value of the y-axis value is roughly zero for all values of the x-axis variable. The responses are not, however, statistically independent: the width of the distribution of responses of one filter increases with the response amplitude of the other filter. If the two responses were statistically independent, then the response of the first filter should not provide any information about the distribution of responses of the other filter. We have found that this type of variance dependency (sometimes accompanied by linear correlation) occurs in a wide range of natural sounds, ranging from animal sounds to music. We emphasize that this dependency is a property of natural sounds, and is not due purely to our choice of linear filters. For example, no such dependency is observed when the input consists of white noise (see Fig. 1). The strength of this dependency varies for different pairs of linear filters . In addition, we see this type of dependency between instantaneous responses of a single filter at two Speech o -1 Drums • Monkey Cat White noise Nocturnal nature I~ ~; ~ • Figure 1: Joint conditional histogram of instantaneous linear responses of two bandpass filters with center frequencies 2000 and 2840 Hz. 
Pixel intensity corresponds to frequency of occurrence of a given pair of values, except that each column has been independently rescaled to fill the full intensity range. For the natural sounds, responses are not independent: the standard deviation of the ordinate is roughly proportional to the magnitude of the abscissa. Natural sounds were recorded from CDs and converted to sampling frequency of 22050 Hz. nearby time instants. Since the dependency involves the variance of the responses, we can substantially reduce it by dividing. In particular, the response of each filter is divided by a weighted sum of responses of other rectified filters and an additive constant. Specifically: L2 Ri = 2: (1) 12 j WjiLj + 0'2 where Li is the instantaneous linear response of filter i, strength of suppression of filter i by filter j. 0' is a constant and Wji controls the We would like to choose the parameters of the model (the weights Wji, and the constant 0') to optimize the independence of the normalized response to an ensemble of natural sounds. Such an optimization is quite computationally expensive. We instead assume a Gaussian form for the underlying conditional distribution, as described in [15]: P (LiILj,j E Ni ) '
4 0.42411321 137 nips-2000-The Unscented Particle Filter
Author: Rudolph van der Merwe, Arnaud Doucet, Nando de Freitas, Eric A. Wan
Abstract: In this paper, we propose a new particle filter based on sequential importance sampling. The algorithm uses a bank of unscented filters to obtain the importance proposal distribution. This proposal has two very
5 0.38991064 102 nips-2000-Position Variance, Recurrence and Perceptual Learning
Author: Zhaoping Li, Peter Dayan
Abstract: Stimulus arrays are inevitably presented at different positions on the retina in visual tasks, even those that nominally require fixation. In particular, this applies to many perceptual learning tasks. We show that perceptual inference or discrimination in the face of positional variance has a structurally different quality from inference about fixed position stimuli, involving a particular, quadratic, non-linearity rather than a purely linear discrimination. We show the advantage taking this non-linearity into account has for discrimination, and suggest it as a role for recurrent connections in area VI, by demonstrating the superior discrimination performance of a recurrent network. We propose that learning the feedforward and recurrent neural connections for these tasks corresponds to the fast and slow components of learning observed in perceptual learning tasks.
6 0.38472283 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
7 0.3534025 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations
8 0.34004673 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
9 0.32517979 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code
10 0.31562111 51 nips-2000-Factored Semi-Tied Covariance Matrices
11 0.30621454 16 nips-2000-Active Inference in Concept Learning
12 0.30216455 121 nips-2000-Sparse Kernel Principal Component Analysis
14 0.27899683 34 nips-2000-Competition and Arbors in Ocular Dominance
15 0.27754241 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
16 0.25122169 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
17 0.24694739 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
18 0.24589001 113 nips-2000-Robust Reinforcement Learning
19 0.24082008 27 nips-2000-Automatic Choice of Dimensionality for PCA
20 0.22025672 37 nips-2000-Convergence of Large Margin Separable Linear Classification
topicId topicWeight
[(10, 0.043), (14, 0.289), (17, 0.108), (32, 0.016), (33, 0.039), (42, 0.018), (54, 0.011), (55, 0.028), (62, 0.049), (65, 0.024), (67, 0.074), (76, 0.04), (79, 0.026), (81, 0.055), (90, 0.036), (91, 0.023), (94, 0.01), (97, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.81809658 49 nips-2000-Explaining Away in Weight Space
Author: Peter Dayan, Sham Kakade
Abstract: Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.
2 0.49301654 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
Author: Thomas Natschläger, Wolfgang Maass, Eduardo D. Sontag, Anthony M. Zador
Abstract: Experimental data show that biological synapses behave quite differently from the symbolic synapses in common artificial neural network models. Biological synapses are dynamic, i.e., their
3 0.48418462 146 nips-2000-What Can a Single Neuron Compute?
Author: Blaise Agüera y Arcas, Adrienne L. Fairhall, William Bialek
Abstract: In this paper we formulate a description of the computation performed by a neuron as a combination of dimensional reduction and nonlinearity. We implement this description for the HodgkinHuxley model, identify the most relevant dimensions and find the nonlinearity. A two dimensional description already captures a significant fraction of the information that spikes carry about dynamic inputs. This description also shows that computation in the Hodgkin-Huxley model is more complex than a simple integrateand-fire or perceptron model. 1
4 0.48248038 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1
5 0.47844839 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
Author: Javier R. Movellan, Paul Mineiro, Ruth J. Williams
Abstract: This paper explores a framework for recognition of image sequences using partially observable stochastic differential equation (SDE) models. Monte-Carlo importance sampling techniques are used for efficient estimation of sequence likelihoods and sequence likelihood gradients. Once the network dynamics are learned, we apply the SDE models to sequence recognition tasks in a manner similar to the way Hidden Markov models (HMMs) are commonly applied. The potential advantage of SDEs over HMMS is the use of continuous state dynamics. We present encouraging results for a video sequence recognition task in which SDE models provided excellent performance when compared to hidden Markov models. 1
6 0.47232744 102 nips-2000-Position Variance, Recurrence and Perceptual Learning
7 0.4721784 122 nips-2000-Sparse Representation for Gaussian Process Models
8 0.46828321 74 nips-2000-Kernel Expansions with Unlabeled Examples
9 0.46330455 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
10 0.46155053 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
11 0.4589628 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
12 0.45755944 79 nips-2000-Learning Segmentation by Random Walks
13 0.45707068 4 nips-2000-A Linear Programming Approach to Novelty Detection
14 0.45676899 37 nips-2000-Convergence of Large Margin Separable Linear Classification
15 0.45560953 111 nips-2000-Regularized Winnow Methods
16 0.45320374 82 nips-2000-Learning and Tracking Cyclic Human Motion
17 0.45251366 133 nips-2000-The Kernel Gibbs Sampler
18 0.45234478 80 nips-2000-Learning Switching Linear Models of Human Motion
19 0.45210996 130 nips-2000-Text Classification using String Kernels
20 0.45144859 21 nips-2000-Algorithmic Stability and Generalization Performance