nips nips2002 nips2002-18 knowledge-graph by maker-knowledge-mining

18 nips-2002-Adaptation and Unsupervised Learning


Source: pdf

Author: Peter Dayan, Maneesh Sahani, Gregoire Deback

Abstract: Adaptation is a ubiquitous neural and psychological phenomenon, with a wealth of instantiations and implications. Although a basic form of plasticity, it has, bar some notable exceptions, attracted computational theory of only one main variety. In this paper, we study adaptation from the perspective of factor analysis, a paradigmatic technique of unsupervised learning. We use factor analysis to re-interpret a standard view of adaptation, and apply our new model to some recent data on adaptation in the domain of face discrimination.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we study adaptation from the perspective of factor analysis, a paradigmatic technique of unsupervised learning. [sent-8, score-0.623]

2 We use factor analysis to re-interpret a standard view of adaptation, and apply our new model to some recent data on adaptation in the domain of face discrimination. [sent-9, score-0.715]

3 Essentially all sensory and central systems show adaptation at a wide variety of temporal scales, and to a wide variety of aspects of their informational milieu. [sent-11, score-0.597]

4 That adaptation is so pervasive makes it most unlikely that a single theoretical framework will be able to provide a compelling treatment. [sent-14, score-0.443]

5 Nevertheless, adaptation should be just as much a tool for theorists interested in modeling neural statistical learning as for psychophysicists interested in neural processing. [sent-15, score-0.443]

6 Put abstractly, adaptation involves short or long term changes to aspects of the statistics of the environment experienced by a system. [sent-16, score-0.478]

7 Conversely, thoughts about adaptation lay at the heart of the earliest suggestions that redundancy reduction and information maximization should play a central role in models of cortical unsupervised learning [4–6, 8, 23]. [sent-18, score-0.804]

8 Redundancy reduction theories of adaptation reached their apogee in the work of Linsker [26], Atick, Li & colleagues [2, 3, 25] and van Hateren [40]. [sent-19, score-0.503]

9 Their mathematical framework (see section 2) is that of maximizing information transmission subject to various sources of noise and limitations on the strength of key signals. [sent-20, score-0.214]

10 Adaptation, by affecting noise levels and informational content (notably probabilistic priors), leads to altered stimulus processing. [sent-22, score-0.185]

11 Early work concentrated on the effects of sensory noise on visual receptive fields; more recent studies [41] have used the same framework to study stimulus-specific adaptation. [sent-23, score-0.249]

12 Redundancy reduction is one major conceptual plank in the modern theory of unsupervised learning. [sent-24, score-0.141]

13 The explicit input combines the signal and input noise; the explicit output is then corrupted by further noise before transmission. [sent-27, score-0.215]

14 We seek the filter that minimizes redundancy subject to a power constraint. [sent-28, score-0.179]

15 The empirical mean is estimated from the data; the uniquenesses capture unmodeled variance and additional noise, such as the input noise. [sent-31, score-0.3]

16 Here, we consider adaptation from the perspective of factor analysis [15], which is one of the most fundamental forms of generative model. [sent-34, score-0.59]

17 After describing the factor analysis model and its relationship with redundancy reduction models of adaptation in section 3, section 4 studies loci of adaptation in one version of this model. [sent-35, score-1.216]

18 As examples, we consider adaptation of early visual receptive fields to light levels [38], orientation detection to a persistent bias (the tilt aftereffect) [9, 16], and a recent report of adaptation of face discrimination to morphed anti-faces. [sent-36, score-1.484]

19 Here, the photoreceptor input, which is the sum of a signal and detector noise, is filtered by a retinal matrix to produce an output (of possibly different dimension) for communication down the optic nerve, against a background of additional noise. [sent-38, score-0.358]

20 We assume that the signal is Gaussian, with a given mean and covariance, and that the noise terms are white and Gaussian, with zero mean and their own covariances; all are mutually independent. [sent-39, score-0.235]
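
The symbols in sentences 19-21 did not survive extraction; a generic reconstruction of the setup they describe (the notation below is assumed, not the paper's own) is:

```latex
% Assumed notation: s = signal, n_in = detector noise, n_out = output noise,
% A = retinal filter matrix, y = optic-nerve output
\begin{align*}
\mathbf{y} &= A\,(\mathbf{s} + \mathbf{n}_{\mathrm{in}}) + \mathbf{n}_{\mathrm{out}}, \\
\mathbf{s} &\sim \mathcal{N}(\bar{\mathbf{s}}, \Sigma_s), \quad
\mathbf{n}_{\mathrm{in}} \sim \mathcal{N}(\mathbf{0}, \sigma_{\mathrm{in}}^2 I), \quad
\mathbf{n}_{\mathrm{out}} \sim \mathcal{N}(\mathbf{0}, \sigma_{\mathrm{out}}^2 I),
\end{align*}
```

with the three sources mutually independent and, in the translation-invariant case, $\Sigma_s$ circulant.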

21 Here, the signal is translation invariant, i.e. its covariance is a circulant matrix. [sent-41, score-0.173]

22 Given no input noise, the mutual information between input and output is given by equation (1), where H is the entropy function (which, for a Gaussian distribution, is proportional to the log-determinant of its covariance matrix). [sent-43, score-0.18]

23 We consider maximizing this with respect to the filter, a calculation which only makes sense in the face of a constraint, such as one on the average output power (a trace constraint). [sent-44, score-0.181]
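
A hedged reconstruction of the objective behind equation (1), in the same assumed notation and for the no-input-noise case:

```latex
% Gaussian channel y = A s + n_out; Gaussian entropies are (1/2) log det of their covariances
\begin{align*}
I(\mathbf{s};\mathbf{y}) &= H(\mathbf{y}) - H(\mathbf{y}\mid\mathbf{s})
  = \tfrac{1}{2}\log\det\!\bigl(A \Sigma_s A^{\top} + \sigma_{\mathrm{out}}^2 I\bigr)
  - \tfrac{1}{2}\log\det\!\bigl(\sigma_{\mathrm{out}}^2 I\bigr), \\
&\text{to be maximized over } A \text{ subject to } \operatorname{tr}\bigl(A \Sigma_s A^{\top}\bigr) \le P .
\end{align*}
```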

24 A;B) Filter power as a function of spatial frequency for the redundancy reduction (A: RR) and factor analysis (B: FA) solutions in the case of translation invariance, for low (solid) and high (dashed) input noise. [sent-48, score-0.537]

25 C) Data [9] (crosses) and RR solution [41] (solid) for the tilt aftereffect. [sent-51, score-0.207]

26 Adaptation was based on reducing the uniquenesses for units activated by the adapting stimulus (fitting the width and strength of this adaptation to the data). [sent-54, score-0.903]

27 In the face of input noise, whitening is dangerous for those channels in which the noise dominates the signal, since noise rather than signal would be amplified by the filter. [sent-56, score-0.494]
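
To make sentence 27 concrete in the same assumed notation: a pure whitening filter sets the squared gain at spatial frequency f inversely to the total input power, so the signal-to-noise ratio it passes at that frequency is just the input one:

```latex
% Whitening gain and the per-frequency signal-to-noise ratio it transmits
|A(f)|^2 \;\propto\; \frac{1}{S(f) + \sigma_{\mathrm{in}}^2},
\qquad
\mathrm{SNR}(f) \;=\; \frac{S(f)}{\sigma_{\mathrm{in}}^2},
```

so frequencies with $S(f) \ll \sigma_{\mathrm{in}}^2$ contribute mostly amplified noise, and the information-maximizing filter attenuates them instead.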

28 The solid curve shows the diagonal components of the filter for small input noise. [sent-60, score-0.138]

29 Intermediate frequencies with input power well above the noise level are comparatively amplified against the output noise. The dashed line, on the other hand, shows the same components for high input noise. [sent-62, score-0.382]

30 This filter is a low-pass filter, as only those few components with sufficient input power are significantly transmitted. [sent-63, score-0.122]

31 High relative input noise arises in cases of low illumination; low noise, in cases of high illumination. [sent-68, score-0.218]

32 Wainwright [41] (see also [10]) suggested an account along exactly these lines for more stimulus-specific forms of adaptation, such as the tilt aftereffect shown in figure 2C. [sent-71, score-0.795]

33 Here (conceptually), subjects are presented with a vertical grating for an adapting period of a few seconds, and are then asked, by one of a number of means, to assess the orientation of test gratings. [sent-72, score-0.308]

34 The crosses in figure 2C show the error in their estimates; the adapting orientation appears to repel nearby angles, so that true values near it are reported as being further away. [sent-73, score-0.529]

35 Wainwright modeled this in the light of a neural population code for representing orientation and a filter related to that of equation 3. [sent-74, score-0.183]

36 Thus, as in the solid line of figure 2A, the transmission through the adapted filter of this signal should be temporarily reduced. [sent-76, score-0.138]

37 If the recipient structures that use the equivalent of the filter output to calculate the orientation of a test grating are unaware of this adaptation, then, as in the solid line of figure 2C, an estimation error like that shown by the subjects will result. [sent-77, score-0.27]

38 3 Factor Analysis and Adaptation. We sought to understand the adaptation of equation 3 and figure 2A in a factor analysis model. [sent-78, score-0.634]

39 Factor analysis [15] is one of the simplest probabilistic generative schemes used to model the unsupervised learning of cortical representations, and underlies many more sophisticated approaches. [sent-79, score-0.18]

40 The case of uniform input noise is particularly interesting, because it is central to the relationship between factor analysis and principal components analysis [20, 34, 39]. [sent-80, score-0.347]

41 Figure 1B shows the elements of a factor analysis model (see Dayan & Abbott [12] for a relevant tutorial introduction). [sent-81, score-0.313]
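
For reference, the standard factor analysis generative model that sentences 41 and 42 refer to (presumably the content of the paper's equation 4; generic notation):

```latex
% K latent factors z generate an N-dimensional observation x
\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I_K), \qquad
\mathbf{x} \mid \mathbf{z} \sim \mathcal{N}(\Lambda \mathbf{z} + \boldsymbol{\mu}, \Psi), \qquad
\Psi = \operatorname{diag}(\psi_1, \dots, \psi_N),
```

so that, marginally, $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Lambda\Lambda^{\top} + \Psi)$; the diagonal entries $\psi_i$ are the uniquenesses.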

42 Marginalizing out the factors, equation 4 specifies a Gaussian distribution for the inputs; indeed, the maximum likelihood values for the parameters given some input data are found by setting the mean to the empirical mean of the inputs that are presented, and by choosing the weights and uniquenesses to maximize the likelihood of the empirical covariance. [sent-83, score-0.179]
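
A minimal sketch of the maximum likelihood fit described in sentence 42, using scikit-learn's EM-based FactorAnalysis; the synthetic data and dimensions are illustrative, not the paper's:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic data from a known factor model: x = Lambda z + noise
n_units, n_factors, n_samples = 20, 3, 5000
Lambda_true = rng.normal(size=(n_units, n_factors))
psi_true = 0.1 + rng.random(n_units)               # per-unit uniquenesses
z = rng.normal(size=(n_samples, n_factors))
x = z @ Lambda_true.T + rng.normal(size=(n_samples, n_units)) * np.sqrt(psi_true)

fa = FactorAnalysis(n_components=n_factors)
fa.fit(x)

# fa.mean_ is the empirical mean; fa.components_ holds the fitted weights
# (n_factors x n_units); fa.noise_variance_ holds the fitted uniquenesses.
model_cov = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)
print("max |model cov - empirical cov| =",
      np.abs(model_cov - np.cov(x, rowvar=False)).max())
```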

43 In most instances of unsupervised learning, the focus is on the recognition or analysis model [30], which maps a presented input into the values of the latent variables which might have generated it, and thereby forms its possible internal representations. [sent-86, score-0.208]

44 However, we later consider forms of adaptation that weaken this dependency. [sent-90, score-0.443]

45 In general, factor analysis and principal components analysis lead to different results. [sent-91, score-0.251]

46 Indeed, although the latter is performed by an eigendecomposition of the covariance matrix of the inputs, the former requires execution of one of a variety of iterative procedures on the same covariance matrix [21, 22, 35]. [sent-92, score-0.123]

47 However, if the uniquenesses are forced to be equal, i.e. the noise is isotropic, then these procedures are almost the same. [sent-93, score-0.244]
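
The equal-uniquenesses case is probabilistic principal components analysis; for background (this is the standard Tipping & Bishop result, not an equation taken from the paper), its maximum likelihood solution is:

```latex
% ML weights when Psi = psi I: top-K eigenvectors of the data covariance, shrunk by the noise
\Lambda_{\mathrm{ML}} = U_K \bigl(L_K - \psi I\bigr)^{1/2} R,
```

where $U_K$ and $L_K$ hold the leading $K$ eigenvectors and eigenvalues of the empirical covariance, $R$ is an arbitrary rotation, and the maximum likelihood $\psi$ is the average of the discarded eigenvalues.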

48 The similarity between this expression (equation 10) and the approximate redundancy reduction expression of equation 3 is evident. [sent-99, score-0.249]
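
Equation 10 itself is not legible in this extract; the standard factor-analysis recognition filter, which sentence 48 presumably compares with equation 3, takes the form:

```latex
% Posterior mean of the factors given an input (two equivalent forms, by the Woodbury identity)
\mathbb{E}[\mathbf{z} \mid \mathbf{x}]
  = \Lambda^{\top}\bigl(\Lambda\Lambda^{\top} + \Psi\bigr)^{-1}(\mathbf{x} - \boldsymbol{\mu})
  = \bigl(I + \Lambda^{\top}\Psi^{-1}\Lambda\bigr)^{-1}\Lambda^{\top}\Psi^{-1}(\mathbf{x} - \boldsymbol{\mu}),
```

a linear filter whose gain, like that of equation 3, trades the strength of the signal captured by $\Lambda$ against the uniquenesses $\Psi$.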

49 Just like that filter, adaptation to high and low light levels (high and low signal-to-noise ratios) leads to a transition from bandpass to lowpass filtering in the recognition weights. [sent-100, score-0.474]
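
A qualitative sketch of that bandpass-to-lowpass transition, treating each spatial-frequency mode independently under the probabilistic-PCA simplification; the 1/f^2 signal spectrum and the two noise levels are assumptions for illustration, not the paper's values:

```python
import numpy as np

freqs = np.linspace(0.05, 5.0, 200)          # spatial frequency (arbitrary units)
signal_power = 1.0 / freqs**2                # assumed natural-scene-like 1/f^2 spectrum

def recognition_gain(s, psi):
    """Per-mode recognition gain lambda / (lambda^2 + psi), with the probabilistic-PCA
    maximum likelihood weight lambda = sqrt(max(s - psi, 0)); modes weaker than the
    uniqueness psi are dropped entirely."""
    lam = np.sqrt(np.clip(s - psi, 0.0, None))
    return lam / (lam**2 + psi)

low_noise = recognition_gain(signal_power, psi=0.04)    # high illumination
high_noise = recognition_gain(signal_power, psi=1.0)    # low illumination

# Low noise: the gain peaks at intermediate frequencies (bandpass-like).
# High noise: only the strong low frequencies survive (lowpass-like).
print("low-noise  gain peaks at f =", round(float(freqs[np.argmax(low_noise)]), 2))
print("high-noise gain peaks at f =", round(float(freqs[np.argmax(high_noise)]), 2))
```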

50 Also, there is no power constraint imposed; rather, something similar derives from the generative model's prior over the latent variables. [sent-102, score-0.135]

51 This analysis is particularly well suited to the standard redundancy reduction treatment of the case of figure 2A, since adding independent noise of the same strength to each of the input variables can automatically be captured by adding that strength to the common uniqueness. [sent-103, score-0.493]

52 However, even though the signal is translation invariant in this case, it need not be that the maximum likelihood factor analysis solution has uniquenesses that are exactly equal. [sent-104, score-0.21]

53 However, it is, to a close approximation, and figure 2B shows that the strength of the principal components of the maximum likelihood weights (evaluated as in the figure caption) shows the same structure of adaptation as the probabilistic principal components solution, as a function of spatial frequency. [sent-105, score-0.728]

54 Figure 2D shows a version of the tilt illusion coming from a factor analysis model given population coded input (with Gaussian tuning curves of fixed orientation bandwidth) and a single factor. [sent-106, score-0.54]

55 However, in a regime in which a linear approximation holds, the one factor can represent the systematic covariation in the activity of the population coming from the single dimension of angular variation in the input. [sent-108, score-0.184]

56 A close match in this model to Wainwright's [41] suggestion is that the uniquenesses for the input units (those around the adapted orientation) that are reliably activated by an adapting stimulus should be decreased, as if the single factor would predict a greater proportion of the variability in the activation of those units. [sent-110, score-0.478]

57 This makes the recognition weights of equation 5 more sensitive to small variations in the input away from the adapted orientation, and so leads to a tilt aftereffect as an estimation bias. [sent-111, score-0.407]
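
A minimal, purely linear sketch of the manipulation in sentences 56-57: Gaussian-tuned input units, a single factor whose generative weights are the derivative of the population response at the adapted (vertical) orientation, reduced uniquenesses for the units the adapting grating drives, and a downstream read-out calibrated to the unadapted gain. All parameter values are illustrative; in this linear regime the effect is a uniform repulsive rescaling of small test angles, and reproducing the full shape of the aftereffect curve requires the nonlinear tuning curves and fitted parameters used in the paper.

```python
import numpy as np

prefs = np.linspace(-90.0, 90.0, 61)        # preferred orientations (deg)
sigma = 15.0                                 # tuning bandwidth (deg), assumed

def rates(theta):
    """Mean response of Gaussian-tuned units to a grating at orientation theta."""
    return np.exp(-0.5 * ((theta - prefs) / sigma) ** 2)

# Linear regime around vertical: the factor's generative weights are the derivative
# of the population response with respect to orientation, evaluated at 0 degrees.
lam = (prefs / sigma**2) * rates(0.0)

psi_base = np.full(prefs.size, 0.05)         # baseline uniquenesses (assumed)
# Adaptation: units reliably activated by the vertical adapting grating have their
# uniquenesses reduced, as if the factor now explained more of their variance.
psi_adapt = psi_base * (1.0 - 0.8 * rates(0.0))

def gain(psi):
    """Scalar gain of the posterior mean: E[z|x] ~ gain * theta for small theta."""
    q = np.sum(lam**2 / psi)
    return q / (1.0 + q)

g0, g1 = gain(psi_base), gain(psi_adapt)
# A decoder calibrated to the unadapted gain reads the adapted code as a larger
# angle, i.e. repulsion away from the adapted (vertical) orientation.
for theta in (2.0, 5.0, 10.0):
    reported = theta * g1 / g0
    print(f"test {theta:4.1f} deg -> reported {reported:5.2f} deg (bias {reported - theta:+.2f})")
```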

58 Our model also shows the same effect as Wainwright's [41] in orientation discrimination, boosting sensitivity near the adapted orientation and reducing it around half a tuning width away. [sent-114, score-0.125]

59 We use this to model a recently reported effect of adaptation on face discrimination. [sent-116, score-0.608]

60 A) Experimental [24] mean propensity to report Adam as a function of the strength of Adam in the input, averaged over all faces (and likewise for FA), for no adaptation ('o'), adaptation to anti-Adam ('x'), and adaptation to anti-Henry. [sent-134, score-1.589]

61 B) Mean propensity in the factor analysis model for the same outcomes. [sent-136, score-0.19]

62 C;D) Experimental and model proportion of reports of Adam when adaptation was to anti-Adam, but various strengths of Henry are presented. [sent-138, score-0.472]

63 The model captures the decrease in reporting Adam given presentation of anti-Henry through a normalization pool (solid), although it does not decrease to quite the same extent as in the data. [sent-139, score-0.163]

64 Just reporting the face with the largest inferred factor value (dashed) shows no decrease in reporting Adam given presentation of anti-Henry. [sent-140, score-0.356]

65 Leopold and his colleagues [24] studied adaptation in the complex stimulus domain of faces. [sent-142, score-0.489]

66 Their experiment involved four target faces (associated with names ‘Adam’, ‘Henry’, ‘Jim’, ‘John’) which were previously unfamiliar to subjects, together with morphed versions of these faces lying on ‘lines’ going through the target faces and the average of all four faces. [sent-143, score-0.368]

67 The task for the subjects was always to identify which of the four faces was presented; this is obviously impossible at the average face, but becomes progressively easier as the average face is morphed progressively further (by an amount called its strength) towards one of the target faces. [sent-145, score-0.47]

68 The circles in figure 3A show the mean performance of the subjects in choosing the correct face as a function of its strength; performance is essentially perfect part of the way to the target face. [sent-146, score-0.277]

69 A negative strength version of one of the target faces (e.g. anti-Adam) was then shown to the subjects for a number of seconds before one of the positive strength faces was shown as a test. [sent-147, score-0.574]

70 As for the tilt aftereffect, discrimination is biased away from the adapted stimulus. [sent-149, score-0.284]

71 Figure 3C shows that adapting to anti-Adam offers the greatest boost to the event that Adam is reported to a test face (say Henry) that is not Adam, at the average face. [sent-150, score-0.276]

72 That presenting Henry should decrease the reporting of Adam is obvious, and is commented on in the paper. [sent-152, score-0.175]

73 However, that presenting anti-Henry should decrease the reporting of Adam is less obvious, since, by removing Henry as a competitor, one might have expected Adam to have received an additional boost. [sent-153, score-0.175]

74 Figure 3B;D shows our factor analysis model of these results. [sent-154, score-0.136]

75 Here, we consider a case with a population of visible units and four factors, one for each face, with generative weights governing the input activity associated with full strength versions of each face, generated from independent distributions. [sent-155, score-0.428]

76 In this representation, morphing is easy, consisting of presenting the generative weights for a face scaled by its strength, plus noise of fixed variance. [sent-156, score-0.297]

77 For reasons discussed below, we considered a normalization pool [17, 37] for the outputs, treating a normalized function of the inferred factor values as the probability that each face would be reported, with a discrimination parameter setting its sharpness. [sent-159, score-0.272]

78 Adaptation to anti-Adam was represented by adjusting the parameters associated with Adam according to the strength of the adapting stimulus. [sent-160, score-0.228]
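
A schematic sketch of the ingredients in sentences 75-78: visible units, one factor per target face with independently drawn generative weights, morphs presented as scaled weights plus noise, the factor-analysis posterior, and a softmax read-out standing in for the normalization pool. The exact normalization and the precise parameter change used for adaptation were lost in extraction, so the softmax and the reference shift used below (subtracting the factor values inferred for the adapting anti-face) are stand-ins chosen for illustration, not the paper's equations.

```python
import numpy as np

rng = np.random.default_rng(1)
faces = ["Adam", "Henry", "Jim", "John"]
n_units = 100

# Generative weights: one factor per target face, drawn independently (assumed).
W = rng.normal(size=(n_units, len(faces)))
psi = np.full(n_units, 1.0)                  # uniquenesses (assumed uniform)

def infer_z(x):
    """Posterior mean of the factors: (I + W' Psi^-1 W)^-1 W' Psi^-1 x."""
    A = np.eye(len(faces)) + W.T @ (W / psi[:, None])
    return np.linalg.solve(A, W.T @ (x / psi))

def morph(face_idx, strength):
    """A morph of the given strength towards (or, if negative, away from) a face."""
    return strength * W[:, face_idx] + rng.normal(size=n_units) * np.sqrt(psi)

def p_report(z, beta=8.0):
    """Softmax stand-in for the normalization pool; beta plays the role of the
    discrimination parameter."""
    e = np.exp(beta * (z - z.max()))
    return e / e.sum()

def p_adam(strength, adapt_to=None, adapt_strength=0.8, trials=1000):
    """Probability of reporting Adam for a test morph of Adam, optionally after
    adapting to an anti-face (reference-shift stand-in for the paper's manipulation)."""
    ref = infer_z(-adapt_strength * W[:, faces.index(adapt_to)]) if adapt_to else 0.0
    return float(np.mean([p_report(infer_z(morph(0, strength)) - ref)[0]
                          for _ in range(trials)]))

for s in (0.0, 0.1, 0.2, 0.4):
    print(f"Adam strength {s:.1f}: no adaptation {p_adam(s):.2f}, "
          f"after anti-Adam {p_adam(s, 'Adam'):.2f}")
```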

79 Figure 3B shows the model of the basic adaptation effect seen in figure 3A. [sent-161, score-0.472]

80 The results for two individual subjects presented in the paper [24] are just as extreme; other subjects may have had softer decision biases. [sent-164, score-0.224]

81 The dashed line shows that without the normalization pool, presenting anti-Henry does indeed boost reporting of Adam, when anti-Adam was the adapting stimulus. [sent-166, score-0.367]

82 However, under the above normalization, decreasing the Henry-related activity boosts the relative strengths of Jim and John (through the minimization in the normalization pool), allowing them to compete, and so reduces the propensity to report Adam (solid line). [sent-167, score-0.16]

83 5 Discussion. We have studied how plasticity associated with adaptation fits with regular unsupervised learning models, in particular factor analysis. [sent-168, score-0.736]

84 It was obvious that there should be a close relationship; this was, however, obscured by aspects of the redundancy reduction models such as the existence of multiple sources of added noise and non-informational constraints. [sent-169, score-0.314]

85 Uniquenesses in factor analysis are exactly the correct noise model for the simple information maximization scheme. [sent-170, score-0.256]

86 We illustrated the model for the case of a simple, linear model of the tilt aftereffect, and of adaptation in face discrimination. [sent-171, score-0.786]

87 Under this current conceptual scheme for adaptation, assumed changes in the input statistics are fully compensated for by the factor analysis model (and the linear and Gaussian nature of the model implies that the relevant parameters can be changed without any consequence for the generative or recognition models). [sent-173, score-0.232]

88 The dynamical form of the factor analysis model in equation 6 suggests other possible targets for adaptation. [sent-174, score-0.191]

89 Of particular interest is the possibility that the top-down weights and/or the uniquenesses might change whilst bottom-up weights remain constant. [sent-175, score-0.252]

90 The rationale for this comes from suggestive neurophysiological evidence that bottom-up pathways show delayed plasticity in certain circumstances [13]; and indeed it is exactly what happens in unsupervised learning techniques such as the wake-sleep algorithm. [sent-176, score-0.136]

91 Of course, factor analysis is insufficiently powerful to be an adequate model for cortical unsupervised learning or indeed all aspects of adaptation (as already evident in the limited range of applicability of the model of the tilt aftereffect). [sent-178, score-0.953]

92 For instance, it would seem unfortunate if all cells in primary visual cortex have to know the light level governing adaptation in order to be able correctly to interpret the information coming bottom-up from the thalamus. [sent-181, score-0.579]

93 In some cases, such as the approximate noise filter, there are alternative semantics for the adapted neural activity under which this is unnecessary; understanding how this generalizes is a major task for future work. [sent-182, score-0.117]

94 [4] Attneave, F (1954) Some informational aspects of visual perception. [sent-191, score-0.142]

95 [9] Campbell, FW & Maffei, L (1971) The tilt after-effect: a fresh look. [sent-202, score-0.207]

96 (2000) A functional angle on some after-effects in cortical vision, Proceedings of the Royal Society of London, Series B 267, 1705-1710. [sent-205, score-0.117]

97 [27] Maguire G, Hamasaki DI (1994) The retinal dopamine network alters the adaptational properties of retinal ganglion cells in the cat. [sent-235, score-0.132]

98 [37] Schwartz, O & Simoncelli, EP (2001) Natural signal statistics and sensory gain control. [sent-257, score-0.11]

99 [38] Shapley, R & Enroth-Cugell, C (1984) Visual adaptation and retinal gain control. [sent-259, score-0.509]

100 [41] Wainwright, MJ (1999) Visual adaptation as optimal information transmission. [sent-265, score-0.443]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('adam', 0.467), ('adaptation', 0.443), ('tilt', 0.207), ('uniquenesses', 0.186), ('henry', 0.148), ('aftereffect', 0.145), ('rt', 0.144), ('face', 0.136), ('redundancy', 0.134), ('strength', 0.129), ('subjects', 0.112), ('atick', 0.103), ('faces', 0.102), ('fa', 0.102), ('adapting', 0.099), ('factor', 0.099), ('lter', 0.098), ('reporting', 0.092), ('gure', 0.086), ('noise', 0.085), ('unsupervised', 0.081), ('retinal', 0.066), ('angle', 0.066), ('dayan', 0.066), ('barlow', 0.066), ('sensory', 0.065), ('orientation', 0.064), ('ffbw', 0.062), ('morphed', 0.062), ('wainwright', 0.061), ('solid', 0.061), ('jj', 0.061), ('reduction', 0.06), ('ie', 0.058), ('wq', 0.058), ('diag', 0.057), ('plasticity', 0.055), ('equation', 0.055), ('informational', 0.054), ('propensity', 0.054), ('visual', 0.053), ('ampli', 0.052), ('coming', 0.052), ('cortical', 0.051), ('hb', 0.049), ('psychometrika', 0.049), ('principal', 0.049), ('crosses', 0.049), ('input', 0.048), ('generative', 0.048), ('covariance', 0.047), ('df', 0.047), ('presenting', 0.047), ('normalization', 0.046), ('stimulus', 0.046), ('power', 0.045), ('pool', 0.045), ('discrimination', 0.045), ('signal', 0.045), ('da', 0.044), ('rr', 0.043), ('latent', 0.042), ('dashed', 0.042), ('circulant', 0.041), ('reskog', 0.041), ('zy', 0.041), ('abbott', 0.041), ('boost', 0.041), ('analysis', 0.037), ('decrease', 0.036), ('concreteness', 0.036), ('hhh', 0.036), ('leopold', 0.036), ('linsker', 0.036), ('whitening', 0.036), ('maximization', 0.035), ('aspects', 0.035), ('synaptic', 0.035), ('visible', 0.034), ('population', 0.033), ('sq', 0.033), ('weights', 0.033), ('grating', 0.033), ('jim', 0.033), ('lf', 0.033), ('adapted', 0.032), ('light', 0.031), ('wy', 0.031), ('boosts', 0.031), ('kg', 0.031), ('maneesh', 0.031), ('mean', 0.029), ('translation', 0.029), ('strengths', 0.029), ('tq', 0.029), ('progressively', 0.029), ('compete', 0.029), ('effect', 0.029), ('components', 0.029), ('matrix', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 18 nips-2002-Adaptation and Unsupervised Learning

Author: Peter Dayan, Maneesh Sahani, Gregoire Deback

Abstract: Adaptation is a ubiquitous neural and psychological phenomenon, with a wealth of instantiations and implications. Although a basic form of plasticity, it has, bar some notable exceptions, attracted computational theory of only one main variety. In this paper, we study adaptation from the perspective of factor analysis, a paradigmatic technique of unsupervised learning. We use factor analysis to re-interpret a standard view of adaptation, and apply our new model to some recent data on adaptation in the domain of face discrimination.

2 0.11911754 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior

Author: Patrik O. Hoyer, Aapo Hyvärinen

Abstract: The responses of cortical sensory neurons are notoriously variable, with the number of spikes evoked by identical stimuli varying significantly from trial to trial. This variability is most often interpreted as ‘noise’, purely detrimental to the sensory system. In this paper, we propose an alternative view in which the variability is related to the uncertainty, about world parameters, which is inherent in the sensory stimulus. Specifically, the responses of a population of neurons are interpreted as stochastic samples from the posterior distribution in a latent variable model. In addition to giving theoretical arguments supporting such a representational scheme, we provide simulations suggesting how some aspects of response variability might be understood in this framework.

3 0.09706375 103 nips-2002-How Linear are Auditory Cortical Responses?

Author: Maneesh Sahani, Jennifer F. Linden

Abstract: By comparison to some other sensory cortices, the functional properties of cells in the primary auditory cortex are not yet well understood. Recent attempts to obtain a generalized description of auditory cortical responses have often relied upon characterization of the spectrotemporal receptive field (STRF), which amounts to a model of the stimulusresponse function (SRF) that is linear in the spectrogram of the stimulus. How well can such a model account for neural responses at the very first stages of auditory cortical processing? To answer this question, we develop a novel methodology for evaluating the fraction of stimulus-related response power in a population that can be captured by a given type of SRF model. We use this technique to show that, in the thalamo-recipient layers of primary auditory cortex, STRF models account for no more than 40% of the stimulus-related power in neural responses.

4 0.09263368 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

Author: Peter Sykacek, Stephen J. Roberts

Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.

5 0.088331133 87 nips-2002-Fast Transformation-Invariant Factor Analysis

Author: Anitha Kannan, Nebojsa Jojic, Brendan J. Frey

Abstract: Dimensionality reduction techniques such as principal component analysis and factor analysis are used to discover a linear mapping between high dimensional data samples and points in a lower dimensional subspace. In [6], Jojic and Frey introduced mixture of transformation-invariant component analyzers (MTCA) that can account for global transformations such as translations and rotations, perform clustering and learn local appearance deformations by dimensionality reduction. However, due to the enormous computational requirements of the EM algorithm for learning the model, which grow rapidly with the dimensionality of a data sample, MTCA was not practical for most applications. In this paper, we demonstrate how fast Fourier transforms can dramatically reduce this computation. With this speedup, we show the effectiveness of MTCA in various applications - tracking, video textures, clustering video sequences, object recognition, and object detection in images.

6 0.085257821 10 nips-2002-A Model for Learning Variance Components of Natural Images

7 0.081040993 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions

8 0.079597838 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

9 0.078994945 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

10 0.07548704 26 nips-2002-An Estimation-Theoretic Framework for the Presentation of Multiple Stimuli

11 0.070850216 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons

12 0.07010705 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals

13 0.069407284 118 nips-2002-Kernel-Based Extraction of Slow Features: Complex Cells Learn Disparity and Translation Invariance from Natural Images

14 0.068075791 206 nips-2002-Visual Development Aids the Acquisition of Motion Velocity Sensitivities

15 0.067611836 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

16 0.066170312 153 nips-2002-Neural Decoding of Cursor Motion Using a Kalman Filter

17 0.063836806 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

18 0.063448392 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces

19 0.06272158 75 nips-2002-Dynamical Causal Learning

20 0.062697612 148 nips-2002-Morton-Style Factorial Coding of Color in Primary Visual Cortex


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.204), (1, 0.101), (2, 0.007), (3, 0.066), (4, -0.023), (5, 0.004), (6, -0.065), (7, -0.003), (8, 0.044), (9, 0.019), (10, 0.005), (11, -0.023), (12, 0.069), (13, 0.057), (14, 0.024), (15, 0.072), (16, 0.026), (17, -0.029), (18, 0.034), (19, -0.107), (20, 0.037), (21, -0.097), (22, 0.078), (23, 0.014), (24, -0.05), (25, 0.029), (26, 0.035), (27, -0.049), (28, 0.051), (29, -0.031), (30, -0.047), (31, -0.041), (32, 0.041), (33, -0.015), (34, 0.029), (35, -0.032), (36, 0.031), (37, -0.035), (38, -0.006), (39, -0.194), (40, -0.029), (41, 0.092), (42, 0.099), (43, 0.044), (44, 0.005), (45, 0.083), (46, 0.057), (47, 0.083), (48, 0.001), (49, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94755667 18 nips-2002-Adaptation and Unsupervised Learning

Author: Peter Dayan, Maneesh Sahani, Gregoire Deback

Abstract: Adaptation is a ubiquitous neural and psychological phenomenon, with a wealth of instantiations and implications. Although a basic form of plasticity, it has, bar some notable exceptions, attracted computational theory of only one main variety. In this paper, we study adaptation from the perspective of factor analysis, a paradigmatic technique of unsupervised learning. We use factor analysis to re-interpret a standard view of adaptation, and apply our new model to some recent data on adaptation in the domain of face discrimination.

2 0.70793217 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex

Author: Peter Dayan, Angela J. Yu

Abstract: Inference and adaptation in noisy and changing, rich sensory environments are rife with a variety of specific sorts of variability. Experimental and theoretical studies suggest that these different forms of variability play different behavioral, neural and computational roles, and may be reported by different (notably neuromodulatory) systems. Here, we refine our previous theory of acetylcholine’s role in cortical inference in the (oxymoronic) terms of expected uncertainty, and advocate a theory for norepinephrine in terms of unexpected uncertainty. We suggest that norepinephrine reports the radical divergence of bottom-up inputs from prevailing top-down interpretations, to influence inference and plasticity. We illustrate this proposal using an adaptive factor analysis model.

3 0.50724983 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory

Author: Kenneth J. Malmberg, René Zeelenberg, Richard M. Shiffrin

Abstract: The benzodiazepine Midazolam causes dense, but temporary, anterograde amnesia, similar to that produced by hippocampal damage. Does the action of Midazolam on the hippocampus cause less storage, or less accurate storage, of information in episodic long-term memory? We used a simple variant of the REM model [18] to fit data collected by Hirshman, Fisher, Henthorn, Arndt, and Passannante [9] on the effects of Midazolam, study time, and normative word frequency on both yes-no and remember-know recognition memory. That a simple strength model fit well was contrary to the expectations of Hirshman et al. More important, within the Bayesian-based REM modeling framework, the data were consistent with the view that Midazolam causes less accurate storage, rather than less storage, of information in episodic memory.

4 0.50253367 176 nips-2002-Replay, Repair and Consolidation

Author: Szabolcs Káli, Peter Dayan

Abstract: A standard view of memory consolidation is that episodes are stored temporarily in the hippocampus, and are transferred to the neocortex through replay. Various recent experimental challenges to the idea of transfer, particularly for human memory, are forcing its re-evaluation. However, although there is independent neurophysiological evidence for replay, short of transfer, there are few theoretical ideas for what it might be doing. We suggest and demonstrate two important computational roles associated with neocortical indices.

5 0.49719492 87 nips-2002-Fast Transformation-Invariant Factor Analysis

Author: Anitha Kannan, Nebojsa Jojic, Brendan J. Frey

Abstract: Dimensionality reduction techniques such as principal component analysis and factor analysis are used to discover a linear mapping between high dimensional data samples and points in a lower dimensional subspace. In [6], Jojic and Frey introduced mixture of transformation-invariant component analyzers (MTCA) that can account for global transformations such as translations and rotations, perform clustering and learn local appearance deformations by dimensionality reduction. However, due to the enormous computational requirements of the EM algorithm for learning the model, which grow rapidly with the dimensionality of a data sample, MTCA was not practical for most applications. In this paper, we demonstrate how fast Fourier transforms can dramatically reduce this computation. With this speedup, we show the effectiveness of MTCA in various applications - tracking, video textures, clustering video sequences, object recognition, and object detection in images.

6 0.4611769 153 nips-2002-Neural Decoding of Cursor Motion Using a Kalman Filter

7 0.45596421 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior

8 0.45035034 103 nips-2002-How Linear are Auditory Cortical Responses?

9 0.44960502 150 nips-2002-Multiple Cause Vector Quantization

10 0.44941556 71 nips-2002-Dopamine Induced Bistability Enhances Signal Processing in Spiny Neurons

11 0.44587132 26 nips-2002-An Estimation-Theoretic Framework for the Presentation of Multiple Stimuli

12 0.4412578 66 nips-2002-Developing Topography and Ocular Dominance Using Two aVLSI Vision Sensors and a Neurotrophic Model of Plasticity

13 0.43893552 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions

14 0.43394071 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

15 0.42976549 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

16 0.42137209 199 nips-2002-Timing and Partial Observability in the Dopamine System

17 0.41438875 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip

18 0.4131631 128 nips-2002-Learning a Forward Model of a Reflex

19 0.41270375 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

20 0.40204522 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.042), (42, 0.08), (54, 0.085), (55, 0.041), (64, 0.039), (67, 0.39), (68, 0.022), (74, 0.052), (92, 0.029), (98, 0.11)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93698019 71 nips-2002-Dopamine Induced Bistability Enhances Signal Processing in Spiny Neurons

Author: Aaron J. Gruber, Sara A. Solla, James C. Houk

Abstract: Single unit activity in the striatum of awake monkeys shows a marked dependence on the expected reward that a behavior will elicit. We present a computational model of spiny neurons, the principal neurons of the striatum, to assess the hypothesis that direct neuromodulatory effects of dopamine through the activation of D 1 receptors mediate the reward dependency of spiny neuron activity. Dopamine release results in the amplification of key ion currents, leading to the emergence of bistability, which not only modulates the peak firing rate but also introduces a temporal and state dependence of the model's response, thus improving the detectability of temporally correlated inputs.

same-paper 2 0.86332905 18 nips-2002-Adaptation and Unsupervised Learning

Author: Peter Dayan, Maneesh Sahani, Gregoire Deback

Abstract: Adaptation is a ubiquitous neural and psychological phenomenon, with a wealth of instantiations and implications. Although a basic form of plasticity, it has, bar some notable exceptions, attracted computational theory of only one main variety. In this paper, we study adaptation from the perspective of factor analysis, a paradigmatic technique of unsupervised learning. We use factor analysis to re-interpret a standard view of adaptation, and apply our new model to some recent data on adaptation in the domain of face discrimination.

3 0.82429469 107 nips-2002-Identity Uncertainty and Citation Matching

Author: Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser

Abstract: Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching—the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers.
EM is used to learn a naive-Bayes distribution over this vector for both matched and unmatched record pairs, so that the pairwise match probability can then be calculated using Bayes’ rule. Linkage decisions are typically made in a greedy fashion based on closest match and/or a probability threshold, so the overall process is order-dependent and may be inconsistent. The model does not provide for a principled way to combine matched records. A richer probability model is developed by Cohen et al [3], who model the database as a combination of some “original” records that are correct and some number of erroneous versions. They give an efficient greedy algorithm for finding a single locally optimal assignment of records into groups. Data association [4] is the problem of assigning new observations to existing trajectories when multiple objects are being tracked; it also arises in robot mapping when deciding if an observed landmark is the same as one previously mapped. While early data association systems used greedy methods similar to record linkage, recent systems have tried to find high-probability global solutions [5] or to approximate the true posterior over assignments [6]. The latter method has also been applied to the problem of stereo correspondence, in which a computer vision system must determine how to match up features observed in two or more cameras [7]. Data association systems usually have simple observation models (e.g., Gaussian noise) and assume that observations at each time step are all distinct. More general patterns of identity occur in natural language text, where the problem of anaphora resolution involves determining whether phrases (especially pronouns) co-refer; some recent work [8] has used an early form of relational probability model, although with a somewhat counterintuitive semantics. Citeseer is the best-known example of work on citation matching [1]. The system groups citations using a form of greedy agglomerative clustering based on a text similarity metric (see Section 6). McCallum et al [9] use a similar technique, but also develop clustering algorithms designed to work well with large numbers of small clusters (see Section 5). With the exception of [8], all of the preceding systems have used domain-specific algorithms and data structures; the probabilistic approaches are based on a fixed probability model. In previous work [10], we have suggested a declarative approach to identity uncertainty using a formal language—an extension of relational probability models [11]. Here, we describe the first substantial application of the approach. Section 2 explains how to specify a generative probability model of the domain. The key technical point (Section 3) is that the possible worlds include not only objects and relations but also mappings from terms in the language to objects in the domain, and the probability model must include a prior over such mappings. Once the extended model has been defined, Section 4 details the probability distributions used. A general-purpose inference method is applied to the model. We have found Markov chain Monte Carlo (MCMC) to be effective for this and other applications (see Section 5); here, we include a method for generating effective proposals based on ideas from [9]. The system also incorporates an EM algorithm for learning the local probability models, such as the model of how author names are abbreviated, reordered, and misspelt in citations. 
Section 6 evaluates the performance of four datasets originally used to test the Citeseer algorithms [1]. As well as providing significantly better performance, our system is able to reason simultaneously about papers, authors, titles, and publication types, and does a good job of extracting this information from the grouped citations. For example, an author’s name can be identified more accurately by combining information from multiple citations of several different papers. The errors made by our system point to some interesting unmodeled aspects of the citation process. 2 RPMs Reasoning about identity requires reasoning about objects, which requires at least some of the expressive power of a first-order logical language. Our approach builds on relational probability models (RPMs) [11], which let us specify probability models over possible worlds defined by objects, properties, classes, and relations. 2.1 Basic RPMs At its most basic, an RPM, as defined by Koller et al [12], consists of • A set C of classes denoting sets of objects, related by subclass/superclass relations. • A set I of named instances denoting objects, each an instance of one class. • A set A of complex attributes denoting functional relations. Each complex attribute A has a domain type Dom[A] ∈ C and a range type Range[A] ∈ C. • A set B of simple attributes denoting functions. Each simple attribute B has a domain type Dom[B] ∈ C and a range V al[B]. • A set of conditional probability models P (B|P a[B]) for the simple attributes. P a[B] is the set of B’s parents, each of which is a nonempty chain of (appropriately typed) attributes σ = A1 . · · · .An .B , where B is a simple attribute. Probability models may be attached to instances or inherited from classes. The parent links should be such that no cyclic dependencies are formed. • A set of instance statements, which set the value of a complex attribute to an instance of the appropriate class. We also use a slight variant of an additional concept from [11]: number uncertainty, which allows for multi-valued complex attributes of uncertain cardinality. We define each such attribute A as a relation rather than a function, and we associate with it a simple attribute #[A] (i.e., the number of values of A) with a domain type Dom[A] and a range {0, 1, . . . , max #[A]}. 2.2 RPMs for citations Figure 2 outlines an RPM for the example citations of Figure 1. There are four classes, the self-explanatory Author, Paper, and Citation, as well as AuthorAsCited, which represents not actual authors, but author names as they appear when cited. Each citation we wish to match leads to the creation of a Citation instance; instances of the remaining three classes are then added as needed to fill all the complex attributes. E.g., for the first citation of Figure 1, we would create a Citation instance C1 , set its text attribute to the string “Metral M. ...August 1994.”, and set its paper attribute to a newly created Paper instance, which we will call P1 . We would then introduce max(#[author]) (here only 3, for simplicity) AuthorAsCited instances (D11 , D12 , and D13 ) to fill the P1 .obsAuthors (i.e., observed authors) attribute, and an equal number of Author instances (A 11 , A12 , and A13 ) to fill both the P1 .authors[i] and the D1i .author attributes. (The complex attributes would be set using instance statements, which would then also constrain the cited authors to be equal to the authors of the actual paper. 
2 ) Assuming (for now) that the value of C1 .parse 2 Thus, uncertainty over whether the authors are ordered correctly can be modeled using probabilistic instance statements. A11 Author A12 surname #(fnames) fnames A13 A21 D11 AuthorAsCited surname #(fnames) fnames author A22 A23 D12 D13 D21 D22 Paper D23 Citation #(authors) authors title publication type P1 P2 #(obsAuthors) obsAuthors obsTitle parse C1 C2 text paper Figure 2: An RPM for our Citeseer example. The large rectangles represent classes: the dark arrows indicate the ranges of their complex attributes, and the light arrows lay out all the probabilistic dependencies of their basic attributes. The small rectangles represent instances, linked to their classes with thick grey arrows. We omit the instance statements which set many of the complex attributes. is observed, we can set the values of all the basic attributes of the Citation and AuthorAsCited instances. (E.g., given the correct parse, D11 .surname would be set to Lashkari, and D12 .fnames would be set to (Max)). The remaining basic attributes — those of the Paper and Author instances — represent the “true” attributes of those objects, and their values are unobserved. The standard semantics of RPMs includes the unique names assumption, which precludes identity uncertainty. Under this assumption, any two papers are assumed to be different unless we know for a fact that they are the same. In other words, although there are many ways in which the terms of the language can map to the objects in a possible world, only one of these identity mappings is legal: the one with the fewest co-referring terms. It is then possible to express the RPM as an equivalent Bayesian network: each of the basic attributes of each of the objects becomes a node, with the appropriate parents and probability model. RPM inference usually involves the construction of such a network. The Bayesian network equivalent to our RPM is shown in Figure 3. 3 IDENTITY UNCERTAINTY In our application, any two citations may or may not refer to the same paper. Thus, for citations C1 and C2 , there is uncertainty as to whether the corresponding papers P 1 and P2 are in fact the same object. If they are the same, they will share one set of basic attributes; A11. surname D12. #(fnames) D12. surname A11. fnames D11. #(fnames) D12. fnames A21. #(fnames) A13. surname A12. fnames A21. fnames A13. fnames A13. #(fnames) D13. surname D11. fnames D11. surname D13. #(fnames) C1. #(authors) P1. title C1. text P1. pubtype C1. obsTitle A21. surname A23. surname A22. fnames D22. #(fnames) D12. surname D21. #(fnames) D22. fnames A23. fnames A23. #(fnames) D23. surname D21. fnames D13. fnames C1. parse A22. #(fnames) A22. surname A12. #(fnames) A12. surname A11. #(fnames) D23. fnames D21. surname D23. #(fnames) C2. #(authors) P2. title C2. parse C2. text C2. obsTitle P2. pubtype Figure 3: The Bayesian network equivalent to our RPM, assuming C 1 = C2 . if they are distinct, there will be two sets. Thus, the possible worlds of our probability model may differ in the number of random variables, and there will be no single equivalent Bayesian network. The approach we have taken to this problem [10] is to extend the representation of a possible world so that it includes not only the basic attributes of a set of objects, but also the number of objects n and an identity clustering ι, that is, a mapping from terms in the language (such as P1 ) to objects in the world. 
We are interested only in whether terms co-refer or not, so ι can be represented by a set of equivalence classes of terms. For example, if P1 and P2 are the only terms, and they co-refer, then ι is {{P1 , P2 }}; if they do not co-refer, then ι is {{P1 }, {P2 }}. We define a probability model for the space of extended possible worlds by specifying the prior P (n) and the conditional distribution P (ι|n). As in standard RPMs, we assume that the class of every instance is known. Hence, we can simplify these distributions further by factoring them by class, so that, e.g., P (ι) = C∈C P (ιC ). We then distinguish two cases: • For some classes (such as the citations themselves), the unique names assumptions remains appropriate. Thus, we define P (ιCitation ) to assign a probability of 1.0 to the one assignment where each citation object is unique. • For classes such as Paper and Author, whose elements are subject to identity uncertainty, we specify P (n) using a high-variance log-normal distribution. 3 Then we make appropriate uniformity assumptions to construct P (ιC ). Specifically, we assume that each paper is a priori equally likely to be cited, and that each author is a priori equally likely to write a paper. Here, “a priori” means prior to obtaining any information about the object in question, so the uniformity assumption is entirely reasonable. With these assumptions, the probability of an assignment ι C,k,m that maps k named instances to m distinct objects, when C contains n objects, is given by 1 n! P (ιC,k,m ) = (n − m)! nk When n > m, the world contains objects unreferenced by any of the terms. However, these filler objects are obviously irrelevant (if they affected the attributes of some named term, they would have been named as functions of that term.) Therefore, we never have to create them, or worry about their attribute values. Our model assumes that the cardinalities and identity clusterings of the classes are independent of each other, as well as of the attribute values. We could remove these assumptions. For one, it would be straightforward to specify a class-wise dependency model for n or ι using standard Bayesian network semantics, where the network nodes correspond to the cardinality attributes of the classes. E.g., it would be reasonable to let the total number of papers depend on the total number of authors. Similarly, we could allow ι to depend on the attribute values—e.g., the frequency of citations to a given paper might depend on the fame of the authors—provided we did not introduce cyclic dependencies. 4 The Probability Model We will now fill in the details of the conditional probability models. Our priors over the “true” attributes are constructed off-line, using the following resources: the 1990 Census data on US names, a large A.I. BibTeX bibliography, and a hand-parsed collection of 500 citations. We learn several bigram models (actually, linear combinations of a bigram model and a unigram model): letter-based models of first names, surnames, and title words, as well as higher-level models of various parts of the citation string. More specifically, the values of Author.fnames and Author.surname are modeled as having a a 0.9 chance of being 3 Other models are possible; for example, in situations where objects appear and disappear, P (ι) can be modeled implicitly by specifying the arrival, transition, and departure rates [6]. drawn from the relevant US census file, and a 0.1 chance of being generated using a bigram model learned from that file. 
The prior over Paper.titles is defined using a two-tier bigram model constructed using the bibliography, while the distributions over Author.#(fnames), Paper.#(authors), and Paper.pubType 4 are derived from our hand-parsed file. The conditional distributions of the “observed” variables given their true values (i.e., the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames) are modeled as noisy channels where each letter, or word, has a small probability of being deleted, or, alternatively, changed, and there is also a small probability of insertion. AuthorAsCited.fnames may also be abbreviated as an initial. The parameters of the corruption models are learnt online, using stochastic EM. Let us now return to Citation.parse, which cannot be an observed variable, since citation parsing, or even citation subfield extraction, is an unsolved problem. It is therefore fortunate that our approach lets us handle uncertainty over parses so naturally. The state space of Citation.parse has two different components. First of all, it keeps track of the citation style, defined as the ordering of the author and title subfields, as well as the format in which the author names are written. The prior over styles is learned using our hand-segmented file. Secondly, it keeps track of the segmentation of Citation.text, which is divided into an author segment, a title segment, and three filler segments (one before, one after, and one in between.) We assume a uniform distribution over segmentations. Citation.parse greatly constrains Citation.text: the title segment of Citation.text must match the value of Citation.obsTitle, while its author segment must match the combined values of the simple attributes of Citation.obsAuthors. The distributions over the remaining three segments of Citation.text are defined using bigram models, with the model used for the final segment chosen depending on the publication type. These models were, once more, learned using our pre-segmented file. 5 INFERENCE With the introduction of identity uncertainty, our model grows from a single Bayesian network to a collection of networks, one for each possible value of ι. This collection can be rather large, since the number of ways in which a set can be partitioned grows very quickly with the size of the set. 5 Exact inference is, therefore, impractical. We use an approximate method based on Markov chain Monte Carlo. 5.1 MARKOV CHAIN MONTE CARLO MCMC [13] is a well-known method for approximating an expectation over some distribution π(x), commonly used when the state space of x is too large to sum over. The weighted sum over the values of x is replaced by a sum over samples from π(x), which are generated using a Markov chain constructed to have π(x) as a stationary distribution. There are several ways of building up an appropriate Markov chain. In the Metropolis– Hastings method (M-H), transitions in the chain are constructed in two steps. First, a candidate next state x is generated from the current state x, using the (more or less arbitrary) proposal distribution q(x |x). The probability that the move to x is actually made is the acceptance probability, defined as α(x |x) = min 1, π(x )q(x|x ) . π(x)q(x |x) Such a Markov chain will have the right stationary distribution π(x) as long as q is defined in such a way that the chain is ergodic. It is even possible to factor q into separate proposals for various subsets of variables. 
In those situations, the variables that are not changed by the transition cancel in the ratio π(x′)/π(x), so the required calculation can be quite simple.

5.2 THE CITATION-MATCHING ALGORITHM

The state space of our MCMC algorithm is the space of all the possible worlds, where each possible world contains an identity clustering ι, a set of class cardinalities n, and the values of all the basic attributes of all the objects. Since ι is given in each world, the distribution over the attributes can be represented using a Bayesian network as described in Section 3. Therefore, the probability of a state is simply the product of P(n), P(ι), and the probability of the hidden attributes of the network. Our algorithm uses a factored q function. One of our proposals attempts to change n using a simple random walk. The other suggests, first, a change to ι, and then, values for all the hidden attributes of all the objects (or clusters in ι) affected by that change. The algorithm for proposing a change in ι_C works as follows:

  Select two clusters a1, a2 ∈ ι_C (if the same cluster is picked twice, it will probably be split)
  Create two empty clusters b1 and b2
  Place each instance i ∈ a1 ∪ a2 u.a.r. into b1 or b2
  Propose ι′_C = ι_C − {a1, a2} ∪ {b1, b2}

Given a proposed ι′_C, suggesting values for the hidden attributes boils down to recovering their true values from (possibly) corrupt observations, e.g., guessing the true surname of the author currently known both as “Simth” and “Smith”. Since our title and name noise models are symmetric, our basic strategy is to apply these noise models to one of the observed values. In the case of surnames, we have the additional resource of a dictionary of common names, so, some of the time, we instead pick one of the set of dictionary entries that are within a few corruptions of our observed names. (One must, of course, be careful to account for this hybrid approach in the acceptance probability calculations.)

Parses are handled differently: we preprocess each citation, organizing its plausible segmentations into a list ordered in terms of descending probability. At runtime, we simply sample from these discrete distributions. Since we assume that boundaries occur only at punctuation marks, and discard segmentations of probability < 10^−6, the lists are usually quite short. (It would also be possible to sample directly from a model such as a hierarchical HMM.) The publication type variables, meanwhile, are not sampled at all. Since their range is so small, we sum them out.

5.3 SCALING UP

One of the acknowledged flaws of the MCMC algorithm is that it often fails to scale. In this application, as the number of papers increases, the simplest approach, one where the two clusters a1 and a2 are picked u.a.r., is likely to lead to many rejected proposals, as most pairs of clusters will have little in common. The resulting Markov chain will mix slowly. Clearly, we would prefer to focus our proposals on those pairs of clusters which are actually likely to exchange their instances. We have implemented an approach based on the efficient clustering algorithm of McCallum et al. [9], where a cheap distance metric is used to preprocess a large dataset and fragment it into many canopies, or smaller, overlapping sets of elements that have a non-zero probability of matching. We do the same, using word-matching as our metric, and setting the thresholds to 0.5 and 0.2.
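Returning to the split–merge proposal over ι_C described above, here is a minimal sketch of the cluster-redistribution step, under the assumption that ι_C is represented as a list of sets of co-referring terms. The names are ours; computing the forward and backward proposal probabilities, and restricting the choice of clusters to a canopy as described next, are omitted.

```python
import random

def propose_cluster_change(clusters):
    """Split/merge-style proposal over an identity clustering iota_C,
    represented as a list of sets of co-referring terms.

    Two clusters are chosen uniformly at random (possibly the same one,
    which tends to split it), their members are redistributed uniformly
    at random over two fresh clusters, and the modified clustering is
    returned; empty clusters are dropped."""
    i = random.randrange(len(clusters))
    j = random.randrange(len(clusters))
    b1, b2 = set(), set()
    for inst in clusters[i] | clusters[j]:
        (b1 if random.random() < 0.5 else b2).add(inst)
    proposal = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
    proposal += [c for c in (b1, b2) if c]
    return proposal

# Example: two citation clusters that might be merged into a single paper.
print(propose_cluster_change([{"Simth 1998"}, {"Smith 1998", "Smith 98"}]))
```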
Then, at runtime, our q(x′|x) function proposes first a canopy c, and then a pair of clusters u.a.r. from c. (q(x|x′) is calculated by summing over all the canopies which contain any of the elements of the two clusters.)

6 EXPERIMENTAL RESULTS

We have applied the MCMC-based algorithm to the hand-matched datasets used in [1]. (Each of these datasets contains several hundred citations of machine learning papers, about half of them in clusters ranging in size from two to twenty-one citations.) We have also implemented their phrase matching algorithm, a greedy agglomerative clustering method based on a metric that measures the degrees to which the words and phrases of any two citations overlap. (They obtain their “phrases” by segmenting each citation at all punctuation marks, and then taking all the bigrams of all the segments longer than two words.) The results of our comparison are displayed in Table 1, in terms of the Citeseer error metric. Clearly, the algorithm we have developed easily beats our implementation of phrase matching.

Table 1: Results on four Citeseer data sets, for the text matching and MCMC algorithms. The metric used is the percentage of actual citation clusters recovered perfectly; for the MCMC-based algorithm, this is an average over all the MCMC-generated samples.

                   Face             Reinforcement    Reasoning        Constraint
Data set size      349 citations,   406 citations,   514 citations,   295 citations,
                   242 papers       148 papers       296 papers       199 papers
Phrase matching    94%              79%              86%              89%
RPM + MCMC         97%              94%              96%              93%

We have also applied our algorithm to a large set of citations referring to the textbook Artificial Intelligence: A Modern Approach. It clusters most of them correctly, but there are a couple of notable exceptions. Whenever several citations share the same set of unlikely errors, they are placed together in a separate cluster. This occurs because we do not currently model the fact that erroneous citations are often copied from reference list to reference list, which could be handled by extending the model to include a copiedFrom attribute. Another possible extension would be the addition of a topic attribute to both papers and authors: tracking the authors’ research topics might enable the system to distinguish between similarly-named authors working in different fields. Generally speaking, we expect that relational probabilistic languages with identity uncertainty will be a useful tool for creating knowledge from raw data.

References

[1] S. Lawrence, K. Bollacker, and C. Lee Giles. Autonomous citation matching. In Agents, 1999.
[2] I. Fellegi and A. Sunter. A theory for record linkage. In JASA, 1969.
[3] W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In KDD, 2000.
[4] Y. Bar-Shalom and T. E. Fortman. Tracking and Data Association. Academic Press, 1988.
[5] I. J. Cox and S. Hingorani. An efficient implementation and evaluation of Reid’s multiple hypothesis tracking algorithm for visual tracking. In IAPR-94, 1994.
[6] H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many sensors. In IJCAI-99, 1999.
[7] F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun. Feature correspondence: A Markov chain Monte Carlo approach. In NIPS-00, 2000.
[8] E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. In AAAI, 1993.
[9] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD-00, 2000.
[10] H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In IJCAI-01, 2001.
[11] A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000.
[12] A. Pfeffer and D. Koller. Semantics and inference for recursive probability models. In AAAI/IAAI, 2000.
[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, 1996.

4 0.71732926 142 nips-2002-Maximum Likelihood and the Information Bottleneck

Author: Noam Slonim, Yair Weiss

Abstract: The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. Given a joint distribution over two variables, this method constructs a new variable that defines partitions over the values of one variable that are informative about the other. Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. In this paper, we ask: how are the two methods related? We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. We show that under this mapping the two problems are strongly related. In fact, for a uniform input distribution or for a large sample size, the problems are mathematically equivalent. Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. Moreover, the values of the functionals at the fixed points are equal under simple transformations. As a result, in these cases, every algorithm that solves one of the problems induces a solution for the other.

5 0.51839185 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex

Author: Peter Dayan, Angela J. Yu

Abstract: Inference and adaptation in noisy and changing, rich sensory environments are rife with a variety of specific sorts of variability. Experimental and theoretical studies suggest that these different forms of variability play different behavioral, neural and computational roles, and may be reported by different (notably neuromodulatory) systems. Here, we refine our previous theory of acetylcholine’s role in cortical inference in the (oxymoronic) terms of expected uncertainty, and advocate a theory for norepinephrine in terms of unexpected uncertainty. We suggest that norepinephrine reports the radical divergence of bottom-up inputs from prevailing top-down interpretations, to influence inference and plasticity. We illustrate this proposal using an adaptive factor analysis model.

6 0.48200557 199 nips-2002-Timing and Partial Observability in the Dopamine System

7 0.47615451 180 nips-2002-Selectivity and Metaplasticity in a Unified Calcium-Dependent Model

8 0.45897883 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

9 0.45748776 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior

10 0.45733953 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain

11 0.45722771 176 nips-2002-Replay, Repair and Consolidation

12 0.45511824 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

13 0.45430011 76 nips-2002-Dynamical Constraints on Computing with Spike Timing in the Cortex

14 0.45328164 3 nips-2002-A Convergent Form of Approximate Policy Iteration

15 0.45213714 10 nips-2002-A Model for Learning Variance Components of Natural Images

16 0.44817919 46 nips-2002-Boosting Density Estimation

17 0.44529808 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

18 0.44374961 12 nips-2002-A Neural Edge-Detection Model for Enhanced Auditory Sensitivity in Modulated Noise

19 0.44119209 163 nips-2002-Prediction and Semantic Association

20 0.44102323 128 nips-2002-Learning a Forward Model of a Reflex