nips nips2004 nips2004-170 knowledge-graph by maker-knowledge-mining

170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account


Source: pdf

Author: Aaron C. Courville, Nathaniel D. Daw, David S. Touretzky

Abstract: We propose a probabilistic, generative account of configural learning phenomena in classical conditioning. Configural learning experiments probe how animals discriminate and generalize between patterns of simultaneously presented stimuli (such as tones and lights) that are differentially predictive of reinforcement. Previous models of these issues have been successful more on a phenomenological than an explanatory level: they reproduce experimental findings but, lacking formal foundations, provide scant basis for understanding why animals behave as they do. We present a theory that clarifies seemingly arbitrary aspects of previous models while also capturing a broader set of data. Key patterns of data, e.g. concerning animals’ readiness to distinguish patterns with varying degrees of overlap, are shown to follow from statistical inference.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Similarity and discrimination in classical conditioning: A latent variable account Aaron C. [sent-1, score-0.26]

2 Configural learning experiments probe how animals discriminate and generalize between patterns of simultaneously presented stimuli (such as tones and lights) that are differentially predictive of reinforcement. [sent-10, score-0.586]

3 Previous models of these issues have been successful more on a phenomenological than an explanatory level: they reproduce experimental findings but, lacking formal foundations, provide scant basis for understanding why animals behave as they do. [sent-11, score-0.297]

4 1 Introduction Classical conditioning experiments probe how organisms learn to predict significant events such as the receipt of food or shock. [sent-16, score-0.239]

5 While there is a history of detailed quantitative theories about these experiments, only recently has there been a sustained attempt to understand them in terms of sound statistical prediction [1]. [sent-17, score-0.101]

6 A statistical foundation helps to identify key theoretical issues (such as uncertainty) underlying these experiments, to explain otherwise puzzling results, and to connect these behavioral theories with theories of neural computation, which are also increasingly framed in statistical terms. [sent-18, score-0.295]

7 A cluster of issues that has received great experimental and theoretical attention in conditioning — but not yet from a statistically grounded perspective — concerns discrimination and generalization between patterns of sensory input. [sent-19, score-0.361]

8 For example, a light and a tone each predict shock when presented alone, but not together. [sent-22, score-0.11]

9 While animals can learn such a discrimination, the seminal model of Rescorla and Wagner [2] cannot, since it assumes that the prediction is linear in the stimuli. [sent-23, score-0.208]

10 Traditionally, this problem was solved by introducing extra discriminative features to the model’s input (known as “configural units,” since they detect conjunctions of stimuli such as tone plus light), rendering the augmented problem linearly solvable [3]. [sent-24, score-0.301]
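
The linear-solvability point can be made concrete with a small illustrative sketch (the feature ordering and the least-squares solve below are choices of this sketch, not anything specified in the text): adding a conjunction feature for AB makes the XOR-style pattern A+, B+, AB− solvable by a linear predictor.

```python
import numpy as np

# Augmented "added elements" input for the task A+, B+, AB-: one column per
# element plus a configural column that is active only on the AB compound.
X = np.array([[1, 0, 0],    # A alone
              [0, 1, 0],    # B alone
              [1, 1, 1]],   # AB compound (configural unit active)
             dtype=float)
r = np.array([1.0, 1.0, 0.0])          # reinforcement on each trial type

w, *_ = np.linalg.lstsq(X, r, rcond=None)
print("weights [A, B, AB]:", np.round(w, 3))   # roughly [1, 1, -2]
print("predictions:", np.round(X @ w, 3))      # reproduces [1, 1, 0]
```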

11 On this foundation rests a wealth of work probing how animals learn and predict given compounds of stimuli. [sent-25, score-0.287]

12 The standard configural model is that of Pearce [4], in which responding to a compound is determined by previous experience with that and other similar compounds, through a process of generalization and weighted averaging. [sent-29, score-0.377]

13 Both theories match an impressive range of experimental data, but each is refuted by some experiments that the other captures. [sent-30, score-0.129]

14 Because the theories lack formal foundations, their details — particularly those on which they differ — are ad-hoc and poorly understood. [sent-32, score-0.101]

15 Here we leverage our Bayesian theory of conditioning [5] to shed new light on these issues. [sent-34, score-0.225]

16 Notably, analogizing conditioning to classification, we take a generative rather than a discriminative approach. [sent-36, score-0.161]

17 That is, we assume animals are modeling their complete sensory experience (lights, tones, and shocks) rather than only the chance of shock conditioned on lights and tones. [sent-37, score-0.35]

18 We assume that stimuli are correlated with each other, and with reinforcement, through shared latent variables. [sent-38, score-0.361]

19 Because a latent variable can trigger multiple events, these causes play a role akin to configural units in previous theories, but offer stronger normative guidance. [sent-39, score-0.358]

20 Such inferences also determine whether an animal’s experience on a trial is best explained by multiple causes interacting additively, in the style of Rescorla–Wagner, or by a single cause triggering multiple events like one of Pearce’s configural units. [sent-41, score-0.263]

21 Our theory is meant to shed light on the normative reasons why animals behave as they do, rather than on how they might carry out computations like those we describe. [sent-43, score-0.323]

22 In practice, the inferences we discuss can be computed only approximately, and we intend no claim that animals are using the same approximations to them as we are. [sent-44, score-0.206]

23 2 Theories of Learning with Compound Stimuli Classical conditioning experiments probe animals’ anticipation of a reinforcer R such as food or footshock, given the presentation of initially neutral stimuli such as lights and tones. [sent-46, score-0.593]

24 By studying responding as a function of the pattern of previous reinforcer / stimulus pairings, the experiments assess learning. [sent-48, score-0.357]

25 To describe a conditioning task abstractly, we use capital letters for the stimuli and + and − to indicate whether they are reinforced. [sent-49, score-0.34]

26 For instance, the XOR task can be written as A+, B+, AB−, where AB− denotes simultaneous presentation of both stimuli unreinforced. [sent-50, score-0.207]
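
As a purely illustrative aid (the helper and stimulus names below are hypothetical), the +/− trial notation can be mapped to binary stimulus vectors paired with a reinforcement flag, which is the representation the elemental models below operate on.

```python
STIMULI = ["A", "B"]    # elemental cues, e.g. a tone and a light (illustrative)

def encode(trial):
    """Map a trial label like 'A+' or 'AB-' to (stimulus vector, reinforcement)."""
    present = [1 if s in trial else 0 for s in STIMULI]
    reinforced = 1 if trial.endswith("+") else 0
    return present, reinforced

# The XOR task A+, B+, AB- as stimulus vectors:
for t in ["A+", "B+", "AB-"]:
    print(t, encode(t))
```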

27 Typically, each type of trial is delivered repeatedly, and the development of responding is assessed. [sent-51, score-0.218]

28 We now describe the treatment of compound stimuli in the models of Rescorla and Wagner [2] and Pearce [4]. [sent-52, score-0.347]

29 In both models, the set of stimuli present on a trial is converted into an input vector x. [sent-53, score-0.314]

30 The strength of the conditioned response is modeled as proportional to a prediction of reinforcement v = x · w, the dot product between the input and a weight vector. [sent-54, score-0.121]
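
A minimal sketch of this elemental prediction rule, paired with the standard Rescorla–Wagner delta-rule weight update (the learning rate and the update equation are the textbook form, not something stated in this summary). Trained on the XOR task with purely elemental inputs, the linear model cannot drive the AB prediction to zero, as claimed above.

```python
import numpy as np

def rw_train(trials, n_features, alpha=0.1, n_passes=50):
    """trials: list of (x, r) pairs; x is a binary stimulus vector, r the reinforcement."""
    w = np.zeros(n_features)
    for _ in range(n_passes):
        for x, r in trials:
            x = np.asarray(x, dtype=float)
            v = x @ w                     # prediction v = x . w
            w += alpha * (r - v) * x      # delta-rule update (assumed standard form)
    return w

trials = [([1, 0], 1.0), ([0, 1], 1.0), ([1, 1], 0.0)]   # A+, B+, AB-
w = rw_train(trials, n_features=2)
print("weights [A, B]:", np.round(w, 3))
print("prediction for AB:", round(float(np.array([1.0, 1.0]) @ w), 3))  # stays near 2/3, not 0
```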

31 In Pearce’s model, and in augmented “added elements” versions of the Rescorla–Wagner model [3], additional “configural” units are also included, corresponding to conjunctions of stimuli. [sent-57, score-0.139]

32 In particular, it is assumed that a unique configural unit is added for each stimulus compound observed, such as ABC. [sent-58, score-0.322]

33 This assumption seems both arbitrary (we might very well include elements for subcompounds such as AB) and unrealistic (given the profusion of uncontrolled stimuli simultaneously present in a real experiment). [sent-61, score-0.245]

34 The theories differ as to how they apportion activation over x and learning over w. [sent-62, score-0.141]

35 In the Rescorla–Wagner model, the input vector is binary: xi = 1 if the ith stimulus (or an exactly matching compound) is present, 0 otherwise. [sent-63, score-0.146]

36 The Pearce model instead spreads graded activation over x, based on a measure of similarity between the observed stimulus compound (or element) and the compounds represented by the model’s configural units. [sent-65, score-0.613]

37 In particular, if we denote the number of stimulus elements present in an observed stimulus pattern a as size(a), and in the pattern represented by the ith configural unit as size(i), then the activation of unit i by pattern a is given by xi = size(overlap(a, i))2 /(size(a) · size(i)). [sent-66, score-0.442]

38 The learning phase updates only the weight corresponding to the configural unit that exactly matches the observed stimulus configuration. [sent-67, score-0.182]
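
The following is a hedged sketch of the configural scheme just described: compounds are sets of elements, a configural unit is added the first time a pattern is observed, activation generalizes by the squared-overlap rule above, and only the exactly matching unit has its weight changed. The error-driven form of that weight change is our reading of the summary rather than a quoted equation.

```python
def activation(a, i):
    """Generalization from observed pattern a to configural unit i (both sets of elements)."""
    overlap = len(set(a) & set(i))
    return overlap ** 2 / (len(a) * len(i))

def predict(weights, pattern):
    """Generalized prediction: activation-weighted sum over all configural units."""
    return sum(activation(pattern, u) * w for u, w in weights.items())

def pearce_train(trials, beta=0.2, n_passes=100):
    weights = {}                                   # one weight per configural unit
    for _ in range(n_passes):
        for pattern, r in trials:
            unit = frozenset(pattern)
            weights.setdefault(unit, 0.0)          # add a unit when a pattern is first seen
            v = predict(weights, pattern)          # generalized prediction for this pattern
            weights[unit] += beta * (r - v)        # update only the exactly matching unit
    return weights

# Redhead & Pearce-style overlap task: A+, BC+, ABC-
trials = [({"A"}, 1.0), ({"B", "C"}, 1.0), ({"A", "B", "C"}, 0.0)]
w = pearce_train(trials)
for p in ["A", "BC", "ABC"]:
    print(p, round(predict(w, set(p)), 3))
```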

39 3 Data on Learning with Compound Stimuli Both the elemental and configural models reproduce a number of well known experimental phenomena. [sent-70, score-0.127]

40 Overshadowing When a pair of stimuli AB+ is reinforced together, then tested separately, responding to either individual stimulus is often attenuated compared to a control in which the stimulus is trained alone (A+). [sent-74, score-0.789]

41 Both models reproduce overshadowing, though Rescorla–Wagner incorrectly predicts that it takes at least two AB+ pairings to materialize. [sent-75, score-0.135]

42 Summation The converse of overshadowing is summation: when two stimuli are individually reinforced, then tested together, there is often a greater response to the pair than to either element alone. [sent-76, score-0.44]

43 In a recent variation by Rescorla [6], animals were trained on a pair of compounds AB+ and CD+, then responses were measured to the trained compounds, the individual elements A, B, etc. [sent-77, score-0.558]

44 The transfer compounds elicited a moderate response, and the individual stimuli produced the weakest responding. [sent-80, score-0.596]

45 1 In Pearce’s presentation of his model, these units are added only after elements are observed alone. [sent-81, score-0.096]

46 The added elements Rescorla–Wagner model predicts this result due to the linear summation of the influences of all the units (A through D, AB, and CD — note that the added configural units are crucial). [sent-83, score-0.345]

47 However, because of the normalization term in the generalization rule, Pearce’s model often predicts no summation. [sent-84, score-0.139]

48 Here it predicts equal responding to the individual stimuli and to the transfer compounds. [sent-85, score-0.476]

49 There is controversy as to whether the model can realistically be reconciled with summation effects [4, 7], but on the whole, these phenomena seem more parsimoniously explained with an elemental account. [sent-86, score-0.26]

50 Overlap A large number of experiments (see [4] for a review) demonstrate that the more elements shared by two compounds, the longer it takes animals to learn to discriminate between them. [sent-87, score-0.252]

51 Though this may seem intuitive, elemental theories predict the opposite. [sent-88, score-0.167]

52 In one example, Redhead and Pearce [8] presented subjects with the patterns A+, BC+ reinforced and ABC− unreinforced. [sent-89, score-0.146]

53 Differential responding between A and ABC was achieved in fewer trials than that between BC and ABC. [sent-90, score-0.111]

54 Pearce’s configural theory predicts this result because the extra overlap between BC and ABC (compared to A vs. ABC) causes each compound to activate the other’s configural unit more strongly. [sent-91, score-0.132] [sent-92, score-0.253]

56 Rescorla–Wagner predicts the opposite result, because compounds with more elements, such as ABC, sum the associative strengths of all of their elements. [sent-94, score-0.062]

57 4 A latent variable model of stimulus generalization In this section we present a generative model of how stimuli and reinforcers are jointly delivered. [sent-97, score-0.668]

58 We will show how the model may be used to estimate the conditional probability of reinforcement (the quantity we assume drives animals’ responding) given some pattern of observed stimuli. [sent-98, score-0.113]

59 The theory is based on the one we presented in [5], and casts conditioning as inference over a set of sigmoid belief networks. [sent-99, score-0.193]

60 1 A Sigmoid Belief Network Model of Conditioning Consider a vector of random variables S representing stimuli on a trial, with the jth stimulus present when Sj = 1 and absent when Sj = 0. [sent-102, score-0.383]

61 One element of S is distinguished as the reinforcer R; the remainder (lights and tones) is denoted as Stim. [sent-103, score-0.131]

62 We encode the correlations between all stimuli (including the reinforcer) through common connections to a vector of latent variables, or causes, x where xi ∈ {0, 1}. [sent-104, score-0.361]

63 According to the generative process, on each trial the state of the latent variables is determined by independent Bernoulli draws (each latent variable has a weight determining its chance of activation [5]). [sent-105, score-0.542]

64 The probability of stimulus j being present is then determined by its relationship to the latent variables: P(Sj | m, wm, x) = (1 + exp(−(wm^(j))^T x − wbias))^(−1), (1) where the weight vector wm^(j) encodes the connection strengths between x and Sj for the model structure m. [sent-106, score-0.653]
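
A minimal sketch of the observation model in Eq. (1): given a binary setting of the latent causes, each stimulus is present with a sigmoid probability of a weighted sum of the causes plus a bias. The weight values below are invented purely for illustration.

```python
import numpy as np

def p_stimulus_present(x, w_j, w_bias_j):
    """P(S_j = 1 | x) = sigmoid(w_j . x + w_bias_j), as in Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w_j, x) + w_bias_j)))

x = np.array([1, 0])             # latent cause 1 active, cause 2 inactive
w_light = np.array([3.0, 0.0])   # cause 1 strongly predicts the light (illustrative values)
print(p_stimulus_present(x, w_light, w_bias_j=-2.0))   # about 0.73
```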

65 We assume animals learn about the model structure itself, analogous to the experiencedependent introduction of configural units in previous theories. [sent-109, score-0.293]

66 In our theory, animals use experience to infer which network structures (from a set of candidates) and weights likely produced the observed stimuli and reinforcers. [sent-110, score-0.464]

67 2 Generalization: inference over latent variables Generalization between observed stimulus patterns is a key aspect of previous models. [sent-114, score-0.408]

68 Unlike Pearce’s rule, inference over x considers settings of the individual causes xi jointly (allowing for explaining away effects) and incorporates prior probabilities over each cause’s activation. [sent-119, score-0.099]

69 Nevertheless, the new rule broadly resembles its predecessor in that a cause is judged likely to be active (and contributes to predicting R) if the constellation of stimuli it predicts is similar to what is observed. [sent-120, score-0.334]
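
For a network this small, the inference over latent causes can be sketched by brute-force enumeration of the binary cause configurations, which is where explaining-away effects come from. The two-cause network, its weights, and the priors below are illustrative assumptions, not quantities taken from the paper.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

prior_active = np.array([0.3, 0.3])        # Bernoulli prior on each latent cause (assumed)
# weights[s]: (connection weights to the two causes, bias) for stimulus s (assumed values)
weights = {"A": (np.array([4.0, 0.0]), -2.0),
           "B": (np.array([0.0, 4.0]), -2.0),
           "R": (np.array([3.0, 3.0]), -2.0)}

def predict_R(observed):
    """P(R = 1 | Stim) by summing over all settings of the binary latent causes x."""
    num, den = 0.0, 0.0
    for x in itertools.product([0, 1], repeat=2):
        x = np.array(x)
        p = np.prod(np.where(x == 1, prior_active, 1 - prior_active))   # prior P(x)
        for s, v in observed.items():                                   # likelihood of Stim
            w, b = weights[s]
            p_on = sigmoid(w @ x + b)
            p *= p_on if v == 1 else (1 - p_on)
        w_r, b_r = weights["R"]
        num += p * sigmoid(w_r @ x + b_r)     # weight P(R=1 | x) by P(x, Stim)
        den += p
    return num / den

print("P(R | A present, B absent):", round(predict_R({"A": 1, "B": 0}), 3))
print("P(R | A and B present):   ", round(predict_R({"A": 1, "B": 1}), 3))
```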

70 3 Learning to discriminate: inference over models We treat the model weights wm and the model structure m as uncertain quantities subject to standard Bayesian inference. [sent-122, score-0.273]

71 Conditioning on the data D produces a posterior distribution over the weights, over which we integrate to predict R: P(R | Stim, m, D) = ∫ P(R | Stim, m, wm, D) P(wm | m, D) dwm, (3) Uncertainty over model structure is handled analogously. [sent-124, score-0.187]

72 The prior over models, P(m), is expressed as a distribution over nx, the number of latent variables, and over li, the number of links between the stimuli and each latent variable: P(m) = P(nx) ∏_{i=1}^{nx} P(li). [sent-126, score-0.615]
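
A hedged sketch of how these two pieces might be computed in practice: the structure prior factorizes over the number of latent variables and their link counts (the Poisson choices below are assumptions of this sketch, since the summary does not give the exact distributions), and the integral in Eq. (3) is approximated by averaging over posterior weight samples, e.g. drawn by MCMC.

```python
import math
import numpy as np

def poisson_logpmf(k, mean):
    # log probability of count k under a Poisson with the given mean
    return k * math.log(mean) - mean - math.lgamma(k + 1)

def log_structure_prior(n_latents, links_per_latent, mean_n=1.0, mean_links=2.0):
    """log P(m) = log P(n_x) + sum_i log P(l_i); Poisson factors are an assumption."""
    lp = poisson_logpmf(n_latents, mean_n)
    lp += sum(poisson_logpmf(l, mean_links) for l in links_per_latent)
    return lp

def predict_R_marginal(stim, weight_samples, p_r_given_w):
    """Monte Carlo form of Eq. (3): average P(R | Stim, m, w) over weight samples."""
    return float(np.mean([p_r_given_w(stim, w) for w in weight_samples]))

# Example: a structure with two latent variables, each linked to three stimuli.
print(log_structure_prior(n_latents=2, links_per_latent=[3, 3]))
```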

73 Progressively conditioning on experience to resolve prior uncertainty in the weights and model structure produces a gradual change in predictions akin to the incremental learning rules of previous models. [sent-133, score-0.288]

74 (a) Overshadowing (AB+): the predicted probability of reinforcement in response to presentations of the element A, the compound AB, and an individually trained control element (A+). [sent-148, score-0.394]

75 (b) Summation experiment (AB+, CD+): the predicted probability of reinforcement in response to separate presentations of the trained compounds (AB, CD), the transfer compounds (AD, BC), and the elements (A, B, etc.). [sent-149, score-0.821]

76 (c) Depiction of the MAP model structure after overshadowing training. [sent-151, score-0.221]

77 Together with the generalization effects discussed above, these inference effects explain why animals can learn more readily to discriminate stimulus compounds that have less overlap. [sent-155, score-0.756]

78 After 5 AB+ pairings, the network with highest posterior probability, depicted in (c), contains one latent variable correlated with both stimuli and the reinforcer. [sent-158, score-0.39]

79 Overall, this tradeoff decreases the chance that x1 is active, suppressing the prediction of reinforcement relative to the control treatment, where A is reinforced in isolation (A+). [sent-161, score-0.186]

80 Unlike the Rescorla–Wagner model, ours correctly predicts that overshadowing can occur after even a single AB+ presentation. [sent-162, score-0.229]

81 Summation Figure 1(b) shows our model’s performance on Rescorla’s AB+ CD+ summation and transfer experiment [6], which is one of several summation experiments our model explains. [sent-163, score-0.302]

82 Consistent with experimental findings, the model predicts greatest responding to the trained compounds (AB, CD), moderate responding to transfer compounds (AD, BC), and least responding to the elements (A, B, etc.). [sent-165, score-1.106]

83 The maximum a posteriori (MAP) model structure (Figure 1(d)) mimics the training compounds, with one latent variable connected to A, B, and R and another connected to C, D, and R. [sent-167, score-0.237]

84 The training compounds activate one latent variable strongly; the transfer compounds activate both latents weakly. [sent-169, score-0.809]

85 (a) Learning curves showing the predicted probability of reinforcement in response to separate presentations of A, BC, and ABC as a function of number of trial blocks. [sent-179, score-0.272]

86 (b) The average number of latent variables over the 10000 MCMC sample models. [sent-180, score-0.184]

87 (c) - (e) Representations of MAP model structures after training with 4, 10, and 20 trial blocks (edge widths represent mean weight strength). [sent-181, score-0.17]

88 The transfer compounds activate both latents weakly (together additively influencing the probability of reinforcement); the elements weakly activate only a single latent variable. [sent-182, score-0.306]

89 Overlap Figure 2(a) shows the model’s learning curves from the overlapping compound experiment, A+, BC+, ABC−. [sent-183, score-0.14]

90 Each trial block contains one trial of each type. [sent-184, score-0.214]

91 The model correctly predicts faster discrimination between A and ABC than between BC and ABC. [sent-185, score-0.138]

92 This pattern results from a progressive increase in the number of inferred latent variables (b). [sent-186, score-0.184]

93 Early in training, probability density concentrates on small models with a single latent variable correlating all stimuli and the reinforcer (c). [sent-187, score-0.534]

94 After more trials, models with two latent variables become more probable, one correlating A and R and the other correlating B and C with both A and R, attempting to capture both BC+ and ABC− trial types. [sent-188, score-0.379]

95 With further training, the most likely models are those with three latents, each encoding one trial type (e). [sent-190, score-0.107]

96 Our theory also improves on its predecessors in other ways; for instance, because it includes learning about stimulus interrelationships it can explain second-order conditioning [5], which is not addressed by either the Pearce or the Rescorla–Wagner accounts. [sent-199, score-0.307]

97 A full account of summation phenomena, in particular, is beyond the scope of the present model. [sent-201, score-0.102]

98 We treat reinforcer delivery as binary and model a limited, saturating summation in probabilities. [sent-202, score-0.229]

99 However, realistic summation almost certainly concerns reinforcement magnitudes as well (see, for example, [9]), and our model would need to be augmented to address them. [sent-203, score-0.242]

100 Because we have assumed that trials are IID, the model cannot yet account for effects of trial ordering. [sent-204, score-0.161]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gural', 0.452), ('rescorla', 0.277), ('compounds', 0.26), ('pearce', 0.251), ('wagner', 0.234), ('stimuli', 0.207), ('animals', 0.181), ('overshadowing', 0.167), ('latent', 0.154), ('stimulus', 0.146), ('ab', 0.141), ('compound', 0.14), ('conditioning', 0.133), ('wm', 0.133), ('abc', 0.131), ('bc', 0.117), ('responding', 0.111), ('trial', 0.107), ('summation', 0.102), ('stim', 0.102), ('theories', 0.101), ('reinforced', 0.1), ('reinforcer', 0.1), ('lights', 0.087), ('reinforcement', 0.086), ('cd', 0.078), ('con', 0.077), ('transfer', 0.071), ('probe', 0.066), ('elemental', 0.066), ('predicts', 0.062), ('units', 0.058), ('issues', 0.055), ('tones', 0.053), ('latents', 0.05), ('normative', 0.05), ('generalization', 0.05), ('nx', 0.05), ('experience', 0.049), ('discrimination', 0.049), ('patterns', 0.046), ('mcmc', 0.044), ('correlating', 0.044), ('presentations', 0.044), ('psychology', 0.043), ('animal', 0.043), ('overlap', 0.042), ('causes', 0.042), ('activation', 0.04), ('tone', 0.04), ('constellation', 0.04), ('pairings', 0.04), ('events', 0.04), ('behavioral', 0.038), ('phenomena', 0.038), ('elements', 0.038), ('light', 0.037), ('blocks', 0.036), ('unit', 0.036), ('activate', 0.035), ('response', 0.035), ('sj', 0.035), ('courville', 0.033), ('daw', 0.033), ('elicited', 0.033), ('gluck', 0.033), ('redhead', 0.033), ('shock', 0.033), ('wbias', 0.033), ('xor', 0.033), ('reproduce', 0.033), ('discriminate', 0.033), ('inference', 0.032), ('element', 0.031), ('variables', 0.03), ('myers', 0.029), ('additively', 0.029), ('quarterly', 0.029), ('variable', 0.029), ('ad', 0.029), ('theory', 0.028), ('generative', 0.028), ('experimental', 0.028), ('classical', 0.028), ('model', 0.027), ('weights', 0.027), ('effects', 0.027), ('structure', 0.027), ('trained', 0.027), ('augmented', 0.027), ('proportionally', 0.027), ('conjunctions', 0.027), ('rests', 0.027), ('attenuated', 0.027), ('shed', 0.027), ('individual', 0.025), ('active', 0.025), ('hippocampal', 0.025), ('inferences', 0.025), ('akin', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account

Author: Aaron C. Courville, Nathaniel D. Daw, David S. Touretzky

Abstract: We propose a probabilistic, generative account of configural learning phenomena in classical conditioning. Configural learning experiments probe how animals discriminate and generalize between patterns of simultaneously presented stimuli (such as tones and lights) that are differentially predictive of reinforcement. Previous models of these issues have been successful more on a phenomenological than an explanatory level: they reproduce experimental findings but, lacking formal foundations, provide scant basis for understanding why animals behave as they do. We present a theory that clarifies seemingly arbitrary aspects of previous models while also capturing a broader set of data. Key patterns of data, e.g. concerning animals’ readiness to distinguish patterns with varying degrees of overlap, are shown to follow from statistical inference.

2 0.095008947 140 nips-2004-Optimal Information Decoding from Neuronal Populations with Specific Stimulus Selectivity

Author: Marcelo A. Montemurro, Stefano Panzeri

Abstract: A typical neuron in visual cortex receives most inputs from other cortical neurons with a roughly similar stimulus preference. Does this arrangement of inputs allow efficient readout of sensory information by the target cortical neuron? We address this issue by using simple modelling of neuronal population activity and information theoretic tools. We find that efficient synaptic information transmission requires that the tuning curve of the afferent neurons is approximately as wide as the spread of stimulus preferences of the afferent neurons reaching the target neuron. By meta analysis of neurophysiological data we found that this is the case for cortico-cortical inputs to neurons in visual cortex. We suggest that the organization of V1 cortico-cortical synaptic inputs allows optimal information transmission. 1

3 0.088107586 163 nips-2004-Semi-parametric Exponential Family PCA

Author: Sajama Sajama, Alon Orlitsky

Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1

4 0.079157591 125 nips-2004-Multiple Relational Embedding

Author: Roland Memisevic, Geoffrey E. Hinton

Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1

5 0.07090041 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception

Author: Alan Stocker, Eero P. Simoncelli

Abstract: It has been demonstrated that basic aspects of human visual motion perception are qualitatively consistent with a Bayesian estimation framework, where the prior probability distribution on velocity favors slow speeds. Here, we present a refined probabilistic model that can account for the typical trial-to-trial variabilities observed in psychophysical speed perception experiments. We also show that data from such experiments can be used to constrain both the likelihood and prior functions of the model. Specifically, we measured matching speeds and thresholds in a two-alternative forced choice speed discrimination task. Parametric fits to the data reveal that the likelihood function is well approximated by a LogNormal distribution with a characteristic contrast-dependent variance, and that the prior distribution on velocity exhibits significantly heavier tails than a Gaussian, and approximately follows a power-law function.

6 0.068578266 124 nips-2004-Multiple Alignment of Continuous Time Series

7 0.066055693 185 nips-2004-The Convergence of Contrastive Divergences

8 0.063975044 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces

9 0.055225275 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture

10 0.052350417 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning

11 0.0501092 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

12 0.049543966 148 nips-2004-Probabilistic Computation in Spiking Populations

13 0.046948832 193 nips-2004-Theories of Access Consciousness

14 0.046720859 33 nips-2004-Brain Inspired Reinforcement Learning

15 0.04633227 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

16 0.046306372 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

17 0.04603504 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

18 0.044413917 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

19 0.044160973 19 nips-2004-An Application of Boosting to Graph Classification

20 0.038143165 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.128), (1, -0.059), (2, -0.009), (3, -0.026), (4, 0.004), (5, 0.05), (6, 0.022), (7, 0.027), (8, 0.079), (9, -0.012), (10, 0.028), (11, -0.024), (12, 0.034), (13, -0.028), (14, 0.189), (15, 0.02), (16, 0.002), (17, 0.025), (18, -0.038), (19, -0.062), (20, -0.04), (21, 0.087), (22, 0.085), (23, -0.144), (24, 0.01), (25, 0.001), (26, 0.071), (27, -0.109), (28, -0.03), (29, -0.061), (30, 0.034), (31, 0.018), (32, -0.099), (33, -0.009), (34, -0.164), (35, 0.031), (36, 0.093), (37, 0.01), (38, 0.083), (39, 0.023), (40, 0.013), (41, -0.075), (42, 0.027), (43, -0.003), (44, -0.149), (45, -0.015), (46, 0.063), (47, 0.13), (48, -0.039), (49, 0.098)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94211173 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account

Author: Aaron C. Courville, Nathaniel D. Daw, David S. Touretzky

Abstract: We propose a probabilistic, generative account of configural learning phenomena in classical conditioning. Configural learning experiments probe how animals discriminate and generalize between patterns of simultaneously presented stimuli (such as tones and lights) that are differentially predictive of reinforcement. Previous models of these issues have been successful more on a phenomenological than an explanatory level: they reproduce experimental findings but, lacking formal foundations, provide scant basis for understanding why animals behave as they do. We present a theory that clarifies seemingly arbitrary aspects of previous models while also capturing a broader set of data. Key patterns of data, e.g. concerning animals’ readiness to distinguish patterns with varying degrees of overlap, are shown to follow from statistical inference.

2 0.56781632 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception

Author: Alan Stocker, Eero P. Simoncelli

Abstract: It has been demonstrated that basic aspects of human visual motion perception are qualitatively consistent with a Bayesian estimation framework, where the prior probability distribution on velocity favors slow speeds. Here, we present a refined probabilistic model that can account for the typical trial-to-trial variabilities observed in psychophysical speed perception experiments. We also show that data from such experiments can be used to constrain both the likelihood and prior functions of the model. Specifically, we measured matching speeds and thresholds in a two-alternative forced choice speed discrimination task. Parametric fits to the data reveal that the likelihood function is well approximated by a LogNormal distribution with a characteristic contrast-dependent variance, and that the prior distribution on velocity exhibits significantly heavier tails than a Gaussian, and approximately follows a power-law function. Humans do not perceive visual motion veridically. Various psychophysical experiments have shown that the perceived speed of visual stimuli is affected by stimulus contrast, with low contrast stimuli being perceived to move slower than high contrast ones [1, 2]. Computational models have been suggested that can qualitatively explain these perceptual effects. Commonly, they assume the perception of visual motion to be optimal either within a deterministic framework with a regularization constraint that biases the solution toward zero motion [3, 4], or within a probabilistic framework of Bayesian estimation with a prior that favors slow velocities [5, 6]. The solutions resulting from these two frameworks are similar (and in some cases identical), but the probabilistic framework provides a more principled formulation of the problem in terms of meaningful probabilistic components. Specifically, Bayesian approaches rely on a likelihood function that expresses the relationship between the noisy measurements and the quantity to be estimated, and a prior distribution that expresses the probability of encountering any particular value of that quantity. A probabilistic model can also provide a richer description, by defining a full probability density over the set of possible “percepts”, rather than just a single value. Numerous analyses of psychophysical experiments have made use of such distributions within the framework of signal detection theory in order to model perceptual behavior [7]. Previous work has shown that an ideal Bayesian observer model based on Gaussian forms µ posterior low contrast probability density probability density high contrast likelihood prior a posterior likelihood prior v ˆ v ˆ visual speed µ b visual speed Figure 1: Bayesian model of visual speed perception. a) For a high contrast stimulus, the likelihood has a narrow width (a high signal-to-noise ratio) and the prior induces only a small shift µ of the mean v of the posterior. b) For a low contrast stimuli, the measurement ˆ is noisy, leading to a wider likelihood. The shift µ is much larger and the perceived speed lower than under condition (a). for both likelihood and prior is sufficient to capture the basic qualitative features of global translational motion perception [5, 6]. But the behavior of the resulting model deviates systematically from human perceptual data, most importantly with regard to trial-to-trial variability and the precise form of interaction between contrast and perceived speed. 
A recent article achieved better fits for the model under the assumption that human contrast perception saturates [8]. In order to advance the theory of Bayesian perception and provide significant constraints on models of neural implementation, it seems essential to constrain quantitatively both the likelihood function and the prior probability distribution. In previous work, the proposed likelihood functions were derived from the brightness constancy constraint [5, 6] or other generative principles [9]. Also, previous approaches defined the prior distribution based on general assumptions and computational convenience, typically choosing a Gaussian with zero mean, although a Laplacian prior has also been suggested [4]. In this paper, we develop a more general form of Bayesian model for speed perception that can account for trial-to-trial variability. We use psychophysical speed discrimination data in order to constrain both the likelihood and the prior function. 1 1.1 Probabilistic Model of Visual Speed Perception Ideal Bayesian Observer Assume that an observer wants to obtain an estimate for a variable v based on a measurement m that she/he performs. A Bayesian observer “knows” that the measurement device is not ideal and therefore, the measurement m is affected by noise. Hence, this observer combines the information gained by the measurement m with a priori knowledge about v. Doing so (and assuming that the prior knowledge is valid), the observer will – on average – perform better in estimating v than just trusting the measurements m. According to Bayes’ rule 1 p(v|m) = p(m|v)p(v) (1) α the probability of perceiving v given m (posterior) is the product of the likelihood of v for a particular measurements m and the a priori knowledge about the estimated variable v (prior). α is a normalization constant independent of v that ensures that the posterior is a proper probability distribution. ^ ^ P(v2 > v1) 1 + Pcum=0.5 0 a b Pcum=0.875 vmatch vthres v2 Figure 2: 2AFC speed discrimination experiment. a) Two patches of drifting gratings were displayed simultaneously (motion without movement). The subject was asked to fixate the center cross and decide after the presentation which of the two gratings was moving faster. b) A typical psychometric curve obtained under such paradigm. The dots represent the empirical probability that the subject perceived stimulus2 moving faster than stimulus1. The speed of stimulus1 was fixed while v2 is varied. The point of subjective equality, vmatch , is the value of v2 for which Pcum = 0.5. The threshold velocity vthresh is the velocity for which Pcum = 0.875. It is important to note that the measurement m is an internal variable of the observer and is not necessarily represented in the same space as v. The likelihood embodies both the mapping from v to m and the noise in this mapping. So far, we assume that there is a monotonic function f (v) : v → vm that maps v into the same space as m (m-space). Doing so allows us to analytically treat m and vm in the same space. We will later propose a suitable form of the mapping function f (v). An ideal Bayesian observer selects the estimate that minimizes the expected loss, given the posterior and a loss function. We assume a least-squares loss function. Then, the optimal estimate v is the mean of the posterior in Equation (1). It is easy to see why this model ˆ of a Bayesian observer is consistent with the fact that perceived speed decreases with contrast. 
The width of the likelihood varies inversely with the accuracy of the measurements performed by the observer, which presumably decreases with decreasing contrast due to a decreasing signal-to-noise ratio. As illustrated in Figure 1, the shift in perceived speed towards slow velocities grows with the width of the likelihood, and thus a Bayesian model can qualitatively explain the psychophysical results [1].

1.2 Two-Alternative Forced Choice Experiment

We would like to examine perceived speeds under a wide range of conditions in order to constrain a Bayesian model. Unfortunately, perceived speed is an internal variable, and it is not obvious how to design an experiment that would allow subjects to express it directly (although see [10] for an example of determining and even changing the prior of a Bayesian model for a sensorimotor task, where the estimates are more directly accessible). Perceived speed can only be accessed indirectly by asking the subject to compare the speeds of two stimuli. For a given trial, an ideal Bayesian observer in such a two-alternative forced choice (2AFC) experimental paradigm simply decides, on the basis of the two trial estimates v̂1 (stimulus1) and v̂2 (stimulus2), which stimulus moves faster. Each estimate v̂ is based on a particular measurement m. For a given stimulus with speed v, an ideal Bayesian observer will produce a distribution of estimates p(v̂|v) because m is noisy. Over trials, the observer's behavior can be described by classical signal detection theory based on the distributions of the estimates; hence, for example, the probability of perceiving stimulus2 as moving faster than stimulus1 is given as the cumulative probability

Pcum(v̂2 > v̂1) = ∫_0^∞ p(v̂2|v2) [ ∫_0^{v̂2} p(v̂1|v1) dv̂1 ] dv̂2   (2)

Pcum describes the full psychometric curve. Figure 2b illustrates the measured psychometric curve and its fit from such an experimental situation.

2 Experimental Methods

We measured matching speeds (Pcum = 0.5) and thresholds (Pcum = 0.875) in a 2AFC speed discrimination task. Subjects were presented simultaneously with two circular patches of horizontally drifting sine-wave gratings for the duration of one second (Figure 2a). Patches were 3 deg in diameter and were displayed at 6 deg eccentricity on either side of a fixation cross. The stimuli had an identical spatial frequency of 1.5 cycle/deg. One stimulus was considered the reference stimulus, having one of two different contrast values (c1 = [0.075 0.5]) and one of five different speed values (u1 = [1 2 4 8 12] deg/sec), while the second stimulus (test) had one of five different contrast values (c2 = [0.05 0.1 0.2 0.4 0.8]) and a varying speed that was determined by an interleaved staircase procedure. For each condition there were 96 trials. Conditions were randomly interleaved, including a random choice of stimulus identity (test vs. reference) and motion direction (right vs. left). Subjects were asked to fixate during stimulus presentation and select the faster moving stimulus. The threshold experiment differed only in that auditory feedback was given to indicate the correctness of the decision. This did not change the outcome of the experiment but significantly increased the quality of the data and thus reduced the number of trials needed.
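Equation (2), which defines the psychometric curve from which the matching speeds (Pcum = 0.5) and thresholds (Pcum = 0.875) above are read off, has a simple closed form once the estimate distributions are taken to be Gaussian in the m-space (i.e. LogNormal in the linear domain, as argued later in the paper). The sketch below assumes that form; the mapping constant v0, the shifts, and the widths are illustrative values, not the paper's fitted parameters.

```python
import numpy as np
from scipy.stats import norm

def f_m(v, v0=0.3):
    """Assumed mapping into m-space, vm = ln((v + v0)/v0) (cf. Eq. 11)."""
    return np.log((v + v0) / v0)

def p_cum(v1, v2, shift1, shift2, sigma1, sigma2, v0=0.3):
    """P(v_hat2 > v_hat1), the psychometric curve of Eq. (2).

    Both estimate distributions are assumed Gaussian in m-space, centered on the
    (shifted) perceived speeds; in that case the double integral reduces to a
    single Gaussian cumulative over the difference of the two estimates.
    """
    mean1 = f_m(v1, v0) - shift1               # perceived speed of stimulus1 in m-space
    mean2 = f_m(v2, v0) - shift2               # perceived speed of stimulus2 in m-space
    return norm.cdf((mean2 - mean1) / np.sqrt(sigma1 ** 2 + sigma2 ** 2))

# Illustrative psychometric curve: low-contrast reference at 4 deg/sec (large shift),
# higher-contrast test (small shift), with the test speed varied by the staircase.
v2_values = np.linspace(2.0, 8.0, 25)
curve = p_cum(4.0, v2_values, shift1=0.15, shift2=0.05, sigma1=0.2, sigma2=0.1)
# vmatch is where curve crosses 0.5 and vthresh where it crosses 0.875 (cf. Figure 2b).
```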
3 Analysis

With the data from the speed discrimination experiments we could in principle apply a parametric fit using Equation (2) to derive the prior and the likelihood, but the optimization is difficult, and the fit might not be well constrained given the amount of data we have obtained. The problem becomes much more tractable given the following weak assumptions:

• We consider the prior to be relatively smooth.
• We assume that the measurement m is corrupted by additive Gaussian noise with a variance whose dependence on stimulus speed and contrast is separable.
• We assume that there is a mapping function f(v): v → vm that maps v into the space of m (m-space). In that space, the likelihood is convolutional, i.e. the noise in the measurement directly defines the width of the likelihood.

These assumptions allow us to relate the psychophysical data to our probabilistic model in a simple way; a hedged sketch of the resulting bookkeeping follows this subsection. The following analysis is in the m-space. The point of subjective equality (Pcum = 0.5) is defined as the point at which the expected values of the speed estimates are equal. We write

E[v̂m,1] = E[v̂m,2], i.e. vm,1 − E[µ1] = vm,2 − E[µ2]   (3)

where E[µ] is the expected shift of the perceived speed compared to the veridical speed. For the discrimination threshold experiment, the above assumptions imply that the variance var[v̂m] of the speed estimates v̂m is equal for both stimuli. Then, (2) predicts that the discrimination threshold is proportional to the standard deviation, thus

vm,2 − vm,1 = γ √(var[v̂m])   (4)

where γ is a constant that depends on the threshold criterion Pcum and the exact shape of p(v̂m|vm).

[Figure 3: Piece-wise approximation. We perform a parametric fit by assuming the prior to be piece-wise linear and the likelihood to be LogNormal (Gaussian in the m-space).]

3.1 Estimating the prior and likelihood

In order to extract the prior and the likelihood of our model from the data, we have to find a generic local form of the prior and the likelihood and relate them to the mean and the variance of the speed estimates. As illustrated in Figure 3, we assume that the likelihood is Gaussian with a standard deviation σ(c, vm). Furthermore, the prior is assumed to be well-approximated by a first-order Taylor series expansion over the velocity range covered by the likelihood. We parameterize this linear expansion of the prior as p(vm) = a·vm + b. We can now derive a posterior for this local approximation of likelihood and prior and then define the perceived speed shift µ(m). The posterior can be written as

p(vm|m) = (1/α) p(m|vm) p(vm) = (1/α) [exp(−vm² / (2σ(c, vm)²)) (a·vm + b)]   (5)

where α is the normalization constant

α = ∫_{−∞}^{+∞} p(m|vm) p(vm) dvm = (b/2) √(2π σ(c, vm)²)   (6)

We can compute µ(m) as the first-order moment of the posterior for a given m. Exploiting the symmetries around the origin, we find

µ(m) = ∫_{−∞}^{+∞} vm p(vm|m) dvm ≡ (a(m)/b(m)) σ(c, vm)²   (7)

The expected value of µ(m) is equal to the value of µ at the expected value of the measurement m (which is the stimulus velocity vm), thus

E[µ] = µ(m)|_{m=vm} = (a(vm)/b(vm)) σ(c, vm)²   (8)

Similarly, we derive var[v̂m]. Because the estimator is deterministic, the variance of the estimate only depends on the variance of the measurement m. For a given stimulus, the variance of the estimate can be well approximated by

var[v̂m] = var[m] (∂v̂m(m)/∂m |_{m=vm})² = var[m] (1 − ∂µ(m)/∂m |_{m=vm})² ≈ var[m]   (9)

since, under the assumption of a locally smooth prior, the perceived velocity shift remains locally constant.
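The practical upshot of Equations (3), (4), (8) and (10) is that each experimental condition yields two local quantities: the likelihood width (from the discrimination threshold) and the perceived-speed bias in m-space (from the matching speed), which together constrain the local prior slope ratio a/b. A sketch of this bookkeeping, with an assumed value of v0 and a criterion constant γ computed for Gaussian estimate distributions, might look as follows; the paper itself performs a joint least-squares fit across all conditions rather than solving condition by condition.

```python
import numpy as np
from scipy.stats import norm

GAMMA = np.sqrt(2.0) * norm.ppf(0.875)     # ≈ 1.63 for Gaussian estimate distributions

def condition_summary(v_ref, v_match, threshold_m, v0=0.3):
    """Local quantities for one (contrast, speed) condition, expressed in m-space.

    v_ref       : reference speed [deg/sec]
    v_match     : matching speed of the test stimulus [deg/sec] (Pcum = 0.5)
    threshold_m : discrimination threshold expressed in m-space (Pcum = 0.875)
    """
    f = lambda v: np.log((v + v0) / v0)    # assumed mapping into m-space (Eq. 11)
    sigma = threshold_m / GAMMA            # Eqs. (4) and (10): threshold ∝ likelihood width
    bias = f(v_match) - f(v_ref)           # E[mu_test] − E[mu_ref], from Eq. (3)
    # Via Eq. (8), E[mu] = (a/b) * sigma**2, so sigma and bias together constrain the
    # local prior slope ratios a/b that enter the global least-squares fit.
    return sigma, bias

# Illustrative numbers only: a low-contrast test must move physically faster to match.
sigma, bias = condition_summary(v_ref=8.0, v_match=9.5, threshold_m=0.15)
```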
The variance of the perceived speed v̂m becomes equal to the variance of the measurement m, which is the variance of the likelihood (in the m-space), thus

var[v̂m] = σ(c, vm)²   (10)

Together with (3) and (4), the above derivations provide a simple dependency of the psychophysical data on the local parameters of the likelihood and the prior.

3.2 Choosing a logarithmic speed representation

We now want to choose the appropriate mapping function f(v) that maps v to the m-space. We define the m-space as the space in which the likelihood is Gaussian with a speed-independent width. We have shown that the discrimination threshold is proportional to the width of the likelihood (4), (10). Also, we know from the psychophysics literature that visual speed discrimination approximately follows a Weber-Fechner law [11, 12], i.e. the discrimination threshold increases roughly proportionally with speed, and so would the likelihood width. A logarithmic speed representation would be compatible with the data and our choice of the likelihood. Hence, we transform the linear speed domain v into a normalized logarithmic domain according to

vm = f(v) = ln((v + v0) / v0)   (11)

where v0 is a small normalization constant. The normalization is chosen to account for the expected deviation from equal-variance behavior at the low end of the speed range. Surprisingly, it has been found that neurons in the Medial Temporal area (Area MT) of macaque monkeys have speed-tuning curves that are very well approximated by Gaussians of constant width in the above normalized logarithmic space [13]. These neurons are known to play a central role in the representation of motion. It seems natural to assume that they are strongly involved in tasks such as our psychophysical experiments.

4 Results

Figure 4 shows the contrast-dependent shift of speed perception and the speed discrimination threshold data for two subjects. Data points connected with a dashed line represent the relative matching speed (v2/v1) for a particular contrast value c2 of the test stimulus as a function of the speed of the reference stimulus. Error bars are the empirical standard deviation of fits to bootstrapped samples of the data. Clearly, low contrast stimuli are perceived to move slower. The effect, however, varies across the tested speed range and tends to become smaller for higher speeds. The relative discrimination thresholds for two different contrasts as a function of speed show that the Weber-Fechner law holds only approximately. The data are in good agreement with other data from the psychophysics literature [1, 11, 8].

For each subject, data from both experiments were used to compute a parametric least-squares fit according to (3), (4), (7), and (10). In order to test the assumption of a LogNormal likelihood we allowed the standard deviation to be dependent on contrast and speed, thus σ(c, vm) = g(c)h(vm). We split the speed range into six bins (subject 2: five) and parameterized h(vm) and the ratio a/b accordingly. Similarly, we parameterized g(c) for the seven contrast values. The resulting fits are superimposed as bold lines in Figure 4. Figure 5 shows the fitted parametric values for g(c) and h(v) (plotted in the linear domain), and the reconstructed prior distribution p(v) transformed back to the linear domain. The approximately constant values for h(v) provide evidence that a LogNormal distribution is an appropriate functional description of the likelihood.
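The logarithmic mapping of Equation (11), used in all of the fits above, can be written down directly together with its inverse. The sketch below (with an assumed v0 = 0.3 deg/sec; the text does not report a specific value) also shows why a constant-width likelihood in m-space reproduces approximately Weber-Fechner-like thresholds in the linear domain, with deviations at low speeds controlled by v0.

```python
import numpy as np

V0 = 0.3  # assumed normalization constant [deg/sec]

def to_m_space(v, v0=V0):
    """Eq. (11): normalized logarithmic speed representation."""
    return np.log((v + v0) / v0)

def from_m_space(vm, v0=V0):
    """Inverse mapping back to the linear speed domain."""
    return v0 * (np.exp(vm) - 1.0)

# A fixed threshold in m-space corresponds to a roughly constant Weber fraction at
# high speeds and deviates from it at low speeds (the role of v0):
speeds = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0])        # [deg/sec]
delta_m = 0.15                                            # constant m-space threshold
linear_thresholds = from_m_space(to_m_space(speeds) + delta_m) - speeds
weber_fractions = linear_thresholds / speeds              # ≈ constant for v >> v0
```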
The resulting values for g(c) suggest that the likelihood width decays roughly exponentially with contrast, with strong saturation at higher contrasts.

[Figure 4: Speed discrimination data for two subjects. a) The relative matching speed of a test stimulus with different contrast levels (c2 = [0.05 0.1 0.2 0.4 0.8]) needed to achieve subjective equality with a reference stimulus (two different contrast values c1), plotted against the speed of the reference stimulus [deg/sec]. b) The relative discrimination threshold for two stimuli with equal contrast (c1,2 = [0.075 0.5]) as a function of stimulus speed [deg/sec].]

[Figure 5: Reconstructed prior distribution and parameters of the likelihood function, showing g(c) as a function of contrast, h(v) and the unnormalized prior p(v) as functions of speed [deg/sec] in the linear domain. The reconstructed priors for both subjects show much heavier tails than a Gaussian (dashed fit), approximately following a power-law function with exponent n ≈ −1.4 (bold line; subject 1: n = −1.41, subject 2: n = −1.35).]

5 Conclusions

We have proposed a probabilistic framework based on a Bayesian ideal observer and standard signal detection theory. We have derived a likelihood function and prior distribution for the estimator, with a fairly conservative set of assumptions, constrained by psychophysical measurements of speed discrimination and matching. The width of the resulting likelihood is nearly constant in the logarithmic speed domain, and decreases approximately exponentially with contrast. The prior expresses a preference for slower speeds and approximately follows a power-law distribution, thus has much heavier tails than a Gaussian. It would be interesting to compare the prior distributions derived here with measured true distributions of local image velocities that impinge on the retina. Although a number of authors have measured the spatio-temporal structure of natural images [e.g. 14], it is clearly difficult to extract from such measurements the true prior distribution because of the feedback loop formed through movements of the body, head and eyes.

Acknowledgments

The authors thank all subjects for their participation in the psychophysical experiments.

References

[1] P. Thompson. Perceived rate of movement depends on contrast. Vision Research, 22:377–380, 1982.
[2] L.S. Stone and P. Thompson. Human speed perception is contrast dependent. Vision Research, 32(8):1535–1549, 1992.
[3] A. Yuille and N. Grzywacz. A computational theory for the perception of coherent visual motion. Nature, 333(5):71–74, May 1988.
[4] Alan Stocker. Constraint Optimization Networks for Visual Motion Perception - Analysis and Synthesis. PhD thesis, Dept. of Physics, Swiss Federal Institute of Technology, Zürich, Switzerland, March 2002.
[5] Eero Simoncelli. Distributed analysis and representation of visual motion. PhD thesis, MIT, Dept. of Electrical Engineering, Cambridge, MA, 1993.
[6] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, June 2002.
[7] D.M. Green and J.A. Swets. Signal Detection Theory and Psychophysics. Wiley, New York, 1966.
[8] F. Hürlimann, D. Kiper, and M. Carandini. Testing the Bayesian model of perceived speed. Vision Research, 2002.
[9] Y. Weiss and D.J. Fleet. Velocity likelihoods in biological and machine vision. In Probabilistic Models of the Brain, pages 77–96. Bradford, 2002.
[10] K. Koerding and D. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(15):244–247, January 2004.
[11] Leslie Welch. The perception of moving plaids reveals two motion-processing stages. Nature, 337:734–736, 1989.
[12] S. McKee, G. Silvermann, and K. Nakayama. Precise velocity discrimination despite random variations in temporal frequency and contrast. Vision Research, 26(4):609–619, 1986.
[13] C.H. Anderson, H. Nover, and G.C. DeAngelis. Modeling the velocity tuning of macaque MT neurons. Journal of Vision/VSS abstract, 2003.
[14] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6:345–358, 1995.
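To connect the reconstructed prior in Figure 5 back to the conclusion that it is much heavier-tailed than a Gaussian, the following small sketch compares an unnormalized power-law prior with exponent n ≈ −1.4 (the value reported above) to a zero-mean Gaussian prior; the cut-off v_min and the Gaussian width are illustrative assumptions only.

```python
import numpy as np

def power_law_prior(v, n=-1.4, v_min=0.1):
    """Unnormalized power-law prior on speed, p(v) ∝ v**n for v >= v_min."""
    v = np.asarray(v, dtype=float)
    return np.where(v >= v_min, v ** n, v_min ** n)

def gaussian_prior(v, sigma=2.0):
    """Unnormalized zero-mean Gaussian prior on speed."""
    return np.exp(-0.5 * (np.asarray(v, dtype=float) / sigma) ** 2)

speeds = np.array([1.0, 5.0, 10.0, 20.0])      # [deg/sec]
ratio = power_law_prior(speeds) / gaussian_prior(speeds)
# The ratio grows rapidly with speed: the power-law prior assigns far more probability
# to fast speeds than the Gaussian does, i.e. it has much heavier tails.
```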

3 0.55348277 163 nips-2004-Semi-parametric Exponential Family PCA

Author: Sajama Sajama, Alon Orlitsky

Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1

4 0.48204601 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton

Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.

5 0.47954783 125 nips-2004-Multiple Relational Embedding

Author: Roland Memisevic, Geoffrey E. Hinton

Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1

6 0.4640429 124 nips-2004-Multiple Alignment of Continuous Time Series

7 0.45016161 140 nips-2004-Optimal Information Decoding from Neuronal Populations with Specific Stimulus Selectivity

8 0.44263256 193 nips-2004-Theories of Access Consciousness

9 0.39271227 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture

10 0.38917449 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

11 0.36142278 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning

12 0.34905601 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process

13 0.3464283 21 nips-2004-An Information Maximization Model of Eye Movements

14 0.34540999 185 nips-2004-The Convergence of Contrastive Divergences

15 0.33836648 141 nips-2004-Optimal sub-graphical models

16 0.31073812 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters

17 0.29755607 158 nips-2004-Sampling Methods for Unsupervised Learning

18 0.27868599 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

19 0.27561662 57 nips-2004-Economic Properties of Social Networks

20 0.25918239 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.07), (15, 0.109), (19, 0.011), (26, 0.062), (31, 0.028), (33, 0.129), (35, 0.056), (39, 0.034), (50, 0.044), (52, 0.017), (62, 0.305)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80551291 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account

Author: Aaron C. Courville, Nathaniel D. Daw, David S. Touretzky

Abstract: We propose a probabilistic, generative account of configural learning phenomena in classical conditioning. Configural learning experiments probe how animals discriminate and generalize between patterns of simultaneously presented stimuli (such as tones and lights) that are differentially predictive of reinforcement. Previous models of these issues have been successful more on a phenomenological than an explanatory level: they reproduce experimental findings but, lacking formal foundations, provide scant basis for understanding why animals behave as they do. We present a theory that clarifies seemingly arbitrary aspects of previous models while also capturing a broader set of data. Key patterns of data, e.g. concerning animals’ readiness to distinguish patterns with varying degrees of overlap, are shown to follow from statistical inference.

2 0.72014076 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

3 0.54255074 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units

Author: Eizaburo Doi, Michael S. Lewicki

Abstract: It has been suggested that the primary goal of the sensory system is to represent input in such a way as to reduce the high degree of redundancy. Given a noisy neural representation, however, solely reducing redundancy is not desirable, since redundancy is the only clue to reduce the effects of noise. Here we propose a model that best balances redundancy reduction and redundant representation. Like previous models, our model accounts for the localized and oriented structure of simple cells, but it also predicts a different organization for the population. With noisy, limited-capacity units, the optimal representation becomes an overcomplete, multi-scale representation, which, compared to previous models, is in closer agreement with physiological data. These results offer a new perspective on the expansion of the number of neurons from retina to V1 and provide a theoretical model of incorporating useful redundancy into efficient neural representations. 1

4 0.54145306 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

5 0.54142946 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

6 0.53965986 69 nips-2004-Fast Rates to Bayes for Kernel Machines

7 0.53914791 28 nips-2004-Bayesian inference in spiking neurons

8 0.53480923 178 nips-2004-Support Vector Classification with Input Data Uncertainty

9 0.53423649 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

10 0.53418493 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

11 0.53403473 168 nips-2004-Semigroup Kernels on Finite Sets

12 0.53365284 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

13 0.53342277 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees

14 0.53308898 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

15 0.53226614 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

16 0.53181505 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

17 0.53135049 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

18 0.53028589 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

19 0.53011262 68 nips-2004-Face Detection --- Efficient and Rank Deficient

20 0.52996212 163 nips-2004-Semi-parametric Exponential Family PCA