
114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models


Source: pdf

Author: Alexander T. Ihler, Padhraic Smyth

Abstract: Data sets that characterize human activity over time through collections of timestamped events or counts are of increasing interest in application areas such as human-computer interaction, video surveillance, and Web data analysis. We propose a non-parametric Bayesian framework for modeling collections of such data. In particular, we use a Dirichlet process framework for learning a set of intensity functions corresponding to different categories, which form a basis set for representing individual time-periods (e.g., several days) depending on which categories the time-periods are assigned to. This allows the model to learn in a data-driven fashion what “factors” are generating the observations on a particular day, including (for example) weekday versus weekend effects or day-specific effects corresponding to unique (single-day) occurrences of unusual behavior, sharing information where appropriate to obtain improved estimates of the behavior associated with each category. Applications to real–world data sets of count data involving both vehicles and people are used to illustrate the technique.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Data sets that characterize human activity over time through collections of timestamped events or counts are of increasing interest in application areas such as human-computer interaction, video surveillance, and Web data analysis. [sent-8, score-0.254]

2 In particular, we use a Dirichlet process framework for learning a set of intensity functions corresponding to different categories, which form a basis set for representing individual time-periods (e.g., several days) depending on which categories the time-periods are assigned to. [sent-10, score-0.181] [sent-12, score-0.128]

4 (1 Introduction) As sensor and storage technologies continue to improve in terms of both cost and performance, increasingly rich data sets are becoming available that characterize the rhythms of human activity over time. [sent-15, score-0.172]

5 Such data can be used to support a variety of different applications, such as classification of human or animal activities, detection of unusual events, or broad understanding of behavior in a particular context such as the temporal patterns of Web usage. [sent-17, score-0.209]

6 For example, Figure 1 shows several days' worth of data from a building log, smoothed so that the similarities in patterns are more readily visible. [sent-21, score-0.409]

7 Of interest is the modeling of the underlying intensity of the process generating the data, where intensity here refers to the rate at which events occur. [sent-22, score-0.399]

8 These processes are typically inhomogeneous in time (as in Figure 1), as they arise from the aggregated behavior of individuals, and thus exhibit a temporal dependence linked to the rhythms of the underlying human activity. [sent-23, score-0.195]

9 Formulating the underlying event generation as an inhomogeneous Poisson process is a common first step (see, e.g., [1, 4]), as it allows the application of various classic density estimation techniques to estimate the time-dependent intensity function (a normalized version of the rate function; see Section 2). [sent-25, score-0.17] [sent-27, score-0.175]

11 First, they allow us to represent and reason about uncertainty in the intensity function, providing not just a single estimate but a distribution over functions. [sent-30, score-0.129]

12 Figure 1: Count data from a building entry log (multiple days of data). [sent-34, score-0.418]

13 For example, behavior may depend not only on time of day but also on day of week, type of day (weekend or weekday), unobserved factors such as the weather, or other unusual circumstances. [sent-37, score-1.31]

14 In what follows we propose a non-parametric Bayesian framework for modeling intensity functions for event data over time. [sent-40, score-0.227]

15 In particular, we describe a Dirichlet process framework for learning the unknown rate functions, and learn a set of such functions corresponding to different categories. [sent-41, score-0.152]

16 Time-periods (e.g., individual days) are then represented as additive combinations of intensity functions, depending on which categories are assigned to each time-period. [sent-44, score-0.279]

17 This allows the model to learn in a data-driven fashion what “factors” are generating the observations on a particular day, including (for example) weekday versus weekend effects as well as day-specific effects corresponding to unusual behavior present only on a single day. [sent-45, score-0.499]

18 Applications to two real–world data sets, a building access log and accident statistics, are used to illustrate the technique. [sent-46, score-0.128]

19 Broadly speaking, from the viewpoint of modeling of inhomogeneous time-series of counts, our work extends the work of [4] to allow sharing of information among multiple, related processes (e.g., data from multiple days). [sent-48, score-0.298]

20 (2 Poisson processes) A common model for continuous-time event (counting) data is the Poisson process [8]. [sent-52, score-0.151]

21 As the discrete Poisson distribution is characterized by a rate parameter λ, the Poisson process is characterized by a rate function λ(t); it has the property that over any given time interval T, the number of events occurring within that interval is Poisson with rate λ = ∫_T λ(t) dt. [sent-53, score-0.268]
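The interval-count property above (events in an interval T are Poisson with mean ∫_T λ(t) dt) can be simulated by thinning. A minimal sketch, assuming a made-up bimodal daily rate profile — `lam` and `lam_max` are illustrative values, not from the paper:

```python
import math
import random

def lam(t):
    """Hypothetical entry rate (events/hour) with morning and lunchtime peaks."""
    return (40.0 * math.exp(-((t - 9.0) ** 2) / 2.0)
            + 60.0 * math.exp(-((t - 13.0) ** 2) / 0.5))

def sample_poisson_process(lam, lam_max, T, rng):
    """Thinning: draw candidates from a homogeneous process of rate lam_max,
    keep a candidate at time t with probability lam(t) / lam_max.
    lam_max must upper-bound lam(t) on [0, T]."""
    events, t = [], 0.0
    while True:
        t += rng.expovariate(lam_max)      # next candidate arrival time
        if t > T:
            return events
        if rng.random() < lam(t) / lam_max:
            events.append(t)

rng = random.Random(0)
events = sample_poisson_process(lam, lam_max=70.0, T=24.0, rng=rng)
```

The accepted points are a draw from the inhomogeneous process, so their count over any sub-interval is Poisson with the corresponding integrated rate.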

22 Let us suppose that we have a single collection of event times {τi} arising from a Poisson process with rate function λ(t), i.e., {τi} ∼ P[τ; λ(t)] (1). (Footnote: here, we shall use the term Poisson process interchangeably with inhomogeneous Poisson process, meaning that the rate is a non-constant function of time t.) [sent-55, score-0.208] [sent-57, score-0.182]

24 We may write λ(t) = γ f(t), where γ = ∫_{−∞}^{∞} λ(t) dt and f(t) is the intensity function, a normalized version of the rate function. [sent-59, score-0.175]

25 A Bayesian model places prior distributions on these quantities; by selecting a parametric prior for γ and a nonparametric prior for f (t), we obtain a semi-parametric prior for λ(t). [sent-60, score-0.162]

26 The Dirichlet process provides a nonparametric prior for f(t), such that (with probability one) f has the form of a mixture model with infinitely many components: f(t) = Σ_j wj K(t; θj). [sent-62, score-0.238]
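A truncated draw from this DP mixture prior can be sketched with Sethuraman's stick-breaking construction. The uniform base measure over kernel centres and widths below is a hypothetical stand-in for the paper's normal-Wishart base distribution G:

```python
import math
import random

def stick_breaking_intensity(alpha, M, rng):
    """Truncated stick-breaking draw: return (weights, kernels) for M components.
    Each kernel is a (mean, std) pair for a Gaussian K(t; theta_j)."""
    weights, kernels, remaining = [], [], 1.0
    for _ in range(M):
        v = rng.betavariate(1.0, alpha)    # stick-breaking proportion ~ Beta(1, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
        mu = rng.uniform(0.0, 24.0)        # kernel centre (hour of day) -- toy base measure
        sigma = rng.uniform(0.5, 3.0)      # kernel width -- toy base measure
        kernels.append((mu, sigma))
    return weights, kernels

def f(t, weights, kernels):
    """Evaluate the (truncated) mixture intensity f(t) = sum_j w_j K(t; theta_j)."""
    return sum(w * math.exp(-((t - mu) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
               for w, (mu, s) in zip(weights, kernels))

rng = random.Random(1)
weights, kernels = stick_breaking_intensity(alpha=1.0, M=20, rng=rng)
```

The truncation leaves a small amount of unassigned stick mass, which is why the M weights sum to slightly less than one.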

27 In particular, they have been the subject of recent work in modeling intensity functions for Poisson processes defined over time [4] and space–time [5]. [sent-67, score-0.212]

28 As in many mixture model applications, it will be helpful to create auxiliary assignment variables zi for each τi , indicating with which of the mixture components the sample τi is associated. [sent-78, score-0.156]

29 Specifically, although the posterior for γ has a simple closed form, p(γ|{τi}) ∼ Gamma(N + a, 1 + b), sampling from f is more complicated. [sent-81, score-0.118]
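The conjugate update for γ noted above is simple to state in code. A minimal sketch with illustrative values (a Gamma(a, b) shape/rate prior and N events observed in one observation window):

```python
# With a Gamma(a, b) prior on the total rate gamma and N observed events in
# one observation window, the posterior is Gamma(N + a, 1 + b), as stated in
# the text.  The values of N, a, b below are illustrative only.

def gamma_posterior(N, a, b):
    """Return (shape, rate) of the Gamma posterior over gamma."""
    return N + a, 1.0 + b

shape, rate = gamma_posterior(N=150, a=1.0, b=0.01)
posterior_mean = shape / rate   # shape/rate is the mean of a Gamma(shape, rate)
```

With a vague prior (small b), the posterior mean stays close to the observed count N, as expected for a conjugate update.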

30 Such exact sampling approaches work by exploiting the fact that only a finite number of the mixture components are occupied by the data; by treating the unoccupied clusters as a single group, the infinite number of potential associations can be treated as a finite number. [sent-84, score-0.27]

31 The operations involved (such as sampling values for θj given a collection of associated event times τi ) are easier for certain choices of K and G than others; for example using a Gaussian kernel and normal-Wishart distribution, the necessary quantities have convenient closed forms [9]. [sent-85, score-0.123]

32 Another, more brute-force way around the issue of having infinitely many mixture components is to perform approximate sampling using a “truncated” Dirichlet process representation [12, 13]. [sent-86, score-0.167]

33 First, we sample the occupied mixture weights wj (j ≤ J) and the total unoccupied weight w̄ = Σ_{j=J+1}^{∞} wj by drawing independent, Gamma-distributed random variables according to Γ(Nj, 1) and Γ(α, 1), respectively, and normalizing them to sum to one. [sent-94, score-0.289]

34 The weights wj of the unoccupied clusters (j > J) can then be sampled given w̄ using the stick-breaking representation of Sethuraman [14]. [sent-95, score-0.123]
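This weight-resampling step can be sketched as follows; the cluster counts and α are toy values, and the leftover mass w̄ would subsequently be stick-broken into the unoccupied components:

```python
import random

# Occupied components j <= J get independent Gamma(N_j, 1) draws, the total
# unoccupied mass gets a Gamma(alpha, 1) draw, and everything is normalised
# to sum to one (a standard Dirichlet construction via Gamma variables).

def resample_weights(counts, alpha, rng):
    """counts[j] = N_j, the number of events assigned to occupied component j.
    Returns (occupied weights w_j, total unoccupied weight w_bar)."""
    raw = [rng.gammavariate(n, 1.0) for n in counts]
    raw_unocc = rng.gammavariate(alpha, 1.0)
    total = sum(raw) + raw_unocc
    w = [x / total for x in raw]          # occupied weights
    w_bar = raw_unocc / total             # mass reserved for unoccupied clusters
    return w, w_bar

rng = random.Random(2)
w, w_bar = resample_weights(counts=[40, 25, 10], alpha=1.0, rng=rng)
```

Normalised Gamma draws are exactly a Dirichlet sample, which is why this reproduces the conditional posterior over the occupied weights.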

35 Note that the truncated DP approximation highlights the importance of also sampling α if we hope for our representation to behave non-parametrically, in the sense that it may grow more complex as the data increase, since for a fixed α the number of components M is quite insensitive to N. [sent-96, score-0.144]

36 Here, we take a slightly different approach, drawing truncated Gaussian kernels with parameters sampled from a truncated Normal-Wishart distribution. [sent-107, score-0.164]

37 If these processes are known to be identical and independent, sharing information among them is relatively easy—we obtain D observations Nd with which to estimate γ, and the τdi are collectively used to estimate f (t). [sent-119, score-0.245]

38 However, if these processes are not necessarily identical, sharing information becomes more difficult. [sent-120, score-0.166]

39 Clearly, there is a great deal of consistency in both size and shape, although not every day is exactly the same, and one or two stand out as different. [sent-123, score-0.367]

40 In this example, we might reasonably assume that the category memberships are known (for example, whether a given day is a weekday or weekend, or a Monday or Tuesday), though we shall relax this assumption in later sections. [sent-125, score-0.714]

41 Then, given a structure of potential relationships, what is a reasonable model for sharing information among categories? [sent-126, score-0.158]

42 Again, we initially assume that the category memberships are known; thus, if a category is associated with a particular day, the activity profile associated with that category will be observed, along with additional activity arising from each of the other categories present. [sent-130, score-1.014]

43 Let us associate a rate function λc (t) = γc fc (t) with each category in our model. [sent-131, score-0.279]

44 We define the rate function of a given day d to be the sum of the rate functions of each category to which d belongs. [sent-132, score-0.72]

45 Writing sdc = 1 to indicate that category c is present during day d, we have λd(t) = Σ_{c: sdc=1} λc(t). [sent-135, score-0.539]
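The additive construction λd(t) = Σ_{c: sdc=1} λc(t) can be illustrated with piecewise-constant hourly rates. All category rates and memberships below are made-up toy values, not from the data sets in the paper:

```python
# Piecewise-constant hourly rates for three hypothetical categories.
rates = {
    "all_days": [2.0] * 24,
    "weekdays": [0.0] * 7 + [20.0] * 10 + [0.0] * 7,   # activity 07:00-17:00
    "mondays":  [0.0] * 9 + [5.0] * 2 + [0.0] * 13,    # extra burst 09:00-11:00
}

# s_dc = 1 iff day d sees behaviour from category c.
membership = {
    "monday": {"all_days", "weekdays", "mondays"},
    "sunday": {"all_days"},
}

def day_rate(day, hour):
    """lambda_d(t): sum of lambda_c(t) over categories c with s_dc = 1."""
    return sum(rates[c][hour] for c in membership[day])

monday_10am = day_rate("monday", 10)    # all_days + weekdays + mondays
sunday_10am = day_rate("sunday", 10)    # all_days only
```

The "all or nothing" character of the model is visible here: a category's entire profile is either added to a day's rate or omitted, never partially blended.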

46 In fact, we do not want a model which is too flexible, such as a linear combination of patterns, since it is not physically meaningful to say, for example, that a day is only “part” Monday. [sent-138, score-0.367]

47 Since the categories have natural interpretations (e.g., things that happen every day versus only on weekdays or only on Mondays), it makes sense to take an "all or nothing" model where the pattern is either present or not. [sent-141, score-0.543]

48 Using an additive model allows both a consistent size and shape to emerge for each category, while attributing deviations from that profile to categories further down in the hierarchy. [sent-145, score-0.212]

49 We define the association as [ydi , zdi ], where ydi indicates which of the categories generated event τdi . [sent-147, score-0.384]

50 For example, if we create categories with assigned meanings (weekdays, weekends, Sundays, Mondays, and so on), a day which is nominally a Monday but falls under a holiday, closure, or other unusual circumstances may be completely different from other Monday profiles. [sent-153, score-0.651]

51 Similarly, a day with unusual extra activity (receptions, talks, etc.) may see behavior unique to its particular circumstances and warrant an additional category to represent it. [sent-154, score-0.664] [sent-155, score-0.257]

53 We can accommodate both these possibilities by also sampling the values of the membership indicator variables sdc, i.e., the binary indicators that day d sees behavior from category c. [sent-156, score-0.442] [sent-158, score-0.592]

55 To this end, let us assume we have some prior knowledge of these membership probabilities, pdc (sdc ); we may then re-sample from their posterior distributions at each iteration of MCMC. [sent-159, score-0.224]

56 This sampling step is difficult to do outside the truncated representation. [sent-160, score-0.144]

57 Although up until this point we could easily have elected to use, for example, the CRP formulation for sampling, the association variables {ydi , zdi } are tightly coupled with the memberships sdc since if any ydi = c we must have that sdc = 1. [sent-161, score-0.841]

58 Instead, to sample the sdc we condition on the truncated rate functions λc (t), with truncation depth M chosen to provide arbitrarily high precision. [sent-162, score-0.484]

59 The likelihood of the data under these rate functions for any values of {sdc} can then be computed directly via (2), where γ = Σ_c sdc γc and f(t) = γ^{−1} Σ_c sdc λc(t). [sent-163, score-0.696]

60 c In practice, we propose changing the value of each membership variable sdc individually given the others, though more complex moves could also be applied. [sent-164, score-0.452]

61 Sharing information among similar days gives greatly improved estimates of the rate functions, resolving otherwise obscured features such as the decrease during and increase subsequent to lunchtime. [sent-166, score-0.464]

62 The sampler state thus consists of the category magnitudes {γc} and a truncated representation of each fc(t), consisting of weights {wj} and parameters {θj}. [sent-167, score-0.288]

63 (4 Experiments) In this section we consider the application of our model to two data sets, one (mentioned previously) from the entry log of people entering a large campus building (produced by optical sensors at the front door), and the other from a log of vehicular traffic accidents. [sent-168, score-0.185]

64 To this end, we create categories for “all days”, “weekends”, “weekdays”, and “Sundays” through “Saturdays”. [sent-173, score-0.128]

65 Each of these categories has a high probability (pdc = . [sent-174, score-0.128]

66 To account for the possibility of unusual increases in activity, we also add categories unique to each day, with lower prior probability (pdc = . [sent-176, score-0.347]

67 This allows each day to add a new category when there is evidence of unusual activity, but discourages it otherwise. [sent-178, score-0.695]

68 (4.1 Building Entry Data) To see the improvement in the estimated rate functions when information is shared among similar days, Figure 2 shows results from three different days of the week (Sunday, Monday, Tuesday). [sent-180, score-0.59]

69 Each panel shows the profiles of each of the ten days estimated individually (using only that day's observations) under a Dirichlet process mixture model (dotted lines). [sent-181, score-0.583]

70 Superimposed in each panel is a single, black curve corresponding to the total profile for that day of week estimated using our categorical model; so, (a) shows the sum of the rate functions for “all days”, “weekends”, and “Sundays”, while (b) shows the sum of “all days”, “weekdays”, and “Mondays”. [sent-182, score-0.565]

71 First, by sharing several days' worth of observations, the model can produce a much more accurate estimate of the profiles. [sent-185, score-0.479]

72 In this case, no single day contains enough observations to be confident about the details of the rate function, so each individually–estimated rate function appears relatively smooth. [sent-186, score-0.554]

73 However, when information from other days is included, the rate function begins to resolve into a clearly bi-modal shape for weekdays. [sent-187, score-0.469]

74 This “bi-modal” rate behavior is quite real, and corresponds to the arrival of occupants in the morning (first mode), a lull during lunchtime, and a larger, narrower second peak as most occupants return from lunch. [sent-188, score-0.303]

75 Second, although Monday and Tuesday profiles appear similar, they also have distinct behavior, such as increased activity late Tuesday morning. [sent-189, score-0.141]

76 The breakdown of a particular day (the first Tuesday) into its component categories is shown in Figure 3. [sent-191, score-0.495]

77 As we might expect, there is little consistency between weekdays and weekends, quite a bit of similarity among weekdays and among just Tuesdays, and (for this particular day) very little to set it apart from other Tuesdays. [sent-192, score-0.484]

78 We can also check to see that the category memberships sdc are being used effectively. [sent-193, score-0.524]

79 Figure 4: Profiles associated with individual-day categories in the entry log data for several days with known events (periods between dashed vertical lines). [sent-201, score-0.545]

80 The model successfully learns which days have significant unusual activity and associates reasonable profiles with that activity (note that increases in entrance count rate typically occur shortly before or at the beginning of the event time). [sent-202, score-0.921]

81 Examining one such day, we find that it has near-zero probability of belonging to either the weekday or Monday categories, and uses only the all-day and unique categories. [sent-203, score-0.149]

82 We can also examine days which have high probability of requiring their own category (indicating unusual activity). [sent-204, score-0.65]

83 Again, all three days are estimated to have additional activity, and the period of time for that activity corresponds well with the actual start and end time shown in the schedule (dashed vertical lines). [sent-207, score-0.491]

84 (4.2 Vehicular Accident Data) Our second data set consists of a database of vehicular accident times recorded by North Carolina police departments. [sent-209, score-0.14]

85 As we might expect of driving patterns, there is still less activity on weekends, but far more than was observed in the campus building log. [sent-210, score-0.218]

86 As before, sharing information allows us to decrease our posterior uncertainty on the rate for any particular day. [sent-211, score-0.848]

87 Figure 5: Posterior mean and uncertainty for a single day of accident data, estimated individually (red) and with data sharing (black); sharing data considerably reduces the posterior uncertainty in the profile shape. [sent-212, score-0.12]

88 Figure 5 quantifies this idea by showing the posterior means and (point-wise) two-sigma confidence intervals for the rate function estimated for the same day (the first Monday in the data set) using that day's data only (red curves) and using the category-based additive model (black). [sent-213, score-0.139] [sent-214, score-0.517]

90 The additive model leverages the additional data to produce much tighter estimates of the rate profile. [sent-215, score-0.192]

91 As in Figure 2, sharing information helps resolve features which the individual days do not have enough data to reliably estimate. [sent-218, score-0.51]

92 This also helps make the pattern of deviation on Friday clear, showing (as we would expect) increased activity at night. [sent-220, score-0.17]

93 In this paper we proposed a non-parametric Bayesian approach to learning time-intensity profiles for such activity data, based on an inhomogeneous Poisson process framework. [sent-222, score-0.25]

94 The model allows time-periods (e.g., days) to be grouped together by category (day of week, weekday/weekend, etc.). [sent-225, score-0.172]

95 When the categorization of days is not a priori certain (e.g., days that fall on a holiday or days with unusual non-recurring additional activity), the model can infer the appropriate categorization, allowing (for example) automated detection of unusual events. [sent-227, score-0.322] [sent-229, score-1.007]

97 On two large real-world data sets the model was able to infer interpretable activity profiles that correspond to real-world phenomena. [sent-230, score-0.141]

98 Bayesian nonparametric mixture modeling for the intensity function of non-homogeneous Poisson processes. [sent-257, score-0.23]

99 Bayesian mixture modeling for spatial Poisson process intensities, with applications to extreme value analysis. [sent-264, score-0.134]

100 Markov chain sampling methods for Dirichlet process mixture models. [sent-297, score-0.167]


similar papers computed by tfidf model


similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999893 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models


2 0.12095592 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models

Author: Michael C. Mozer, Michael Shettel, Michael P. Holmes

Abstract: Categorization is a central activity of human cognition. When an individual is asked to categorize a sequence of items, context effects arise: categorization of one item influences category decisions for subsequent items. Specifically, when experimental subjects are shown an exemplar of some target category, the category prototype appears to be pulled toward the exemplar, and the prototypes of all nontarget categories appear to be pushed away. These push and pull effects diminish with experience, and likely reflect long-term learning of category boundaries. We propose and evaluate four principled probabilistic (Bayesian) accounts of context effects in categorization. In all four accounts, the probability of an exemplar given a category is encoded as a Gaussian density in feature space, and categorization involves computing category posteriors given an exemplar. The models differ in how the uncertainty distribution of category prototypes is represented (localist or distributed), and how it is updated following each experience (using a maximum likelihood gradient ascent, or a Kalman filter update). We find that the distributed maximum-likelihood model can explain the key experimental phenomena. Further, the model predicts other phenomena that were confirmed via reanalysis of the experimental data. Categorization is a key cognitive activity. We continually make decisions about characteristics of objects and individuals: Is the fruit ripe? Does your friend seem unhappy? Is your car tire flat? When an individual is asked to categorize a sequence of items, context effects arise: categorization of one item influences category decisions for subsequent items. Intuitive naturalistic scenarios in which context effects occur are easy to imagine. For example, if one lifts a medium-weight object after lifting a light-weight or heavy-weight object, the medium weight feels heavier following the light weight than following the heavy weight. 
Although the object-contrast effect might be due to fatigue of sensory-motor systems, many context effects in categorization are purely cognitive and cannot easily be attributed to neural habituation. For example, if you are reviewing a set of conference papers, and the first three in the set are dreadful, then even a mediocre paper seems like it might be above threshold for acceptance. Another example of a category boundary shift due to context is the following. Suppose you move from San Diego to Pittsburgh and notice that your neighbors repeatedly describe muggy, somewhat overcast days as ”lovely.” Eventually, your notion of what constitutes a lovely day accommodates to your new surroundings. As we describe shortly, experimental studies have shown a fundamental link between context effects in categorization and long-term learning of category boundaries. We believe that context effects can be viewed as a reflection of a trial-to-trial learning, and the cumulative effect of these trial-to-trial modulations corresponds to what we classically consider to be category learning. Consequently, any compelling model of category learning should also be capable of explaining context effects. 1 Experimental Studies of Context Effects in Categorization Consider a set of stimuli that vary along a single continuous dimension. Throughout this paper, we use as an illustration circles of varying diameters, and assume four categories of circles defined ranges of diameters; call them A, B, C, and D, in order from smallest to largest diameter. In a classification paradigm, experimental subjects are given an exemplar drawn from one category and are asked to respond with the correct category label (Zotov, Jones, & Mewhort, 2003). After making their response, subjects receive feedback as to the correct label, which we’ll refer to as the target. 
In a production paradigm, subjects are given a target category label and asked to produce an exemplar of that category, e.g., using a computer mouse to indicate the circle diameter (Jones & Mewhort, 2003). Once a response is made, subjects receive feedback as to the correct or true category label for the exemplar they produced. Neither classification nor production task has sequential structure, because the order of trial is random in both experiments. The production task provides direct information about the subjects’ internal representations, because subjects are producing exemplars that they consider to be prototypes of a category, whereas the categorization task requires indirect inferences to be made about internal representations from reaction time and accuracy data. Nonetheless, the findings in the production and classification tasks mirror one another nicely, providing converging evidence as to the nature of learning. The production task reveals how mental representations shift as a function of trial-to-trial sequences, and these shifts cause the sequential pattern of errors and response times typically observed in the classification task. We focus on the production task in this paper because it provides a richer source of data. However, we address the categorization task with our models as well. Figure 1 provides a schematic depiction of the key sequential effects in categorization. The horizontal line represents the stimulus dimension, e.g., circle diameter. The dimension is cut into four regions labeled with the corresponding category. The category center, which we’ll refer to as the prototype, is indicated by a vertical dashed line. The long solid vertical line marks the current exemplar—whether it is an exemplar presented to subjects in the classification task or an exemplar generated by subjects in the production task. 
Following an experimental trial with this exemplar, category prototypes appear to shift: the target-category prototype moves toward the exemplar, which we refer to as a pull effect, and all nontarget-category prototypes move away from the exemplar, which we refer to as a push effect. Push and pull effects are assessed in the production task by examining the exemplar produced on the following trial, and in the categorization task by examining the likelihood of an error response near category boundaries. The set of phenomena to be explained are as follows, described in terms of the production task. All numerical results referred to are from Jones and Mewhort (2003). This experiment consisted of 12 blocks of 40 trials, with each category label given as target 10 times within a block. • Within-category pull: When a target category is repeated on successive trials, the exemplar generated on the second trial moves toward the exemplar generated on the first trial, with respect to the true category prototype. Across the experiment, a correlation coefficient of 0.524 is obtained, and remains fairly constant over trials. • Between-category push: When the target category changes from one trial to the next, the exemplar generated on the second trial moves away from the exemplar generated on the first trial (or equivalently, from the prototype of the target category on the first trial). Figure 2a summarizes the sequential push effects from Jones and Mewhort. The diameter of the circle produced on trial t is plotted as a function of the target category on trial t − 1, with one line for each of the four trial t targets. The mean diameter for each target category is subtracted out, so the absolute vertical offset of each line is unimportant. 
The main feature of the data to note is that all four curves have a negative slope, which has the following meaning: the smaller that target t − 1 is (i.e., the further to the left on the x axis in Figure 1), the larger the response to target t is (further to the right in Figure 1), and vice versa, reflecting a push away from target t − 1. Interestingly and importantly, the magnitude of the push increases with the ordinal distance between targets t − 1 and t. Figure 2a is based on data from only eight subjects and is therefore noisy, though the effect is statistically reliable. As further evidence, Figure 2b shows data from a categorization task (Zotov et al., 2003), where the y-axis is a different dependent measure, but the negative slope has the same interpretation as in Figure 2a.

Figure 1: Schematic depiction of sequential effects in categorization (categories A–D along the stimulus dimension).

Figure 2: Push effect data: response deviation (response bias for panel (b)) as a function of the previous category label, from (a) the production task of Jones and Mewhort (2003), (b) the classification task of Zotov et al. (2003), and (c)–(f) the models proposed in this paper: (c) KFU-local, (d) KFU-distrib, (e) MLGA-local, (f) MLGA-distrib. The y axis is the deviation of the response from the mean, as a proportion of the total category width. The response to category A is solid red, B is dashed magenta, C is dash-dotted blue, and D is dotted green.

• Push and pull effects are not solely a consequence of errors or experimenter feedback. In quantitative estimation of push and pull effects, trial t is included in the data only if the response on trial t − 1 is correct. Thus, the effects follow trials in which no error feedback is given to the subjects, and therefore the adjustments are not due to explicit error correction.

• Push and pull effects diminish over the course of the experiment. The magnitude of push effects can be measured by the slope of the regression lines fit to the data in Figure 2a. The slopes become shallower over successive trial blocks. The magnitude of pull effects can be measured by the standard deviation (SD) of the produced exemplars, which also decreases over successive trial blocks.

• Accuracy increases steadily over the course of the experiment, from 78% correct responses in the first block to 91% in the final block. This improvement occurs despite the fact that error feedback is relatively infrequent and becomes even less frequent as performance improves.

2 Four Models

In this paper, we explore four probabilistic (Bayesian) models to explain the data described in the previous section. The key phenomenon to explain turns out to be the push effect, for which three of the four models fail to account. Modelers typically discard the models that they reject and present only their pet model. In this work, we find it useful to report on the rejected models for three reasons. First, they help to set up and motivate the one successful model. Second, they include several obvious candidates, which we are therefore obliged to address. Third, in order to evaluate a model that can explain certain data, one needs to know the degree to which the data constrain the space of models.
If many models exist that are consistent with the data, one has little reason to prefer our pet candidate.

Underlying all of the models is a generative probabilistic framework in which a category i is represented by a prototype value, di, on the dimension that discriminates among the categories. In the example used throughout this paper, the dimension is the diameter of a circle (hence the notation d for the prototype). An exemplar, E, of category i is drawn from a Gaussian distribution with mean di and variance vi, denoted E ∼ N(di, vi). Category learning involves determining d ≡ {di}. In this work, we assume that the {vi} are fixed and given. Because d is unknown at the start of the experiment, it is treated as the value of a random vector, D ≡ {Di}. Figure 3a shows a simple graphical model representing the generative framework, in which E is the exemplar and C the category label. To formalize our discussion so far, we adopt the following notation:

P(E | C = c, D = d) ∼ N(hc d, vc),   (1)

where, for the time being, hc is a unary column vector all of whose elements are zero except for element c, which has value 1. (Subscripts may indicate either an index over elements of a vector or an index over vectors. Boldface is used for vectors and matrices.)

Figure 3: (a) Graphical model depicting selection of an exemplar, E, of a category, C, based on the prototype vector, D; (b) dynamic version of the model indexed by trial, t.

We assume that the prototype representation, D, is multivariate Gaussian, D ∼ N(Ψ, Σ), where Ψ and Σ encode knowledge—and uncertainty in the knowledge—of the category prototype structure. Given this formulation, the uncertainty in D can be integrated out:

P(E | C = c) ∼ N(hc Ψ, hc Σ hc^T + vc).   (2)

For the categorization task, a category label can be assigned by evaluating the category posterior, P(C | E), via Bayes rule, Equation 1, and the category priors, P(C).
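For the localist representation, hc simply selects element c, so hc Ψ = ψc and hc Σ hc^T = σcc, and the posterior over categories reduces to a one-dimensional Gaussian computation. The sketch below illustrates this; the prototype values, variances, and priors are illustrative choices, not the paper's fitted parameters.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def category_posterior(e, psi, sigma, v, priors):
    """P(C | E = e) via Bayes rule applied to Equation 2.  In the localist
    case h_c picks out element c, so h_c Psi = psi[c] and
    h_c Sigma h_c^T = sigma[c][c]."""
    joint = [
        normal_pdf(e, psi[c], sigma[c][c] + v[c]) * priors[c]
        for c in range(len(psi))
    ]
    z = sum(joint)
    return [p / z for p in joint]

# Illustrative values: prototypes of A-D, diagonal prototype uncertainty,
# a shared intrinsic variance, and uniform priors.
psi = [1.5, 2.5, 3.5, 4.5]
sigma = [[0.01 if i == j else 0.0 for j in range(4)] for i in range(4)]
v = [0.16] * 4
priors = [0.25] * 4
post = category_posterior(2.6, psi, sigma, v, priors)
```

Here the exemplar 2.6 lies nearest the B prototype, so the posterior concentrates on category B.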
In this framework, learning takes place via trial-to-trial adaptation of the category prototype distribution, D. In Figure 3b, we add the subscript t to each random variable to denote the trial, yielding a dynamic graphical model for the sequential updating of the prototype vector, Dt. (The reader should be attentive to the fact that we use subscripted indices to denote both trials and category labels. We generally use the index t to denote a trial, and c or i to denote a category label.) The goal of our modeling work is to show that the sequential updating process leads to context effects, such as the push and pull effects discussed earlier. We propose four alternative models to explore within this framework. The four models are obtained via the Cartesian product of two binary choices: the learning rule and the prototype representation.

2.1 Learning rule

The first learning rule, maximum likelihood gradient ascent (MLGA), attempts to adjust the prototype representation so as to maximize the log posterior of the category given the exemplar. (The category, C = c, is the true label associated with the exemplar, i.e., either the target label the subject was asked to produce, or—if an error was made—the actual category label the subject did produce.) Gradient ascent is performed in all parameters of Ψ and Σ:

Δψi = εψ ∂ log P(c|e) / ∂ψi   and   Δσij = εσ ∂ log P(c|e) / ∂σij,   (3)

where εψ and εσ are step sizes. To ensure that Σ remains a covariance matrix, constrained gradient steps are applied. The constraints are: (1) diagonal terms are nonnegative, i.e., σi^2 ≥ 0; (2) off-diagonal terms are symmetric, i.e., σij = σji; and (3) the matrix remains positive definite, ensured by −1 ≤ σij/(σi σj) ≤ 1.

The second learning rule, a Kalman filter update (KFU), reestimates the uncertainty distribution of the prototypes given the evidence provided by the current exemplar and category label.
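As a concrete illustration of the MLGA rule, the update in Equation 3 can be approximated with finite-difference gradients on localist prototype means, holding Σ fixed for brevity. This is a sketch under assumed step sizes and illustrative prototype values, not the authors' implementation (which uses analytic gradients and the covariance constraints above):

```python
import math

def log_posterior(c, e, psi, var):
    """log P(c | e) for a localist model with fixed marginal variances
    (uniform category priors assumed)."""
    logls = [
        -(e - psi[k]) ** 2 / (2 * var[k]) - 0.5 * math.log(2 * math.pi * var[k])
        for k in range(len(psi))
    ]
    m = max(logls)
    log_z = m + math.log(sum(math.exp(l - m) for l in logls))  # log-sum-exp
    return logls[c] - log_z

def mlga_step(c, e, psi, var, eps=0.05, h=1e-5):
    """One gradient-ascent step on the prototype means, with the gradient
    of Equation 3 approximated by forward finite differences."""
    new_psi = []
    for i in range(len(psi)):
        bumped = psi[:i] + [psi[i] + h] + psi[i + 1:]
        grad = (log_posterior(c, e, bumped, var) - log_posterior(c, e, psi, var)) / h
        new_psi.append(psi[i] + eps * grad)
    return new_psi

psi = [1.5, 2.5, 3.5, 4.5]   # illustrative prototype means for A-D
var = [0.2] * 4              # fixed marginal variances (illustrative)
# Exemplar of category B (index 1), slightly above its prototype:
psi2 = mlga_step(1, 2.8, psi, var)
```

A single step pulls the target prototype toward the exemplar and pushes the others away from it, the pull and push effects at issue in this paper.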
To draw the correspondence between our framework and a Kalman filter: the exemplar is the scalar measurement produced by the filter, the category prototypes are the hidden state of the filter, the measurement noise is vc, and the linear mapping from state to measurement is achieved by hc. Technically, the model is a measurement-switched Kalman filter, where the switching is determined by the category label c, i.e., the measurement function, hc, and noise, vc, are conditioned on c. The Kalman filter also allows temporal dynamics via the update equation, dt = A dt−1, as well as internal process noise, whose covariance matrix is often denoted Q in standard Kalman filter notation. We investigated the choice of A and Q, but because they did not impact the qualitative outcome of the simulations, we used A = I and Q = 0. Given the correspondence we've established, the KFU equations—which specify Ψt+1 and Σt+1 as a function of ct, et, Ψt, and Σt—can be found in an introductory text (e.g., Maybeck, 1979).

Figure 4: Change to each category prototype following a trial of a given category (one panel per trial t − 1 target, A–D). Solid (open) bars indicate trials in which the exemplar is larger (smaller) than the prototype.

2.2 Representation of the prototype

The prototype representation that we described is localist: there is a one-to-one correspondence between the prototype for each category i and the random variable Di. To select the appropriate prototype given a current category c, we defined the unary vector hc and applied hc as a linear transform on D. The identical operations can be performed in conjunction with a distributed representation of the prototype. But we step back momentarily to motivate the distributed representation.
The localist representation suffers from a key weakness: it does not exploit interrelatedness constraints on category structure. The task given to experimental subjects specifies that there are four categories, and they have an ordering; the circle diameters associated with category A are smaller than the diameters associated with B, etc. Consequently, dA < dB < dC < dD . One might make a further assumption that the category prototypes are equally spaced. Exploiting these two sources of domain knowledge leads to the distributed representation of category structure. A simple sort of distributed representation involves defining the prototype for category i not as di but as a linear function of an underlying two-dimensional state-space representation of structure. In this state space, d1 indicates the distance between categories and d2 an offset for all categories. This representation of state can be achieved by applying Equation 1 and defining hc = (nc , 1), where nc is the ordinal position of the category (nA = 1, nB = 2, etc.). We augment this representation with a bit of redundancy by incorporating not only the ordinal positions but also the reverse ordinal positions; this addition yields a symmetry in the representation between the two ends of the ordinal category scale. As a result of this augmentation, d becomes a three-dimensional state space, and hc = (nc , N + 1 − nc , 1), where N is the number of categories. To summarize, both the localist and distributed representations posit the existence of a hidden-state space—unknown at the start of learning—that specifies category prototypes. The localist model assumes one dimension in the state space per prototype, whereas the distributed model assumes fewer dimensions in the state space—three, in our proposal—than there are prototypes, and computes the prototype location as a function of the state. 
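The distributed mapping from state to prototypes follows directly from the definition hc = (nc, N + 1 − nc, 1). In the sketch below, the state values are an illustrative choice that happens to yield equally spaced, ordered prototypes; they are not values from the paper.

```python
def h(c, n_categories):
    """Distributed measurement vector for 0-indexed category c:
    (ordinal position n_c, reverse ordinal position N + 1 - n_c, constant 1)."""
    n = c + 1  # ordinal position n_c, 1-based as in the text
    return (n, n_categories + 1 - n, 1)

def prototype(c, d, n_categories):
    """Category prototype as the linear function h_c . d of the state d."""
    return sum(hi * di for hi, di in zip(h(c, n_categories), d))

# Illustrative state: d[0] weights the ordinal position, d[1] the reverse
# ordinal position, and d[2] is a shared offset.
d = (0.5, -0.5, 3.0)
protos = [prototype(c, d, 4) for c in range(4)]
```

Because every prototype depends on the same three state variables, adapting the state after one trial moves all category prototypes at once, which is what gives the distributed models a chance to produce push effects.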
Both localist and distributed representations assume a fixed, known {hc} that specify the interpretation of the state space, or, in the case of the distributed model, the subject's domain knowledge about category structure.

3 Simulation Methodology

We defined a one-dimensional feature space in which categories A–D corresponded to the ranges [1, 2), [2, 3), [3, 4), and [4, 5), respectively. In the human experiment, responses were considered incorrect if they were smaller than A or larger than D; we call these two cases out-of-bounds-low (OOBL) and out-of-bounds-high (OOBH). OOBL and OOBH were treated as two additional categories, resulting in 6 categories altogether for the simulation. Subjects and the model were never asked to produce exemplars of OOBL or OOBH, but feedback was given if a response fell into these categories. As in the human experiment, our simulation involved 480 trials. We performed 100 replications of each simulation with identical initial conditions but different trial sequences, and averaged results over replications. All prototypes were initialized to have the same mean, 3.0, at the start of the simulation. Because subjects had some initial practice on the task before the start of the experimental trials, we provided the models with 12 initial trials of a categorization (not production) task, two for each of the 6 categories. (For the MLGA models, it was necessary to use a large step size on these trials to move the prototypes to roughly the correct neighborhood.) To perform the production task, the models must generate an exemplar given a category. It seems natural to draw an exemplar from the distribution in Equation 2 for P(E|C). However, this distribution reflects the full range of exemplars that lie within the category boundaries, and presumably, in the production task, subjects attempt to produce a prototypical exemplar.
Consequently, we exclude the intrinsic category variance, vc, from Equation 2 in generating exemplars, leaving variance only via uncertainty about the prototype. Each model involved selection of various parameters and initial conditions. We searched the parameter space by hand, attempting to find parameters that satisfied basic properties of the data: the accuracy and response variance in the first and second halves of the experiment. We report parameters only for the one model that was successful, MLGA-Distrib: εψ = 0.0075; εσ = 1.5 × 10^−6 for off-diagonal terms and 1.5 × 10^−7 for diagonal terms (the gradient for the diagonal terms was relatively steep); Σ0 = 0.01 I; and for all categories c, vc = 0.4^2.

4 Results

4.1 Push effect

The phenomenon that most clearly distinguishes the models is the push effect. The push effect is manifested in sequential-dependency functions, which plot the (relative) response on trial t as a function of trial t − 1. As we explained using Figures 2a,b, the signature of the push effect is a negatively sloped line for each of the different trial t target categories. The sequential-dependency functions for the four models are presented in Figures 2c–f. KFU-Local (Figure 2c) produces a flat line, indicating no push whatsoever. The explanation for this result is straightforward: the Kalman filter update alters only the variable that is responsible for the measurement (exemplar) obtained on that trial. That variable is the prototype of the target class c, Dc. We thought the lack of an interaction among the category prototypes might be overcome with KFU-Distrib, because with a distributed prototype representation, all of the state variables jointly determine the target category prototype. However, our intuition turned out to be incorrect. We experimented with many different representations and parameter settings, but KFU-Distrib consistently obtained flat or shallow positively sloped lines (Figure 2d).
MLGA-Local (Figure 2e) obtains a push effect for neighboring classes, but not for distant classes. For example, examining the dashed magenta line, note that B is pushed away by A and C, but is not affected by D. MLGA-Local maximizes the likelihood of the target category both by pulling the class-conditional density of the target category toward the exemplar and by pushing the class-conditional densities of the other categories away from the exemplar. However, if a category has little probability mass at the location of the exemplar, the increase in likelihood that results from pushing it further away is negligible, and consequently, so is the push effect. MLGA-Distrib obtains a lovely result (Figure 2f)—a negatively sloped line, diagnostic of the push effect. The effect magnitude matches that in the human data (Figure 2a), and the model captures the key property that the push effect increases with the ordinal distance of the categories. We did not build a mechanism into MLGA-Distrib to produce the push effect; it is somewhat of an emergent property of the model. The state representation of MLGA-Distrib has three components: d1, the weight of the ordinal position of a category prototype; d2, the weight of the reverse ordinal position; and d3, an offset. The last term, d3, cannot be responsible for a push effect, because it shifts all prototypes equally, and therefore can only produce a flat sequential-dependency function. Figure 4 helps provide an intuition for how d1 and d2 work together to produce the push effect. Each graph shows the average movement of the category prototype (units on the y-axis are arbitrary) observed on trial t, for each of the four categories, following presentation of a given category on trial t − 1. Positive values on the y-axis indicate increases in the prototype (movement to the right in Figure 1), and negative values indicate decreases.
Each solid vertical bar represents the movement of a given category prototype following a trial in which the exemplar is larger than its current prototype; each open vertical bar represents movement when the exemplar is to the left of its prototype. Notice that all category prototypes get larger or smaller on a given trial. But over the course of the experiment, the exemplar should be larger than the prototype as often as it is smaller, and the two shifts should sum together and partially cancel out. The result is the value indicated by the small horizontal bar along each line. The balance between the shifts in the two directions exactly corresponds to the push effect. Thus, the model produces a push-effect graph, but it is not truly producing a push effect as originally conceived by the experimentalists. We are currently considering empirical consequences of this simulation result. Figure 5 shows a trial-by-trial trace from MLGA-Distrib.

Figure 5: Trial-by-trial trace of MLGA-Distrib.
(a) exemplars generated on one run of the simulation; (b) the mean and (c) variance of the class prototype distribution for the 6 classes on one run; (d) mean proportion correct over 100 replications of the simulation; (e) push and pull effects, as measured by changes to the prototype means: the upper (green) curve is the pull of the target prototype mean toward the exemplar, and the lower (red) curve is the push of the nontarget prototype means away from the exemplar, over 100 replications; (f) category posterior of the generated exemplar over 100 replications, reflecting gradient ascent in the posterior.

4.2 Other phenomena accounted for

MLGA-Distrib captures the other phenomena we listed at the outset of this paper. Like all of the other models, MLGA-Distrib readily produces a pull effect, which is shown in the movement of category prototypes in Figure 5e. More observably, a pull effect is manifested as a positive correlation between two successive trials of the same category: when trial t − 1 is to the left of the true category prototype, trial t is likely to be to the left as well. In the human data, the correlation coefficient over the experiment is 0.524; in the model, the coefficient is 0.496. The explanation for the pull effect is apparent: moving the category prototype toward the exemplar increases the category likelihood. Although many learning effects in humans are based on error feedback, the experimental studies showed that push and pull effects occur even in the absence of errors, as they do in MLGA-Distrib. The model simply assumes that the target category it used to generate an exemplar is the correct category when no feedback to the contrary is provided. As long as the likelihood gradient is nonzero, category prototypes will be shifted. Pull and push effects shrink over the course of the experiment in human studies, as they do in the simulation.
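The within-category pull correlation quoted above (0.524 for humans, 0.496 for the model) can be estimated from a response sequence by correlating deviations from the true prototype on successive same-target trials. The helper and toy data below are ours, for illustration only:

```python
def pull_correlation(targets, responses, prototypes):
    """Pearson correlation between deviations (response - true prototype)
    on successive trials that share the same target category."""
    xs, ys = [], []
    for t in range(1, len(targets)):
        if targets[t] == targets[t - 1]:
            xs.append(responses[t - 1] - prototypes[targets[t - 1]])
            ys.append(responses[t] - prototypes[targets[t]])
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy sequence in which same-category repeats deviate in the same
# direction, so the correlation comes out positive (a pull effect).
targets = [0, 0, 1, 1, 0, 0]
responses = [1.6, 1.7, 2.4, 2.45, 1.45, 1.4]
prototypes = [1.5, 2.5]
r = pull_correlation(targets, responses, prototypes)
```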
Figure 5e shows a reduction in both pull and push, as measured by the shift of the prototype means toward or away from the exemplar. We measured the slope of MLGA-Distrib's push function (Figure 2f) for trials in the first and second halves of the simulation. The slope dropped from −0.042 to −0.025, as one would expect from Figure 5e. (These slopes are obtained by combining responses from 100 replications of the simulation. Consequently, each point on the push function is an average over 6000 trials, and the regression slopes are therefore highly reliable.) A quantitative, observable measure of pull is the standard deviation (SD) of responses. As push and pull effects diminish, SDs should decrease. In human subjects, the response SDs in the first and second halves of the experiment are 0.43 and 0.33, respectively. In the simulation, the response SDs are 0.51 and 0.38. This shrinkage reflects the fact that the model is approaching a local optimum in log likelihood, causing gradients—and learning steps—to become smaller. Not all model parameter settings lead to shrinkage; as in any gradient-based algorithm, step sizes that are too large do not lead to convergence. However, such parameter settings make little sense in the context of the learning objective.

4.3 Model predictions

MLGA-Distrib produces greater pull of the target category toward the exemplar than push of the neighboring categories away from the exemplar. In the simulation, the magnitude of the target pull—measured by the movement of the prototype mean—is 0.105, contrasted with the neighbor push, which is 0.017. After observing this robust result in the simulation, we found pertinent experimental data. Using the categorization paradigm, Zotov et al.
(2003) found that if the exemplar on trial t is near a category border, subjects are more likely to produce an error if the category on trial t − 1 is repeated (i.e., a pull effect just took place) than if the previous trial is of the neighboring category (i.e., a push effect), even when the distance between exemplars on t − 1 and t is matched. The greater probability of error translates to a greater magnitude of pull than push. The experimental studies noted a phenomenon termed snap back. If the same target category is presented on successive trials, and an error is made on the first trial, subjects perform very accurately on the second trial, i.e., they generate an exemplar near the true category prototype. It appears as if subjects, realizing they have been slacking, reawaken and snap the category prototype back to where it belongs. We tested the model, but observed a sort of anti-snap-back. If the model made an error on the first trial, the mean deviation was larger—not smaller—on the second trial: 0.40 versus 0.32. Thus, MLGA-Distrib fails to explain this phenomenon. However, the phenomenon is not inconsistent with the model. One might suppose that on an error trial, subjects become more attentive, and increased attention might correspond to a larger learning rate on an error trial, which should yield a more accurate response on the following trial. McLaren et al. (1995) studied a phenomenon in humans known as peak shift, in which subjects are trained to categorize unidimensional stimuli into one of two categories. Subjects are faster and more accurate when presented with exemplars far from the category boundary than with those near the boundary. In fact, they respond more efficiently to far exemplars than they do to the category prototype. The results are characterized in terms of the prototype of one category being pushed away from the prototype of the other category.
It seems straightforward to explain these data in MLGA-Distrib as a type of long-term push effect.

5 Related Work and Conclusions

Stewart, Brown, and Chater (2002) proposed an account of categorization context effects in which responses are based solely on the relative difference between the previous and present exemplars. No representation of the category prototype is maintained. However, classification based solely on relative difference cannot account for bias effects that diminish as a function of experience. A long-term stable prototype representation, of the sort incorporated into our models, seems necessary. We considered four models in our investigation, and the fact that only one accounts for the experimental data suggests that the data are nontrivial. All four models have principled theoretical underpinnings, and the space they define may suggest other elegant frameworks for understanding mechanisms of category learning. The successful model, MLGA-Distrib, offers a deep insight into understanding multiple-category domains: category structure must be considered. MLGA-Distrib exploits knowledge available to subjects performing the task concerning the ordinal relationships among categories. A model without this knowledge, MLGA-Local, fails to explain the data. Thus, the interrelatedness of categories appears to provide a source of constraint that individuals use in learning about the structure of the world.

Acknowledgments

This research was supported by NSF BCS 0339103 and NSF CSE-SMA 0509521. Support for the second author comes from an NSERC fellowship.

References

Jones, M. N., & Mewhort, D. J. K. (2003). Sequential contrast and assimilation effects in categorization of perceptual stimuli. Poster presented at the 44th Meeting of the Psychonomic Society, Vancouver, B.C.

Maybeck, P. S. (1979). Stochastic models, estimation, and control, Volume I. Academic Press.

McLaren, I. P. L., et al. (1995). Prototype effects and peak shift in categorization. JEP:LMC, 21, 662–673.

Stewart, N., Brown, G. D. A., & Chater, N. (2002). Sequence effects in categorization of simple perceptual stimuli. JEP:LMC, 28, 3–11.

Zotov, V., Jones, M. N., & Mewhort, D. J. K. (2003). Trial-to-trial representation shifts in categorization. Poster presented at the 13th Meeting of the Canadian Society for Brain, Behaviour, and Cognitive Science, Hamilton, Ontario.

3 0.10972387 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

Author: Gediminas Lukšys, Jérémie Knüsel, Denis Sheynikhovich, Carmen Sandi, Wulfram Gerstner

Abstract: Stress and genetic background regulate different aspects of behavioral learning through the action of stress hormones and neuromodulators. In reinforcement learning (RL) models, meta-parameters such as learning rate, future reward discount factor, and exploitation-exploration factor, control learning dynamics and performance. They are hypothesized to be related to neuromodulatory levels in the brain. We found that many aspects of animal learning and performance can be described by simple RL models using dynamic control of the meta-parameters. To study the effects of stress and genotype, we carried out 5-hole-box light conditioning and Morris water maze experiments with C57BL/6 and DBA/2 mouse strains. The animals were exposed to different kinds of stress to evaluate its effects on immediate performance as well as on long-term memory. Then, we used RL models to simulate their behavior. For each experimental session, we estimated a set of model meta-parameters that produced the best fit between the model and the animal performance. The dynamics of several estimated meta-parameters were qualitatively similar for the two simulated experiments, and with statistically significant differences between different genetic strains and stress conditions. 1

4 0.081854016 115 nips-2006-Learning annotated hierarchies from relational data

Author: Daniel M. Roy, Charles Kemp, Vikash K. Mansinghka, Joshua B. Tenenbaum

Abstract: The objects in many real-world domains can be organized into hierarchies, where each internal node picks out a category of objects. Given a collection of features and relations defined over a set of objects, an annotated hierarchy includes a specification of the categories that are most useful for describing each individual feature and relation. We define a generative model for annotated hierarchies and the features and relations that they describe, and develop a Markov chain Monte Carlo scheme for learning annotated hierarchies. We show that our model discovers interpretable structure in several real-world data sets.

5 0.078326061 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

Author: Seyoung Kim, Padhraic Smyth

Abstract: Data sets involving multiple groups with shared characteristics frequently arise in practice. In this paper we extend hierarchical Dirichlet processes to model such data. Each group is assumed to be generated from a template mixture model with group level variability in both the mixing proportions and the component parameters. Variabilities in mixing proportions across groups are handled using hierarchical Dirichlet processes, also allowing for automatic determination of the number of components. In addition, each group is allowed to have its own component parameters coming from a prior described by a template mixture model. This group-level variability in the component parameters is handled using a random effects model. We present a Markov Chain Monte Carlo (MCMC) sampling algorithm to estimate model parameters and demonstrate the method by applying it to the problem of modeling spatial brain activation patterns across multiple images collected via functional magnetic resonance imaging (fMRI). 1

6 0.073088437 192 nips-2006-Theory and Dynamics of Perceptual Bistability

7 0.072410613 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

8 0.07123965 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

9 0.066462591 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

10 0.06449911 141 nips-2006-Multiple timescales and uncertainty in motor adaptation

11 0.062367253 66 nips-2006-Detecting Humans via Their Pose

12 0.061946429 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

13 0.059729129 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

14 0.055738725 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

15 0.050077807 59 nips-2006-Context dependent amplification of both rate and event-correlation in a VLSI network of spiking neurons

16 0.049590517 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

17 0.04843827 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

18 0.048003439 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

19 0.047044009 185 nips-2006-Subordinate class recognition using relational object models

20 0.046632931 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.163), (1, -0.03), (2, 0.048), (3, -0.071), (4, 0.01), (5, -0.025), (6, 0.1), (7, -0.042), (8, 0.019), (9, -0.06), (10, 0.002), (11, 0.145), (12, 0.055), (13, 0.018), (14, -0.065), (15, -0.035), (16, 0.072), (17, -0.03), (18, -0.094), (19, -0.014), (20, -0.076), (21, -0.047), (22, 0.072), (23, 0.001), (24, -0.05), (25, 0.103), (26, 0.039), (27, 0.085), (28, -0.012), (29, -0.034), (30, -0.115), (31, -0.049), (32, 0.143), (33, -0.134), (34, 0.051), (35, -0.133), (36, 0.07), (37, -0.017), (38, -0.043), (39, 0.036), (40, 0.022), (41, 0.14), (42, -0.19), (43, -0.12), (44, -0.056), (45, -0.042), (46, 0.02), (47, 0.031), (48, -0.052), (49, 0.1)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93683356 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

Author: Alexander T. Ihler, Padhraic Smyth

Abstract: Data sets that characterize human activity over time through collections of timestamped events or counts are of increasing interest in application areas as humancomputer interaction, video surveillance, and Web data analysis. We propose a non-parametric Bayesian framework for modeling collections of such data. In particular, we use a Dirichlet process framework for learning a set of intensity functions corresponding to different categories, which form a basis set for representing individual time-periods (e.g., several days) depending on which categories the time-periods are assigned to. This allows the model to learn in a data-driven fashion what “factors” are generating the observations on a particular day, including (for example) weekday versus weekend effects or day-specific effects corresponding to unique (single-day) occurrences of unusual behavior, sharing information where appropriate to obtain improved estimates of the behavior associated with each category. Applications to real–world data sets of count data involving both vehicles and people are used to illustrate the technique. 1

2 0.68856561 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models

Author: Michael C. Mozer, Michael Shettel, Michael P. Holmes

Abstract: Categorization is a central activity of human cognition. When an individual is asked to categorize a sequence of items, context effects arise: categorization of one item influences category decisions for subsequent items. Specifically, when experimental subjects are shown an exemplar of some target category, the category prototype appears to be pulled toward the exemplar, and the prototypes of all nontarget categories appear to be pushed away. These push and pull effects diminish with experience, and likely reflect long-term learning of category boundaries. We propose and evaluate four principled probabilistic (Bayesian) accounts of context effects in categorization. In all four accounts, the probability of an exemplar given a category is encoded as a Gaussian density in feature space, and categorization involves computing category posteriors given an exemplar. The models differ in how the uncertainty distribution of category prototypes is represented (localist or distributed), and how it is updated following each experience (using a maximum likelihood gradient ascent, or a Kalman filter update). We find that the distributed maximum-likelihood model can explain the key experimental phenomena. Further, the model predicts other phenomena that were confirmed via reanalysis of the experimental data. Categorization is a key cognitive activity. We continually make decisions about characteristics of objects and individuals: Is the fruit ripe? Does your friend seem unhappy? Is your car tire flat? When an individual is asked to categorize a sequence of items, context effects arise: categorization of one item influences category decisions for subsequent items. Intuitive naturalistic scenarios in which context effects occur are easy to imagine. For example, if one lifts a medium-weight object after lifting a light-weight or heavy-weight object, the medium weight feels heavier following the light weight than following the heavy weight. 
Although the object-contrast effect might be due to fatigue of sensory-motor systems, many context effects in categorization are purely cognitive and cannot easily be attributed to neural habituation. For example, if you are reviewing a set of conference papers, and the first three in the set are dreadful, then even a mediocre paper seems like it might be above threshold for acceptance. Another example of a category boundary shift due to context is the following. Suppose you move from San Diego to Pittsburgh and notice that your neighbors repeatedly describe muggy, somewhat overcast days as “lovely.” Eventually, your notion of what constitutes a lovely day accommodates to your new surroundings. As we describe shortly, experimental studies have shown a fundamental link between context effects in categorization and long-term learning of category boundaries. We believe that context effects can be viewed as a reflection of trial-to-trial learning, and the cumulative effect of these trial-to-trial modulations corresponds to what we classically consider to be category learning. Consequently, any compelling model of category learning should also be capable of explaining context effects.

1 Experimental Studies of Context Effects in Categorization

Consider a set of stimuli that vary along a single continuous dimension. Throughout this paper, we use as an illustration circles of varying diameters, and assume four categories of circles defined by ranges of diameters; call them A, B, C, and D, in order from smallest to largest diameter. In a classification paradigm, experimental subjects are given an exemplar drawn from one category and are asked to respond with the correct category label (Zotov, Jones, & Mewhort, 2003). After making their response, subjects receive feedback as to the correct label, which we’ll refer to as the target.
In a production paradigm, subjects are given a target category label and asked to produce an exemplar of that category, e.g., using a computer mouse to indicate the circle diameter (Jones & Mewhort, 2003). Once a response is made, subjects receive feedback as to the correct or true category label for the exemplar they produced. Neither the classification nor the production task has sequential structure, because the order of trials is random in both experiments. The production task provides direct information about the subjects’ internal representations, because subjects are producing exemplars that they consider to be prototypes of a category, whereas the categorization task requires indirect inferences to be made about internal representations from reaction time and accuracy data. Nonetheless, the findings in the production and classification tasks mirror one another nicely, providing converging evidence as to the nature of learning. The production task reveals how mental representations shift as a function of trial-to-trial sequences, and these shifts cause the sequential pattern of errors and response times typically observed in the classification task. We focus on the production task in this paper because it provides a richer source of data. However, we address the categorization task with our models as well. Figure 1 provides a schematic depiction of the key sequential effects in categorization. The horizontal line represents the stimulus dimension, e.g., circle diameter. The dimension is cut into four regions labeled with the corresponding category. The category center, which we’ll refer to as the prototype, is indicated by a vertical dashed line. The long solid vertical line marks the current exemplar—whether it is an exemplar presented to subjects in the classification task or an exemplar generated by subjects in the production task.
Following an experimental trial with this exemplar, category prototypes appear to shift: the target-category prototype moves toward the exemplar, which we refer to as a pull effect, and all nontarget-category prototypes move away from the exemplar, which we refer to as a push effect. Push and pull effects are assessed in the production task by examining the exemplar produced on the following trial, and in the categorization task by examining the likelihood of an error response near category boundaries. The set of phenomena to be explained are as follows, described in terms of the production task. All numerical results referred to are from Jones and Mewhort (2003). This experiment consisted of 12 blocks of 40 trials, with each category label given as target 10 times within a block. • Within-category pull: When a target category is repeated on successive trials, the exemplar generated on the second trial moves toward the exemplar generated on the first trial, with respect to the true category prototype. Across the experiment, a correlation coefficient of 0.524 is obtained, and remains fairly constant over trials. • Between-category push: When the target category changes from one trial to the next, the exemplar generated on the second trial moves away from the exemplar generated on the first trial (or equivalently, from the prototype of the target category on the first trial). Figure 2a summarizes the sequential push effects from Jones and Mewhort. The diameter of the circle produced on trial t is plotted as a function of the target category on trial t − 1, with one line for each of the four trial t targets. The mean diameter for each target category is subtracted out, so the absolute vertical offset of each line is unimportant. 
The main feature of the data to note is that all four curves have a negative slope, which has the following meaning: the smaller that target t − 1 is (i.e., the further to the left on the x axis in Figure 1), the larger the response to target t is (further to the right in Figure 1), and vice versa, reflecting a push away from target t − 1. Interestingly and importantly, the magnitude of the push increases with the ordinal distance between targets t − 1 and t. Figure 2a is based on data from only eight subjects and is therefore noisy, though the effect is statistically reliable. As further evidence, Figure 2b shows data from a categorization task (Zotov et al., 2003), where the y-axis is a different dependent measure, but the negative slope has the same interpretation as in Figure 2a.

Figure 1: Schematic depiction of sequential effects in categorization. The stimulus dimension (e.g., circle diameter) is divided into regions A, B, C, and D.

Figure 2: Push effect data: response bias as a function of the previous category label, from (a) the production task of Jones and Mewhort (2003), (b) the classification task of Zotov et al. (2003), and (c)–(f) the models proposed in this paper (KFU-local, KFU-distrib, MLGA-local, MLGA-distrib). The y axis is the deviation of the response from the mean, as a proportion of the total category width. The response to category A is solid red, B is dashed magenta, C is dash-dotted blue, and D is dotted green.
• Push and pull effects are not solely a consequence of errors or experimenter feedback. In quantitative estimation of push and pull effects, trial t is included in the data only if the response on trial t − 1 is correct. Thus, the effects follow trials in which no error feedback is given to the subjects, and therefore the adjustments are not due to explicit error correction.
• Push and pull effects diminish over the course of the experiment. The magnitude of push effects can be measured by the slope of the regression lines fit to the data in Figure 2a. The slopes get shallower over successive trial blocks. The magnitude of pull effects can be measured by the standard deviation (SD) of the produced exemplars, which also decreases over successive trial blocks.
• Accuracy increases steadily over the course of the experiment, from 78% correct responses in the first block to 91% in the final block. This improvement occurs despite the fact that error feedback is relatively infrequent and becomes even less frequent as performance improves.

2 Four Models

In this paper, we explore four probabilistic (Bayesian) models to explain the data described in the previous section. The key phenomenon to explain turns out to be the push effect, for which three of the four models fail to account. Modelers typically discard the models that they reject, and present only their pet model. In this work, we find it useful to report on the rejected models for three reasons. First, they help to set up and motivate the one successful model. Second, they include several obvious candidates, and we therefore have the imperative to address them. Third, in order to evaluate a model that can explain certain data, one needs to know the degree to which the data constrain the space of models.
If many models exist that are consistent with the data, one has little reason to prefer our pet candidate. Underlying all of the models is a generative probabilistic framework in which a category i is represented by a prototype value, d_i, on the dimension that discriminates among the categories. In the example used throughout this paper, the dimension is the diameter of a circle (hence the notation d for the prototype). An exemplar, E, of category i is drawn from a Gaussian distribution with mean d_i and variance v_i, denoted E ∼ N(d_i, v_i). Category learning involves determining d ≡ {d_i}. In this work, we assume that the {v_i} are fixed and given. Because d is unknown at the start of the experiment, it is treated as the value of a random vector, D ≡ {D_i}. Figure 3a shows a simple graphical model representing the generative framework, in which E is the exemplar and C the category label. To formalize our discussion so far, we adopt the following notation:

P(E | C = c, D = d) ∼ N(h_c d, v_c),  (1)

where, for the time being, h_c is a unary column vector all of whose elements are zero except for element c, which has value 1. (Subscripts may indicate either an index over elements of a vector, or an index over vectors. Boldface is used for vectors and matrices.)

Figure 3: (a) Graphical model depicting selection of an exemplar, E, of a category, C, based on the prototype vector, D; (b) dynamic version of the model indexed by trial t.

We assume that the prototype representation, D, is multivariate Gaussian, D ∼ N(Ψ, Σ), where Ψ and Σ encode knowledge—and uncertainty in the knowledge—of the category prototype structure. Given this formulation, the uncertainty in D can be integrated out:

P(E | C) ∼ N(h_c Ψ, h_c Σ h_c^T + v_c).  (2)

For the categorization task, a category label can be assigned by evaluating the category posterior, P(C | E), via Bayes rule, Equation 1, and the category priors, P(C).
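Equations 1 and 2 can be sketched numerically as follows. This is a minimal illustration with our own variable names (not the authors' code); `h` is the localist one-hot selector h_c, and the prototype uncertainty Σ is integrated out into the likelihood variance:

```python
import numpy as np

def category_posterior(e, psi, Sigma, v, priors):
    """P(C | E = e): Gaussian likelihood per Equation 2 combined with priors."""
    n = len(psi)
    log_post = np.empty(n)
    for c in range(n):
        h = np.zeros(n)
        h[c] = 1.0                      # localist one-hot selector h_c
        mean = h @ psi                  # h_c Psi
        var = h @ Sigma @ h + v[c]      # h_c Sigma h_c^T + v_c
        log_lik = -0.5 * np.log(2 * np.pi * var) - 0.5 * (e - mean) ** 2 / var
        log_post[c] = log_lik + np.log(priors[c])
    log_post -= log_post.max()          # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Illustrative values: four categories with prototype means 1.5, 2.5, 3.5, 4.5
psi = np.array([1.5, 2.5, 3.5, 4.5])
Sigma = 0.01 * np.eye(4)
v = np.full(4, 0.4 ** 2)
priors = np.full(4, 0.25)
post = category_posterior(2.4, psi, Sigma, v, priors)
# an exemplar at 2.4 lies closest to B's prototype, so B gets the highest posterior
```

The same routine serves the categorization task directly: the model's classification response is the category with the largest posterior.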
In this framework, learning takes place via trial-to-trial adaptation of the category prototype distribution, D. In Figure 3b, we add the subscript t to each random variable to denote the trial, yielding a dynamic graphical model for the sequential updating of the prototype vector, D_t. (The reader should be attentive to the fact that we use subscripted indices to denote both trials and category labels. We generally use the index t to denote trial, and c or i to denote a category label.) The goal of our modeling work is to show that the sequential updating process leads to context effects, such as the push and pull effects discussed earlier. We propose four alternative models to explore within this framework. The four models are obtained via the Cartesian product of two binary choices: the learning rule and the prototype representation.

2.1 Learning rule

The first learning rule, maximum likelihood gradient ascent (MLGA), attempts to adjust the prototype representation so as to maximize the log posterior of the category given the exemplar. (The category, C = c, is the true label associated with the exemplar, i.e., either the target label the subject was asked to produce, or—if an error was made—the actual category label the subject did produce.) Gradient ascent is performed in all parameters of Ψ and Σ:

Δψ_i = ε_ψ ∂ log P(c|e) / ∂ψ_i   and   Δσ_ij = ε_σ ∂ log P(c|e) / ∂σ_ij,  (3)

where ε_ψ and ε_σ are step sizes. To ensure that Σ remains a covariance matrix, constrained gradient steps are applied. The constraints are: (1) diagonal terms are nonnegative, i.e., σ_i^2 ≥ 0; (2) off-diagonal terms are symmetric, i.e., σ_ij = σ_ji; and (3) the matrix remains positive definite, ensured by −1 ≤ σ_ij / (σ_i σ_j) ≤ 1. The second learning rule, a Kalman filter update (KFU), reestimates the uncertainty distribution of the prototypes given evidence provided by the current exemplar and category label.
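The MLGA update of Equation 3 can be sketched with a numerical gradient. This is a sketch under our own naming, not the authors' implementation: only the prototype means ψ are updated, Σ is folded into a fixed per-category variance for brevity, and priors are taken as equal:

```python
import numpy as np

def log_posterior(c, e, psi, var):
    """log P(C=c | E=e) with equal priors; var folds in prototype uncertainty."""
    log_lik = -0.5 * np.log(2 * np.pi * var) - 0.5 * (e - psi) ** 2 / var
    log_lik -= log_lik.max()            # numerical stabilization
    return np.log(np.exp(log_lik[c]) / np.exp(log_lik).sum())

def mlga_step(c, e, psi, var, step=0.05, eps=1e-5):
    """One gradient-ascent step on psi: delta psi_i = step * d log P(c|e) / d psi_i."""
    grad = np.zeros_like(psi)
    for i in range(len(psi)):
        up, dn = psi.copy(), psi.copy()
        up[i] += eps
        dn[i] -= eps
        grad[i] = (log_posterior(c, e, up, var)
                   - log_posterior(c, e, dn, var)) / (2 * eps)
    return psi + step * grad

psi = np.array([1.5, 2.5, 3.5, 4.5])
var = np.full(4, 0.4 ** 2)
new_psi = mlga_step(1, 2.8, psi, var)   # exemplar 2.8 labeled as category B
```

Note the signs that emerge: raising the posterior of the target category pulls its prototype toward the exemplar, while the nontarget prototypes move away from it, which is exactly the pull/push structure the models are meant to produce.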
To draw the correspondence between our framework and a Kalman filter: the exemplar is a scalar measurement that pops out of the filter, the category prototypes are the hidden state of the filter, the measurement noise is v_c, and the linear mapping from state to measurement is achieved by h_c. Technically, the model is a measurement-switched Kalman filter, where the switching is determined by the category label c, i.e., the measurement function, h_c, and noise, v_c, are conditioned on c. The Kalman filter also allows temporal dynamics via the update equation, d_t = A d_{t−1}, as well as internal process noise, whose covariance matrix is often denoted Q in standard Kalman filter notation. We investigated the choice of A and Q, but because they did not impact the qualitative outcome of the simulations, we used A = I and Q = 0. Given the correspondence we’ve established, the KFU equations—which specify Ψ_{t+1} and Σ_{t+1} as a function of c_t, e_t, Ψ_t, and Σ_t—can be found in an introductory text (e.g., Maybeck, 1979).

Figure 4: Change to each category prototype (prototype movement; units arbitrary) on trial t, following a trial t − 1 of a given category. Solid (open) bars indicate trials in which the exemplar is larger (smaller) than the prototype.

2.2 Representation of the prototype

The prototype representation that we described is localist: there is a one-to-one correspondence between the prototype for each category i and the random variable D_i. To select the appropriate prototype given a current category c, we defined the unary vector h_c and applied h_c as a linear transform on D. The identical operations can be performed in conjunction with a distributed representation of the prototype. But we step back momentarily to motivate the distributed representation.
The localist representation suffers from a key weakness: it does not exploit interrelatedness constraints on category structure. The task given to experimental subjects specifies that there are four categories, and they have an ordering; the circle diameters associated with category A are smaller than the diameters associated with B, etc. Consequently, d_A < d_B < d_C < d_D. One might make a further assumption that the category prototypes are equally spaced. Exploiting these two sources of domain knowledge leads to the distributed representation of category structure. A simple sort of distributed representation involves defining the prototype for category i not as d_i but as a linear function of an underlying two-dimensional state-space representation of structure. In this state space, d_1 indicates the distance between categories and d_2 an offset for all categories. This representation of state can be achieved by applying Equation 1 and defining h_c = (n_c, 1), where n_c is the ordinal position of the category (n_A = 1, n_B = 2, etc.). We augment this representation with a bit of redundancy by incorporating not only the ordinal positions but also the reverse ordinal positions; this addition yields a symmetry in the representation between the two ends of the ordinal category scale. As a result of this augmentation, d becomes a three-dimensional state space, and h_c = (n_c, N + 1 − n_c, 1), where N is the number of categories. To summarize, both the localist and distributed representations posit the existence of a hidden-state space—unknown at the start of learning—that specifies category prototypes. The localist model assumes one dimension in the state space per prototype, whereas the distributed model assumes fewer dimensions in the state space—three, in our proposal—than there are prototypes, and computes the prototype location as a function of the state.
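The distributed readout h_c = (n_c, N + 1 − n_c, 1) can be sketched concretely. The particular state vector below is our own illustrative choice, picked so that the readout reproduces equally spaced prototypes:

```python
import numpy as np

def h_vec(c, N):
    """h_c = (n_c, N + 1 - n_c, 1), with ordinal position n_c = c + 1 for 0-indexed c."""
    n = c + 1
    return np.array([n, N + 1 - n, 1.0])

N = 4
# With state d = (d1, d2, d3) = (1, 0, 0.5), the prototype of category c is
# n_c * 1 + (N + 1 - n_c) * 0 + 0.5, giving equally spaced prototypes.
d = np.array([1.0, 0.0, 0.5])
prototypes = np.array([h_vec(c, N) @ d for c in range(N)])
# prototypes -> [1.5, 2.5, 3.5, 4.5]
```

The key property this makes visible: a single change to the shared state component d_1 (or d_2) moves all category prototypes at once, which is how learning on one category can affect the others.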
Both localist and distributed representations assume a fixed, known {h_c} that specify the interpretation of the state space, or, in the case of the distributed model, the subject’s domain knowledge about category structure.

3 Simulation Methodology

We defined a one-dimensional feature space in which categories A–D corresponded to the ranges [1, 2), [2, 3), [3, 4), and [4, 5), respectively. In the human experiment, responses were considered incorrect if they were smaller than A or larger than D; we call these two cases out-of-bounds-low (OOBL) and out-of-bounds-high (OOBH). OOBL and OOBH were treated as two additional categories, resulting in 6 categories altogether for the simulation. Subjects and the model were never asked to produce exemplars of OOBL or OOBH, but feedback was given if a response fell into these categories. As in the human experiment, our simulation involved 480 trials. We performed 100 replications of each simulation with identical initial conditions but different trial sequences, and averaged results over replications. All prototypes were initialized to have the same mean, 3.0, at the start of the simulation. Because subjects had some initial practice on the task before the start of the experimental trials, we provided the models with 12 initial trials of a categorization (not production) task, two for each of the 6 categories. (For the MLGA models, it was necessary to use a large step size on these trials to move the prototypes to roughly the correct neighborhood.) To perform the production task, the models must generate an exemplar given a category. It seems natural to draw an exemplar from the distribution in Equation 2 for P(E|C). However, this distribution reflects the full range of exemplars that lie within the category boundaries, and presumably in the production task, subjects attempt to produce a prototypical exemplar.
Consequently, we exclude the intrinsic category variance, v_c, from Equation 2 in generating exemplars, leaving variance only via uncertainty about the prototype. Each model involved selection of various parameters and initial conditions. We searched the parameter space by hand, attempting to find parameters that satisfied basic properties of the data: the accuracy and response variance in the first and second halves of the experiment. We report only parameters for the one model that was successful, MLGA-Distrib: step sizes ε_ψ = 0.0075 and ε_σ = 1.5 × 10^−6 for off-diagonal terms and 1.5 × 10^−7 for diagonal terms (the gradient for the diagonal terms was relatively steep), Σ_0 = 0.01 I, and for all categories c, v_c = 0.4².

4 Results

4.1 Push effect

The phenomenon that most clearly distinguishes the models is the push effect. The push effect is manifested in sequential-dependency functions, which plot the (relative) response on trial t as a function of trial t − 1. As we explained using Figures 2a,b, the signature of the push effect is a negatively sloped line for each of the different trial t target categories. The sequential-dependency functions for the four models are presented in Figures 2c–f. KFU-Local (Figure 2c) produces a flat line, indicating no push whatsoever. The explanation for this result is straightforward: the Kalman filter update alters only the variable that is responsible for the measurement (exemplar) obtained on that trial. That variable is the prototype of the target class c, D_c. We thought the lack of an interaction among the category prototypes might be overcome with KFU-Distrib, because with a distributed prototype representation, all of the state variables jointly determine the target category prototype. However, our intuition turned out to be incorrect. We experimented with many different representations and parameter settings, but KFU-Distrib consistently obtained flat or shallow positively sloped lines (Figure 2d).
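The absence of any push in KFU-Local can be illustrated with the standard Kalman measurement update (a minimal sketch with our own names): with a one-hot h_c and a diagonal prototype covariance, the Kalman gain is zero for every nontarget prototype, so only the target prototype moves:

```python
import numpy as np

def kfu_update(c, e, psi, Sigma, v):
    """Measurement-switched Kalman update for exemplar e of category c."""
    n = len(psi)
    h = np.zeros(n)
    h[c] = 1.0                                  # localist measurement vector h_c
    S = h @ Sigma @ h + v[c]                    # innovation variance
    K = Sigma @ h / S                           # Kalman gain (n-vector)
    psi_new = psi + K * (e - h @ psi)           # mean update
    Sigma_new = Sigma - np.outer(K, h) @ Sigma  # covariance update, (I - K h) Sigma
    return psi_new, Sigma_new

psi = np.array([1.5, 2.5, 3.5, 4.5])
Sigma = 0.01 * np.eye(4)                        # diagonal: no cross-category coupling
v = np.full(4, 0.4 ** 2)
psi_new, _ = kfu_update(1, 2.8, psi, Sigma, v)
# only psi[1] changes; the nontarget prototypes are untouched, hence the flat
# sequential-dependency lines of KFU-Local
```

With a distributed h_c the gain spreads across the shared state, but as the simulations show, that coupling alone does not yield the negative slope observed in humans.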
MLGA-Local (Figure 2e) obtains a push effect for neighboring classes, but not distant classes. For example, examining the dashed magenta line, note that B is pushed away by A and C, but is not affected by D. MLGA-Local maximizes the likelihood of the target category both by pulling the class-conditional density of the target category toward the exemplar and by pushing the class-conditional densities of the other categories away from the exemplar. However, if a category has little probability mass at the location of the exemplar, the increase in likelihood that results from pushing it further away is negligible, and consequently, so is the push effect. MLGA-Distrib obtains a lovely result (Figure 2f)—a negatively-sloped line, diagnostic of the push effect. The effect magnitude matches that in the human data (Figure 2a), and captures the key property that the push effect increases with the ordinal distance of the categories. We did not build a mechanism into MLGA-Distrib to produce the push effect; it is somewhat of an emergent property of the model. The state representation of MLGA-Distrib has three components: d_1, the weight of the ordinal position of a category prototype, d_2, the weight of the reverse ordinal position, and d_3, an offset. The last term, d_3, cannot be responsible for a push effect, because it shifts all prototypes equally, and therefore can only produce a flat sequential dependency function. Figure 4 helps provide an intuition for how d_1 and d_2 work together to produce the push effect. Each graph shows the average movement of the category prototype (units on the y-axis are arbitrary) observed on trial t, for each of the four categories, following presentation of a given category on trial t − 1. Positive values on the y axis indicate increases in the prototype (movement to the right in Figure 1), and negative values decreases.
Each solid vertical bar represents the movement of a given category prototype following a trial in which the exemplar is larger than its current prototype; each open vertical bar represents movement when the exemplar is to the left of its prototype. Notice that all category prototypes get larger or smaller on a given trial. But over the course of the experiment, the exemplar should be larger than the prototype as often as it is smaller, and the two shifts should sum together and partially cancel out. The result is the value indicated by the small horizontal bar along each line. The balance between the shifts in the two directions exactly corresponds to the push effect. Thus, the model produces a push-effect graph, but it is not truly producing a push effect as was originally conceived by the experimentalists. We are currently considering empirical consequences of this simulation result. Figure 5 shows a trial-by-trial trace from MLGA-Distrib.

Figure 5: Trial-by-trial trace of MLGA-Distrib.
(a) exemplars generated on one run of the simulation; (b) the mean and (c) variance of the class prototype distribution for the 6 classes on one run; (d) mean proportion correct over 100 replications of the simulation; (e) push and pull effects, as measured by changes to the prototype means: the upper (green) curve is the pull of the target prototype mean toward the exemplar, and the lower (red) curve is the push of the nontarget prototype means away from the exemplar, over 100 replications; (f) category posterior of the generated exemplar over 100 replications, reflecting gradient ascent in the posterior.

4.2 Other phenomena accounted for

MLGA-Distrib captures the other phenomena we listed at the outset of this paper. Like all of the other models, MLGA-Distrib readily produces a pull effect, which is shown in the movement of category prototypes in Figure 5e. More observably, a pull effect is manifested when two successive trials of the same category are positively correlated: when trial t − 1 is to the left of the true category prototype, trial t is likely to be to the left as well. In the human data, the correlation coefficient over the experiment is 0.524; in the model, the coefficient is 0.496. The explanation for the pull effect is apparent: moving the category prototype to the exemplar increases the category likelihood. Although many learning effects in humans are based on error feedback, the experimental studies showed that push and pull effects occur even in the absence of errors, as they do in MLGA-Distrib. The model simply assumes that the target category it used to generate an exemplar is the correct category when no feedback to the contrary is provided. As long as the likelihood gradient is nonzero, category prototypes will be shifted. Pull and push effects shrink over the course of the experiment in human studies, as they do in the simulation.
Figure 5e shows a reduction in both pull and push, as measured by the shift of the prototype means toward or away from the exemplar. We measured the slope of MLGA-Distrib’s push function (Figure 2f) for trials in the first and second half of the simulation. The slope dropped from −0.042 to −0.025, as one would expect from Figure 5e. (These slopes are obtained by combining responses from 100 replications of the simulation. Consequently, each point on the push function was an average over 6000 trials, and therefore the regression slopes are highly reliable.) A quantitative, observable measure of pull is the standard deviation (SD) of responses. As push and pull effects diminish, SDs should decrease. In human subjects, the response SDs in the first and second half of the experiment are 0.43 and 0.33, respectively. In the simulation, the response SDs are 0.51 and 0.38. This shrinkage reflects the fact that the model is approaching a local optimum in log likelihood, causing gradients—and learning steps—to become smaller. Not all model parameter settings lead to shrinkage; as in any gradient-based algorithm, step sizes that are too large do not lead to convergence. However, such parameter settings make little sense in the context of the learning objective.

4.3 Model predictions

MLGA-Distrib produces greater pull of the target category toward the exemplar than push of the neighboring categories away from the exemplar. In the simulation, the magnitude of the target pull—measured by the movement of the prototype mean—is 0.105, contrasted with the neighbor push, which is 0.017. After observing this robust result in the simulation, we found pertinent experimental data. Using the categorization paradigm, Zotov et al.
(2003) found that if the exemplar on trial t is near a category border, subjects are more likely to produce an error if the category on trial t − 1 is repeated (i.e., a pull effect just took place) than if the previous trial is of the neighboring category (i.e., a push effect), even when the distance between exemplars on t − 1 and t is matched. The greater probability of error translates to a greater magnitude of pull than push. The experimental studies noted a phenomenon termed snap back. If the same target category is presented on successive trials, and an error is made on the first trial, subjects perform very accurately on the second trial, i.e., they generate an exemplar near the true category prototype. It appears as if subjects, realizing they have been slacking, reawaken and snap the category prototype back to where it belongs. We tested the model, but observed a sort of anti snap back. If the model made an error on the first trial, the mean deviation was larger—not smaller—on the second trial: 0.40 versus 0.32. Thus, MLGA-Distrib fails to explain this phenomenon. However, the phenomenon is not inconsistent with the model. One might suppose that on an error trial, subjects become more attentive, and increased attention might correspond to a larger learning rate on an error trial, which should yield a more accurate response on the following trial. McLaren et al. (1995) studied a phenomenon in humans known as peak shift, in which subjects are trained to categorize unidimensional stimuli into one of two categories. Subjects are faster and more accurate when presented with exemplars far from the category boundary than those near the boundary. In fact, they respond more efficiently to far exemplars than they do to the category prototype. The results are characterized in terms of the prototype of one category being pushed away from the prototype of the other category. 
It seems straightforward to explain these data in MLGA-Distrib as a type of long-term push effect.

5 Related Work and Conclusions

Stewart, Brown, and Chater (2002) proposed an account of categorization context effects in which responses are based solely on the relative difference between the previous and present exemplars. No representation of the category prototype is maintained. However, classification based solely on relative difference cannot account for diminished bias effects as a function of experience. A long-term stable prototype representation, of the sort incorporated into our models, seems necessary. We considered four models in our investigation, and the fact that only one accounts for the experimental data suggests that the data are nontrivial. All four models have principled theoretical underpinnings, and the space they define may suggest other elegant frameworks for understanding mechanisms of category learning. The successful model, MLGA-Distrib, offers a deep insight into understanding multiple-category domains: category structure must be considered. MLGA-Distrib exploits knowledge available to subjects performing the task concerning the ordinal relationships among categories. A model without this knowledge, MLGA-Local, fails to explain the data. Thus, the interrelatedness of categories appears to provide a source of constraint that individuals use in learning about the structure of the world.

Acknowledgments

This research was supported by NSF BCS 0339103 and NSF CSE-SMA 0509521. Support for the second author comes from an NSERC fellowship.

References

Jones, M. N., & Mewhort, D. J. K. (2003). Sequential contrast and assimilation effects in categorization of perceptual stimuli. Poster presented at the 44th Meeting of the Psychonomic Society, Vancouver, B.C.
Maybeck, P. S. (1979). Stochastic models, estimation, and control, Volume I. Academic Press.
McLaren, I. P. L., et al. (1995). Prototype effects and peak shift in categorization.
JEP:LMC, 21, 662–673.
Stewart, N., Brown, G. D. A., & Chater, N. (2002). Sequence effects in categorization of simple perceptual stimuli. JEP:LMC, 28, 3–11.
Zotov, V., Jones, M. N., & Mewhort, D. J. K. (2003). Trial-to-trial representation shifts in categorization. Poster presented at the 13th Meeting of the Canadian Society for Brain, Behaviour, and Cognitive Science, Hamilton, Ontario.

3 0.6785816 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

Author: Gediminas Lukšys, Jérémie Knüsel, Denis Sheynikhovich, Carmen Sandi, Wulfram Gerstner

Abstract: Stress and genetic background regulate different aspects of behavioral learning through the action of stress hormones and neuromodulators. In reinforcement learning (RL) models, meta-parameters such as learning rate, future reward discount factor, and exploitation-exploration factor, control learning dynamics and performance. They are hypothesized to be related to neuromodulatory levels in the brain. We found that many aspects of animal learning and performance can be described by simple RL models using dynamic control of the meta-parameters. To study the effects of stress and genotype, we carried out 5-hole-box light conditioning and Morris water maze experiments with C57BL/6 and DBA/2 mouse strains. The animals were exposed to different kinds of stress to evaluate its effects on immediate performance as well as on long-term memory. Then, we used RL models to simulate their behavior. For each experimental session, we estimated a set of model meta-parameters that produced the best fit between the model and the animal performance. The dynamics of several estimated meta-parameters were qualitatively similar for the two simulated experiments, with statistically significant differences between different genetic strains and stress conditions. 1
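To make the three meta-parameters concrete, here is a minimal softmax Q-learning sketch in which the learning rate α, discount factor γ, and exploitation-exploration factor β appear explicitly. The toy two-state environment and all numeric values are illustrative assumptions, not the authors' model of the mouse experiments.

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Exploitation-exploration: higher beta -> greedier action choice."""
    prefs = beta * (q_values - q_values.max())
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return probs

def q_update(q, state, action, reward, next_state, alpha, gamma):
    """One TD(0) update; alpha is the learning rate, gamma the discount factor."""
    target = reward + gamma * q[next_state].max()
    q[state, action] += alpha * (target - q[state, action])
    return q

# Toy setup: two states, two actions; action 1 in state 0 yields reward 1.
q = np.zeros((2, 2))
for _ in range(100):
    q = q_update(q, state=0, action=1, reward=1.0, next_state=1,
                 alpha=0.1, gamma=0.9)
probs = softmax_policy(q[0], beta=5.0)
```

In this framing, "dynamic control of the meta-parameters" amounts to letting α, γ, and β vary over sessions, which is what the paper estimates from the animals' behavior.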

4 0.52493346 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

Author: Seyoung Kim, Padhraic Smyth

Abstract: Data sets involving multiple groups with shared characteristics frequently arise in practice. In this paper we extend hierarchical Dirichlet processes to model such data. Each group is assumed to be generated from a template mixture model with group level variability in both the mixing proportions and the component parameters. Variabilities in mixing proportions across groups are handled using hierarchical Dirichlet processes, also allowing for automatic determination of the number of components. In addition, each group is allowed to have its own component parameters coming from a prior described by a template mixture model. This group-level variability in the component parameters is handled using a random effects model. We present a Markov Chain Monte Carlo (MCMC) sampling algorithm to estimate model parameters and demonstrate the method by applying it to the problem of modeling spatial brain activation patterns across multiple images collected via functional magnetic resonance imaging (fMRI). 1
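The "automatic determination of the number of components" rests on the Chinese restaurant process underlying the Dirichlet process. A minimal seating simulation (not the authors' MCMC sampler, and without the hierarchical or random-effects structure) looks like this:

```python
import random

def crp_partition(n, alpha, seed=0):
    """Chinese restaurant process: seat n customers one at a time.
    The number of occupied tables (mixture components) is determined
    by the data, not fixed in advance."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for _ in range(n):
        # join table k with probability proportional to its occupancy;
        # open a new table with probability proportional to alpha
        weights = tables + [alpha]
        r = rng.random() * sum(weights)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(tables):
            tables.append(1)
        else:
            tables[k] += 1
    return tables

tables = crp_partition(n=200, alpha=2.0)
```

Larger concentration parameters alpha produce more tables on average; the hierarchical construction in the paper shares these tables across groups.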

5 0.51709837 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

Author: Kyung-ah Sohn, Eric P. Xing

Abstract: We present a new statistical framework called hidden Markov Dirichlet process (HMDP) to jointly model the genetic recombinations among possibly infinite number of founders and the coalescence-with-mutation events in the resulting genealogies. The HMDP posits that a haplotype of genetic markers is generated by a sequence of recombination events that select an ancestor for each locus from an unbounded set of founders according to a 1st-order Markov transition process. Conjoining this process with a mutation model, our method accommodates both between-lineage recombination and within-lineage sequence variations, and leads to a compact and natural interpretation of the population structure and inheritance process underlying haplotype data. We have developed an efficient sampling algorithm for HMDP based on a two-level nested Pólya urn scheme. On both simulated and real SNP haplotype data, our method performs competitively or significantly better than extant methods in uncovering the recombination hotspots along chromosomal loci; and in addition it also infers the ancestral genetic patterns and offers a highly accurate map of ancestral compositions of modern populations. 1

6 0.489521 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

7 0.45936272 192 nips-2006-Theory and Dynamics of Perceptual Bistability

8 0.45727593 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

9 0.44770133 141 nips-2006-Multiple timescales and uncertainty in motor adaptation

10 0.4323951 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

11 0.42889741 115 nips-2006-Learning annotated hierarchies from relational data

12 0.41435805 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

13 0.39362979 113 nips-2006-Learning Structural Equation Models for fMRI

14 0.39310589 41 nips-2006-Bayesian Ensemble Learning

15 0.37880656 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

16 0.35918501 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising

17 0.35408404 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

18 0.32235244 185 nips-2006-Subordinate class recognition using relational object models

19 0.32170621 197 nips-2006-Uncertainty, phase and oscillatory hippocampal recall

20 0.32117817 129 nips-2006-Map-Reduce for Machine Learning on Multicore


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.071), (3, 0.019), (6, 0.373), (7, 0.067), (9, 0.039), (20, 0.069), (22, 0.045), (44, 0.069), (57, 0.077), (65, 0.031), (69, 0.029), (71, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77164978 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

Author: Alexander T. Ihler, Padhraic Smyth

Abstract: Data sets that characterize human activity over time through collections of timestamped events or counts are of increasing interest in application areas such as human-computer interaction, video surveillance, and Web data analysis. We propose a non-parametric Bayesian framework for modeling collections of such data. In particular, we use a Dirichlet process framework for learning a set of intensity functions corresponding to different categories, which form a basis set for representing individual time-periods (e.g., several days) depending on which categories the time-periods are assigned to. This allows the model to learn in a data-driven fashion what “factors” are generating the observations on a particular day, including (for example) weekday versus weekend effects or day-specific effects corresponding to unique (single-day) occurrences of unusual behavior, sharing information where appropriate to obtain improved estimates of the behavior associated with each category. Applications to real-world data sets of count data involving both vehicles and people are used to illustrate the technique. 1
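As a toy illustration of the generative idea (category-specific intensity profiles generating per-day counts), the sketch below assumes two hand-made hourly profiles and Poisson counts. The profile shapes, bin size, and category names are invented, and the Dirichlet-process machinery for assigning days to categories is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-category intensity profiles over 24 hourly bins:
# a "weekday" shape with two rush-hour peaks and a flatter "weekend" shape.
hours = np.arange(24)
profiles = {
    "weekday": 5 + 20 * np.exp(-0.5 * ((hours - 8) / 1.5) ** 2)
                 + 20 * np.exp(-0.5 * ((hours - 17) / 1.5) ** 2),
    "weekend": 5 + 8 * np.exp(-0.5 * ((hours - 13) / 3.0) ** 2),
}

def generate_day(category):
    """Counts for one day: independent Poisson draws around the
    category's intensity in each hourly bin."""
    return rng.poisson(profiles[category])

day = generate_day("weekday")
```

In the paper's model the category assignment of each day is itself inferred, so days sharing a category pool their counts to sharpen the estimate of that category's intensity function.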

2 0.65915614 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions

Author: Christian Walder, Olivier Chapelle, Bernhard Schölkopf

Abstract: We consider the problem of constructing a function whose zero set is to represent a surface, given sample points with surface normal vectors. The contributions include a novel means of regularising multi-scale compactly supported basis functions that leads to the desirable properties previously only associated with fully supported bases, and show equivalence to a Gaussian process with modified covariance function. We also provide a regularisation framework for simpler and more direct treatment of surface normals, along with a corresponding generalisation of the representer theorem. We demonstrate the techniques on 3D problems of up to 14 million data points, as well as 4D time series data. 1

3 0.60666662 5 nips-2006-A Kernel Method for the Two-Sample-Problem

Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola

Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m^2) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
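The statistic itself is simple to compute directly. Below is a sketch of the biased squared-MMD estimate with a Gaussian RBF kernel; the bandwidth and sample sizes are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

def mmd2_biased(X, Y, gamma=1.0):
    """Biased squared MMD between samples X (m,d) and Y (n,d) with an
    RBF kernel: the squared distance between the two sample means in the
    kernel's RKHS. Cost is O((m+n)^2) kernel evaluations."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() - 2 * k(X, Y).mean() + k(Y, Y).mean()

rng = np.random.default_rng(1)
same = mmd2_biased(rng.normal(size=(200, 2)),
                   rng.normal(size=(200, 2)))
diff = mmd2_biased(rng.normal(size=(200, 2)),
                   rng.normal(loc=2.0, size=(200, 2)))
```

When the two samples come from the same distribution the statistic hovers near zero; a clear mean shift pushes it well away from zero, which is what the paper's two tests threshold against.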

4 0.56775159 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

Author: Jeremy Lewi, Robert Butera, Liam Paninski

Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efficiency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be defined a number of different ways depending on the experimental objectives. 
One reasonable choice, if we are interested in finding a neuron’s “preferred stimulus,” is the stimulus which maximizes the firing rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to define the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identification approach: we define the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in which the probability p(r_t | {x_t, x_{t−1}, ..., x_{t−t_k}}, {r_{t−1}, ..., r_{t−t_a}}) of the neural response r_t given the current and past stimuli {x_t, x_{t−1}, ..., x_{t−t_k}}, and the observed recent history of the neuron’s activity, {r_{t−1}, ..., r_{t−t_a}}, can be described by a model p(r_t | {x_t}, {r_{t−1}}, θ), specified by a finite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. Two inconvenient facts make it difficult to realize this goal in a computationally efficient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(r_t | {x_t}, {r_{t−1}}, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli x_t which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ | {r_t}, {x_t}) given new stimulus-response data, and 2) find the optimal stimulus quickly enough to be useful in an online experimental context. († http://www.prism.gatech.edu/∼gtg120z ‡ http://www.stat.columbia.edu/∼liam)
In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on finding solutions which scale well in high dimensions. We solve problem (1) by using efficient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable onedimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efficient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous firing rate) is given as the output of a generalized linear model (GLM) [8, 9]. 
This model class has been discussed extensively elsewhere; briefly, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-fire cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as:

λ_t = E(r_t) = f( Σ_i Σ_{l=1}^{t_k} k_{i,t−l} x_{i,t−l} + Σ_{j=1}^{t_a} a_j r_{t−j} )    (1)

In the above summation the filter coefficients k_{i,t−l} capture the dependence of the neuron’s instantaneous firing rate λ_t on the ith component of the vector stimulus at time t − l, x_{t−l}; the model therefore allows for spatiotemporal receptive fields. For convenience, we arrange all the stimulus coefficients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive field. The coefficients a_j model the dependence on the observed recent activity r at time t − j (these terms may reflect e.g. refractory effects, burstiness, firing-rate adaptation, etc., depending on the value of the vector a [9]). For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown filter coefficients, θ, given knowledge of the stimuli, x_t, and the resulting responses r_t. We chose the nonlinear stage of the GLM, the link function f(), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our first key task is to efficiently update the posterior distribution of θ after t trials, p(θ_t | x^t, r^t), as new stimulus-response pairs are observed. (We use x^t and r^t to abbreviate the sequences {x_t, ..., x_0} and {r_t, ..., r_0}.)

Figure 1 (plots omitted in extraction): A) Plots of the estimated receptive field for a simulated visual neuron. The neuron’s receptive field θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µ_t. The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-fit by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std.

To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justified by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θ_t | x^t, r^t): we set µ_t to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix C_t to the negative inverse of the Hessian of the log posterior at µ_t.
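For concreteness, the conditional intensity of Eq. (1) with an exponential link can be evaluated as below; the filter lengths and all numeric values are invented for illustration, not fitted parameters.

```python
import numpy as np

def glm_intensity(k, a, x_hist, r_hist):
    """lambda_t = f(k . stimulus history + a . spike history), with f = exp.
    x_hist: recent stimulus values flattened to match k;
    r_hist: recent spike counts matching a."""
    return np.exp(k @ x_hist + a @ r_hist)

k = np.array([0.5, -0.2, 0.1])   # assumed stimulus filter
a = np.array([-1.0, -0.3])       # assumed inhibitory spike-history filter
lam_quiet = glm_intensity(k, a, np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0]))
lam_refrac = glm_intensity(k, a, np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0]))
```

With negative spike-history coefficients a, a spike on the previous time step lowers the instantaneous rate, which is how the model captures refractory-like effects.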
In general, computing these terms directly requires O(td^2 + d^3) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d^3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. Therefore we further approximate p(θ_{t−1} | x^{t−1}, r^{t−1}) as Gaussian; to see how this simplifies matters, we use Bayes to write out the posterior:

log p(θ | r^t, x^t) = −(1/2)(θ − µ_{t−1})^T C_{t−1}^{−1} (θ − µ_{t−1}) − exp({x_t; r_{t−1}}^T θ) + r_t {x_t; r_{t−1}}^T θ + const

d log p(θ | r^t, x^t)/dθ = −(θ − µ_{t−1})^T C_{t−1}^{−1} + (−exp({x_t; r_{t−1}}^T θ) + r_t) {x_t; r_{t−1}}^T    (2)

d^2 log p(θ | r^t, x^t)/dθ_i dθ_j = −C_{t−1}^{−1} − exp({x_t; r_{t−1}}^T θ) {x_t; r_{t−1}} {x_t; r_{t−1}}^T    (3)

Now, to update µ_t we only need to find the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that the likelihood only varies along a single direction, {x_t; r_{t−1}}, as a function of θ. At the peak of the posterior, µ_t, the first term in the gradient must be parallel to {x_t; r_{t−1}} because the gradient is zero. Since C_{t−1} is non-singular, µ_t − µ_{t−1} must be parallel to C_{t−1} {x_t; r_{t−1}}. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction C_{t−1} {x_t; r_{t−1}}; this requires only O(d^2) time. Moreover, from the second derivative term above it is clear that computing C_t requires just a rank-one matrix update of C_{t−1}, which can be evaluated in O(d^2) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θ_{t−1} | x^{t−1}, r^{t−1}) provides a large gain in efficiency; our simulations (data not shown) showed that, despite this improved efficiency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus.
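The O(d^2) rank-one covariance update described above can be sketched via the Sherman-Morrison form of the Woodbury lemma and checked against the direct inverse. The dimension, covariance, direction vector, and curvature value D below are arbitrary test values, not quantities from the paper's experiments.

```python
import numpy as np

def rank_one_posterior_cov(C, v, D):
    """C_new = (C^{-1} - D v v^T)^{-1} computed in O(d^2) via
    Sherman-Morrison, where v plays the role of {x_t; r_{t-1}} and
    D = d^2 log p / d eps^2 (negative for a concave log-likelihood)."""
    Cv = C @ v
    return C + (D / (1.0 - D * (v @ Cv))) * np.outer(Cv, Cv)

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
C = A @ A.T + d * np.eye(d)   # a valid (positive definite) covariance
v = rng.normal(size=d)
D = -0.7                       # concave likelihood => D < 0
C_new = rank_one_posterior_cov(C, v, D)
C_direct = np.linalg.inv(np.linalg.inv(C) - D * np.outer(v, v))
```

Because D < 0 the denominator 1 − D v^T C v is strictly positive, so the update is always well defined for a concave log-likelihood.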
To simplify the derivation of our maximization strategy, we start by considering models in which the firing rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information

I(θ; r_{t+1} | x_{t+1}, x^t, r^t) = H(θ | x^t, r^t) − H(θ | x^{t+1}, r^{t+1})    (4)

with respect to the stimulus x_{t+1}. The first term does not depend on x_{t+1}, so maximizing the information requires minimizing the conditional entropy

H(θ | x^{t+1}, r^{t+1}) = Σ_{r_{t+1}} p(r_{t+1} | x_{t+1}) ∫ −p(θ | r_{t+1}, x^{t+1}) log p(θ | r_{t+1}, x^{t+1}) dθ = E_{r_{t+1}|x_{t+1}} log det[C_{t+1}] + const.    (5)

We do not average the entropy of p(θ | r_{t+1}, x^{t+1}) over x_{t+1} because we are only interested in the conditional entropy for the particular x_{t+1} which will be presented next. The equality above is due to our Gaussian approximation of p(θ | x^{t+1}, r^{t+1}). Therefore, we need to minimize E_{r_{t+1}|x_{t+1}} log det[C_{t+1}] with respect to x_{t+1}. Since we set C_{t+1} to be the negative inverse Hessian of the log-posterior, we have:

C_{t+1} = (C_t^{−1} + J_obs(r_{t+1}, x_{t+1}))^{−1},    (6)

where J_obs is the observed Fisher information,

J_obs(r_{t+1}, x_{t+1}) = −(∂^2 log p(r_{t+1} | ε = x_{t+1}^T θ)/∂ε^2) x_{t+1} x_{t+1}^T.    (7)

Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = x_{t+1}^T θ. We can use the Woodbury lemma to evaluate the inverse:

C_{t+1} = C_t (I + D(r_{t+1}, ε)(1 − D(r_{t+1}, ε) x_{t+1}^T C_t x_{t+1})^{−1} x_{t+1} x_{t+1}^T C_t)    (8)

where D(r_{t+1}, ε) = ∂^2 log p(r_{t+1} | ε)/∂ε^2. Using some basic matrix identities,

log det[C_{t+1}] = log det[C_t] − log(1 − D(r_{t+1}, ε) x_{t+1}^T C_t x_{t+1})    (9)
= log det[C_t] + D(r_{t+1}, ε) x_{t+1}^T C_t x_{t+1} + o(D(r_{t+1}, ε) x_{t+1}^T C_t x_{t+1})    (10)

Ignoring the higher order terms, we need to minimize E_{r_{t+1}|x_{t+1}} D(r_{t+1}, ε) x_{t+1}^T C_t x_{t+1}. In our case, with f(θ^T x_{t+1}) = exp(θ^T x_{t+1}), we can use the moment-generating function of the multivariate Gaussian p(θ | x^t, r^t) to evaluate this expectation.

Figure 2 (plots omitted in extraction): A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefficients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefficients, k. B) The estimated spike history coefficients, a. C) The final estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a.

After some algebra, we find that to maximize I(θ; r_{t+1} | x_{t+1}, x^t, r^t), we need to maximize

F(x_{t+1}) = exp(x_{t+1}^T µ_t) exp((1/2) x_{t+1}^T C_t x_{t+1}) x_{t+1}^T C_t x_{t+1}.    (11)

Computing the optimal stimulus. For the GLM the most informative stimulus is undefined, since increasing the stimulus power ||x_{t+1}||^2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||x_{t+1}||^2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a one-dimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of C_t yields:

F(x_{t+1}) = exp(Σ_i u_i y_i + (1/2) Σ_i c_i y_i^2) Σ_i c_i y_i^2    (12)
= g(Σ_i u_i y_i) h(Σ_i c_i y_i^2)    (13)

where u_i and y_i represent the projection of µ_t and x_{t+1} onto the ith eigenvector and c_i is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly defined by Eqn. 12.
We maximize F(x_{t+1}) by breaking the problem into an inner and outer problem by fixing the value of Σ_i u_i y_i and maximizing h() subject to that constraint. A single line search over all possible values of Σ_i u_i y_i will then find the global maximum of F(.). This approach is summarized by the equation:

max_{y: ||y||^2 = e} F(y) = max_b g(b) · max_{y: ||y||^2 = e, y^T u = b} h(Σ_i c_i y_i^2)

Since h() is increasing, to solve the inner problem we only need to solve:

max_{y: ||y||^2 = e, y^T u = b} Σ_i c_i y_i^2    (14)

This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of equations for the optimal y_i as a function of the Lagrange multiplier λ_1:

y_i(λ_1) = (√e / ||y||_2) · u_i / (2(c_i − λ_1))    (15)

Thus to find the global optimum we simply vary λ_1 (this is equivalent to performing a search over b), and compute the corresponding y(λ_1). For each value of λ_1 we compute F(y(λ_1)) and choose the stimulus y(λ_1) which maximizes F(). It is possible to show (details omitted) that the maximum of F() must occur on the interval λ_1 ≥ c_0, where c_0 is the largest eigenvalue. This restriction on the optimal λ_1 makes the implementation of the linesearch significantly faster and more stable.

Figure 3 (plots omitted in extraction): Estimating the receptive field when θ is not constant. A) The posterior means µ_t and true θ_t plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. In addition to the results for random and information-maximizing stimuli, we also show the µ_t given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µ_t on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θ_t more effectively.

To summarize, updating the posterior and finding the optimal stimulus requires three steps: 1) a rank-one matrix update and one-dimensional search to compute µ_t and C_t; 2) an eigendecomposition of C_t; 3) a one-dimensional search over λ_1 ≥ c_0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of C_t; in principle this step is O(d^3), while the other steps, as discussed above, are O(d^2). Here our Gaussian approximation of p(θ_{t−1} | x^{t−1}, r^{t−1}) is once again quite useful: recall that in this setting C_t is just a rank-one modification of C_{t−1}, and there exist efficient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modification of the eigendecomposition is still O(d^3), we found the average running time in our case to be O(d^2) (Fig. 1(c)), due to deflation which reduces the cost of matrix multiplications associated with finding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d^2) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we fixed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before.
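The outer one-dimensional search over λ_1 can be sketched as follows. The grid over λ_1, the problem size, and the renormalization back onto the power sphere are illustrative choices, not the paper's implementation.

```python
import numpy as np

def best_stimulus(u, c, e, n_grid=2000):
    """Sketch of the line search of Eqns. (12)-(15).
    u: projection of mu_t onto the eigenvectors of C_t; c: eigenvalues;
    e: squared power budget. Sweeps lambda_1 above the largest eigenvalue,
    renormalizes each candidate to ||y||^2 = e, and keeps the best
    F(y) = exp(u.y + 0.5 * sum(c_i y_i^2)) * sum(c_i y_i^2)."""
    def F(y):
        q = np.sum(c * y * y)
        return np.exp(u @ y + 0.5 * q) * q
    c0 = c.max()
    best_y, best_val = None, -np.inf
    for lam in c0 + np.geomspace(1e-3, 1e3, n_grid):
        y = u / (2.0 * (c - lam))             # stationary direction for this lambda
        y *= np.sqrt(e) / np.linalg.norm(y)   # project onto the power sphere
        val = F(y)
        if val > best_val:
            best_y, best_val = y, val
    return best_y, best_val

rng = np.random.default_rng(2)
u = rng.normal(size=6)
c = np.abs(rng.normal(size=6)) + 0.1
y_opt, F_opt = best_stimulus(u, c, e=1.0)
```

Restricting the sweep to λ_1 > c_0, as the text argues, also keeps every denominator c_i − λ_1 bounded away from zero, which is what makes the search numerically stable.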
The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to find the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefficients, k, and the spike history coefficients, a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion:

θ_{t+1} = θ_t + w_t    (16)

Here w_t is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θ_{t+1} | x^t, r^t) is Gaussian with mean µ_t and covariance C_t + Q. To update the posterior and choose the optimal stimulus, we use the same procedure as described above¹. Results Our first simulation considered the use of our algorithm for learning the receptive field of a visually sensitive neuron. We took the neuron’s receptive field to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information maximizing stimuli compared to using random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µ_t in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information maximizing strategy converges an order of magnitude more rapidly to the true θ.
These results are supported by the conclusion of [7] that the information maximization strategy is asymptotically never worse than using random stimuli and is in general more efficient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual core Intel 2.80GHz XEON processor running Matlab. The solid lines indicate fitted polynomials of degree 1 for the 1d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d^2), as predicted. When θ was less than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments. In Fig. 2 we apply our algorithm to characterize the receptive field of a neuron whose response depends on its past spiking. Here, the stimulus coefficients k were chosen to follow a sine-wave; the spike history coefficients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefficients and the spike history coefficients. The information maximizing strategy outperformed random sampling for estimating both the spike history and stimulus coefficients.

¹ The one difference is that the covariance matrix of p(θ_{t+1} | x^{t+1}, r^{t+1}) is in general no longer just a rank-one modification of the covariance matrix of p(θ_t | x^t, r^t); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of C_t + Q are those of C_t and the eigenvalues are c_i + c where c_i is the ith eigenvalue of C_t; thus in this case, our methods may be applied without modification.

Our final set of results, Fig.
3, considers a neuron whose receptive field drifts non-systematically with time. We take the receptive field to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context. Conclusion We have developed an efficient implementation of an algorithm for online optimization of neurophysiology experiments based on information-theoretic criterion. Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations the algorithm consistently provides significant improvements over random sampling; indeed, the differences in efficiency are large enough that the information-optimization strategy may permit robust system identification in cases where it is simply not otherwise feasible to estimate the neuron’s parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique significantly extends the reach and power of classical neurophysiology methods. Acknowledgments JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] I. Nelken, et al., Hearing Research 72, 237 (1994). P. Foldiak, Neurocomputing 38–40, 1217 (2001). K. 
Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004). R. C. deCharms, et al., Science 280, 1439 (1998). C. Machens, et al., Neuron 47, 447 (2005). A. Watson, et al., Perception and Psychophysics 33, 113 (1983). L. Paninski, Neural Computation 17, 1480 (2005). P. McCullagh, et al., Generalized linear models (Chapman and Hall, London, 1989). L. Paninski, Network: Computation in Neural Systems 15, 243 (2004). E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn. P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001). E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001). F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001). L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004). M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994). N. A. Lesica, et al., IEEE Trans. On Neural Systems And Rehabilitation Engineering 13, 194 (2005).
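The footnote's observation — adding white state noise Q = cI to the posterior covariance shifts every eigenvalue by c while leaving the eigenvectors untouched — can be verified numerically. The following is a minimal NumPy sketch (all variable names are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive-definite "posterior covariance" C_t.
d = 6
A = rng.standard_normal((d, d))
C_t = A @ A.T + d * np.eye(d)

# Eigendecomposition of C_t (in the algorithm this would be maintained
# cheaply between trials via rank-one updates).
eigvals, eigvecs = np.linalg.eigh(C_t)

# Add white state noise Q = cI, as in the footnote.
c = 0.5
C_next = C_t + c * np.eye(d)

# Claim: eigenvectors are unchanged and every eigenvalue shifts by c,
# so no fresh eigendecomposition is required.
eigvals_next, eigvecs_next = np.linalg.eigh(C_next)
assert np.allclose(eigvals_next, eigvals + c)
# Compare eigenvectors up to sign (eigh fixes them only up to +/- 1).
assert np.allclose(np.abs(eigvecs_next.T @ eigvecs), np.eye(d), atol=1e-8)
```

This is why the white-noise choice for Q keeps the per-trial cost at O(d²): the expensive eigendecomposition never has to be recomputed from scratch.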

5 0.4143059 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

Author: Frank Wood, Thomas L. Griffiths

Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.
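As a toy illustration of the matrix-factorization setup this abstract describes, one can reconstruct an observed matrix as a product of two latent matrices when the dimensionality k is fixed and known (the nonparametric Bayesian treatment of unbounded k is precisely what the paper contributes; this sketch only shows the underlying reconstruction problem):

```python
import numpy as np

rng = np.random.default_rng(1)

# Latent matrices: n x k loadings U and k x m features V.
n, m, k = 8, 5, 2
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, m))

# Observed data matrix: low-rank product plus a little noise.
X = U @ V + 0.01 * rng.standard_normal((n, m))

# With k fixed, a standard point estimate is the truncated SVD,
# the best rank-k reconstruction in the least-squares sense.
Uh, s, Vh = np.linalg.svd(X, full_matrices=False)
X_hat = Uh[:, :k] * s[:k] @ Vh[:k, :]

# The rank-2 reconstruction recovers X up to the small noise term.
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
assert rel_err < 0.05
```

The sketch sidesteps the hard part: choosing k. The paper instead places a nonparametric prior over factorizations of unbounded dimensionality and estimates the posterior with a particle filter rather than Gibbs sampling.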

6 0.39861616 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

7 0.39609703 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

8 0.39438665 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

9 0.39127997 41 nips-2006-Bayesian Ensemble Learning

10 0.39066538 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

11 0.38995886 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

12 0.38940957 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

13 0.38906002 42 nips-2006-Bayesian Image Super-resolution, Continued

14 0.38868612 121 nips-2006-Learning to be Bayesian without Supervision

15 0.38862053 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

16 0.38796309 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

17 0.3869482 192 nips-2006-Theory and Dynamics of Perceptual Bistability

18 0.38682497 158 nips-2006-PG-means: learning the number of clusters in data

19 0.3866353 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

20 0.38520133 167 nips-2006-Recursive ICA