nips nips2005 nips2005-111 knowledge-graph by maker-knowledge-mining

111 nips-2005-Learning Influence among Interacting Markov Chains

Source: pdf

Author: Dong Zhang, Daniel Gatica-perez, Samy Bengio, Deb Roy

Abstract: We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a two-level structure: individual-level and group-level. The individual level models the actions of each player, and the group level models the actions of the team as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We present a model that learns the influence of interacting Markov chains within a team. [sent-6, score-0.131]

2 The individual level models the actions of each player, and the group level models the actions of the team as a whole. [sent-8, score-0.486]

3 Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model. [sent-9, score-0.258]

4 1 Introduction In multi-agent systems, individuals within a group coordinate and interact to achieve a goal. [sent-10, score-0.098]

5 For instance, consider a basketball game where a team of players with different roles, such as attack and defense, collaborate and interact to win the game. [sent-11, score-0.748]

6 Each player performs a set of individual actions, evolving based on their own dynamics. [sent-12, score-0.618]

7 Actions of the team and its players are strongly correlated, and different players have different inﬂuence on the team. [sent-14, score-0.979]

8 These people, skilled at establishing the leadership, have the largest inﬂuence on the group decisions, and often shift the focus of the meeting when they speak [8]. [sent-16, score-0.199]

9 In this paper, we quantitatively investigate the influence of individual players on their team using a dynamic Bayesian network that we call the two-level influence model. [sent-17, score-0.707]

10 The proposed model explicitly learns the influence of each individual player on the team with a two-level structure. [sent-18, score-0.928]

11 In the ﬁrst level, we model actions of individual players. [sent-19, score-0.178]

12 In the second one, we model team actions as a whole. [sent-20, score-0.409]

13 The model is then applied to determine (a) the inﬂuence of players in multi-player games, and (b) the inﬂuence of participants in meetings. [sent-21, score-0.46]

14 Section 4 presents results on multi-player games, and Section 5 presents results on a meeting corpus. [sent-25, score-0.15]

15 (b) Two-level inﬂuence model (for simplicity, we omit the observation variables of individual Markov chains, and the switching parent variable Q). [sent-28, score-0.25]

16 Q is called a switching parent of $S^G$, and $\{S^1, \dots, S^N\}$ are conditional parents of $S^G$. [sent-30, score-0.196]

17 2 Two-level Influence Model The proposed model, called the two-level influence model, is a dynamic Bayesian network (DBN) with a two-level structure: the player level and the team level (Fig. [sent-32, score-0.884]

18 The player level represents the actions of individual players, evolving based on their own Markovian dynamics (Fig. [sent-34, score-0.757]

19 The team level represents group-level actions (the action belongs to the team as a whole, not to a particular player). [sent-36, score-0.728]

20 1 (b), the arrows up (from players to team) represent the inﬂuence of the individual actions on the group actions, and the arrows down (from team to players) represent the inﬂuence of the group actions on the individual actions. [sent-38, score-1.077]

21 Let $O^i$ and $S^i$ denote the observation and state of the ith player, respectively, and let $S^G$ denote the team state. [sent-39, score-0.915]

22 Regarding the player level, we model the actions of each individual with a first-order Markov model (Fig. [sent-41, score-0.739]

23 Furthermore, to capture the dynamics of all the players interacting as a team, we add a hidden variable $S^G$ (team state), which is responsible for modeling the group-level actions. [sent-43, score-0.437]

24 Unlike an individual player's state, which has its own Markovian dynamics, the team state is not directly influenced by its previous state. [sent-44, score-1.042]

25 $S^G$ can be seen as the aggregate behavior of the individuals, yet it provides a useful level of description beyond individual actions. [sent-45, score-0.114]

26 There are two kinds of relationships between the team and players: (1) The team state at time t inﬂuences the players’ states at the next time (down arrow in Fig. [sent-46, score-0.702]

27 In other words, the state of the ith player at time t + 1 depends on its previous state as well as on the team state, i. [sent-48, score-0.945]

28 (2) The team state at time t is inﬂuenced by all the players’ states at the current time (up arrow in Fig. [sent-51, score-0.403]

29 To reduce the model complexity, we add one hidden variable Q to the model to switch parents for $S^G$. [sent-53, score-0.092]

30 The idea of switching parent (also called Bayesian multi-nets in [3]) is as follows: a variable ($S^G$ in this case) has a set of parents $\{Q, S^1, \dots, S^N\}$ (Fig. [sent-54, score-0.196]

31 Q is the switching parent that determines which of the other parents to use, conditioned on the current value of the switching parent. [sent-56, score-0.251]

32 The distribution over the switching-parent variable P (Q) essentially describes how much inﬂuence or contribution the state transitions of the player variables have on the state transitions of the team variable. [sent-63, score-0.926]
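One way to read the role of P(Q) is as a set of mixture weights. The sketch below assumes the usual switching-parent construction, in which the team-state distribution is $P(S^G_t \mid S^1_t, \dots, S^N_t) = \sum_i P(Q = i)\, P(S^G_t \mid S^i_t)$. The function name, the table layout, and all numbers are hypothetical and are not taken from the paper.

```python
def team_state_distribution(influence, cond_tables, player_states):
    """Mix per-player conditionals P(S^G | S^i) with weights P(Q = i).

    influence      -- list of N weights P(Q = i), assumed to sum to 1
    cond_tables    -- cond_tables[i][s][g] = P(S^G = g | S^i = s) (hypothetical layout)
    player_states  -- current state index of each of the N players
    """
    num_team_states = len(cond_tables[0][0])
    dist = [0.0] * num_team_states
    for alpha, table, state in zip(influence, cond_tables, player_states):
        for g in range(num_team_states):
            dist[g] += alpha * table[state][g]
    return dist


# Illustrative numbers only: two players, two player states, two team states.
influence = [0.7, 0.3]
cond_tables = [
    [[0.9, 0.1], [0.2, 0.8]],  # player 0: rows index S^0, columns index S^G
    [[0.5, 0.5], [0.1, 0.9]],  # player 1: rows index S^1, columns index S^G
]
dist = team_state_distribution(influence, cond_tables, player_states=[0, 1])
```

Under this assumption the model stores N small tables plus N weights instead of one table over every joint configuration of the N player states, which is the complexity reduction the text refers to.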

33 Speciﬁcally, we used the switching parents feature of GMTK, which greatly facilitates the implementation of the two-level model to learn the inﬂuence values using the Expectation Maximization (EM) algorithm. [sent-70, score-0.147]

34 Then we use the trained emission distribution from the individual action model to initialize the emission distribution of the two-level inﬂuence model. [sent-75, score-0.233]

35 This procedure is beneﬁcial because we use data from all individual streams together, and thus have a larger amount of training data for learning. [sent-76, score-0.087]

36 3 Related Models The proposed two-level inﬂuence model is related to a number of models, namely mixedmemory Markov model (MMM) [14, 11], coupled HMM (CHMM) [13], inﬂuence model [1, 2, 6] and dynamical systems trees (DSTs) [10]. [sent-77, score-0.09]

37 The CHMM models interactions of multiple Markov chains by directly linking the current state of one stream with the previous states of all the streams (including itself): $P(S^i_t \mid S^1_{t-1}, S^2_{t-1}, \dots, S^N_{t-1})$. [sent-79, score-0.17]

38 The influence model [1, 2, 6] simplifies the state transition distribution of the CHMM into a Figure 2: (a) A snapshot of the multi-player games: four players move along the paths labeled in the map. [sent-81, score-0.469]

39 We can see that inﬂuence model and MMM take the same strategy to reduce complex models with large state spaces to a combination of simpler ones with smaller state spaces. [sent-86, score-0.126]

40 In [2, 6], the inﬂuence model was used to analyze speaking patterns in conversations (i. [sent-87, score-0.109]

41 In such a model, $\alpha_{ji}$ is regarded as the influence of the jth player on the ith player. [sent-90, score-0.55]
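For reference, the influence model of [1, 2, 6] is usually written as a convex combination of pairwise transition distributions. The form below is the standard one from that literature, written with this section's index convention (the weight of player j on player i is $\alpha_{ji}$). It is given here as a reading aid and is not recovered from the extracted text.

```latex
P(S^i_t \mid S^1_{t-1}, \dots, S^N_{t-1})
  = \sum_{j=1}^{N} \alpha_{ji}\, P(S^i_t \mid S^j_{t-1}),
\qquad \alpha_{ji} \ge 0, \quad \sum_{j=1}^{N} \alpha_{ji} = 1 .
```

If every chain has $N_S$ states, the full CHMM conditional for one chain needs a table with $N_S^{N+1}$ entries, whereas this factorization needs N tables of size $N_S \times N_S$ plus N weights.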

42 DSTs [10] have a tree structure that models interacting processes through the parent hidden Markov chains. [sent-95, score-0.114]

43 There are two differences between DSTs and our model: (1) In DSTs, the parent chain has its own Markovian dynamics, while the team state of our model is not directly inﬂuenced by the previous team state. [sent-96, score-0.755]

44 Thus, our model captures the emergent phenomena in which the group action is “nothing more” than the aggregate behaviors of individuals, yet it provides a useful level of representation beyond individual actions. [sent-97, score-0.238]

45 (2) The inﬂuence between players and team in our model is “bi-direction” (up and down arrows in Fig. [sent-98, score-0.691]

46 In DSTs, the inﬂuence between child and parent chains is “uni-direction”: parent chains could inﬂuence child chains, while child chains could not inﬂuence their parent chains. [sent-100, score-0.507]

47 4 Experiments on Synthetic Data We ﬁrst test our model on multi-player synthetic games, in which four players (labeled A-D) move along a number of predetermined paths manually labeled in a map (Fig. [sent-101, score-0.432]

48 Players B and C meticulously follow player A. [sent-103, score-0.573]

49 The initial positions and speeds of players are randomly generated. [sent-111, score-0.34]

50 The observation of an individual player is its motion trajectory in the form of a sequence of positions, $(x_1, y_1), (x_2, y_2), \dots, (x_t, y_t)$, each of which belongs to one of 20 predetermined paths in the map. [sent-112, score-0.617]

51 In experiments, we found that the ﬁnal results were not sensitive to the speciﬁc number of team states for this dataset in a wide range. [sent-115, score-0.336]

52 3 shows the learned inﬂuence value for each of the four players in the different games with respect to the number of EM iterations. [sent-132, score-0.417]

53 We can see that for Game I, player A is the leader player based on the deﬁned rules. [sent-133, score-1.062]

54 The final learned influence value for player A is almost 1, while the influence values for the remaining three players are almost 0. [sent-134, score-0.871]

55 For Game II, player A and player C are both leaders based on the deﬁned rules. [sent-135, score-1.062]

56 The learned inﬂuence values for player A and C are indeed close to 0. [sent-136, score-0.531]

57 For Game III, the four players are moving randomly, and the learned inﬂuence values are around 0. [sent-138, score-0.358]

58 25, which indicates that all players have similar inﬂuence on the team. [sent-139, score-0.34]

59 5 Experiments on Meeting Data As an application of the two-level inﬂuence model, we investigate the inﬂuence of participants in meetings. [sent-141, score-0.09]

60 We used a public meeting corpus (available at http://mmm. [sent-143, score-0.177]

61 ch), which consists of 30 ﬁve-minute four-participant meetings collected in a room equipped with synchronized multi-channel audio and video recorders [12]. [sent-145, score-0.226]

62 These meetings have pre-deﬁned topics and an action agenda, designed to ensure discussions and monologues. [sent-148, score-0.125]

63 We then report our results using audio and language features, compared with simple baseline methods. [sent-151, score-0.262]

64 1 Manually Labeling Inﬂuence Values and the Performance Measure The manual annotation of inﬂuence of meeting participants is to some degree a subjective task, as a deﬁnite ground-truth does not exist. [sent-153, score-0.314]

65 In our case, each meeting was labeled by three independent annotators who had no access to any information about the participants (e. [sent-154, score-0.296]

66 This was enforced to avoid any bias based on prior knowledge of the meeting participants (e. [sent-157, score-0.24]

67 After watching an entire meeting, the three annotators were asked to assign a probability-based value (ranging from 0 to 1, all adding up to 1) to meeting participants, which indicated their inﬂuence in the meeting (Fig. [sent-160, score-0.356]

68 Using language, the number of states equals PLSA topics plus one silence state. [sent-170, score-0.153]

69 2 Audio and Language Features We ﬁrst extract audio features useful to detect speaking turns in conversations. [sent-176, score-0.287]

70 We use a Gaussian emission probability, and set NS = 2, each state corresponding to speaking and non-speaking (silence), respectively (Fig. [sent-178, score-0.183]

71 After removing stop words, the meeting corpus contains 2175 unique terms. [sent-181, score-0.177]

72 We then employed probabilistic latent semantic analysis (PLSA) [9], which is a language model that projects documents in the high-dimensional bag-of-words space into a topic-based space of lower dimension. [sent-182, score-0.136]

73 In our case, a document corresponds to one speech utterance $(t_s, t_e, w_1 w_2 \cdots w_k)$, where $t_s$ is the start time, $t_e$ is the end time, and $w_1 w_2 \cdots w_k$ is a sequence of words. [sent-184, score-0.127]

74 We embedded PLSA into our model by treating the states of individual players as instances of PLSA topics (similar to [5]). [sent-186, score-0.507]

75 We repeat the PLSA topic within the same utterance ($t_s \le t \le t_e$). [sent-189, score-0.114]

76 The topic for the silence segments was set to 0 (Fig. [sent-190, score-0.139]
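The mapping from utterances to per-frame player states described above can be sketched as follows. This is a minimal illustration under stated assumptions: time is discretized into integer frames, each utterance already carries a single PLSA topic index in 1..K, and the helper name `utterances_to_states` is hypothetical.

```python
def utterances_to_states(utterances, num_frames, silence_state=0):
    """Build a frame-level state sequence for one participant.

    utterances -- list of (t_start, t_end, topic) with inclusive frame bounds
                  and topic in 1..K (assumed to come from a PLSA model)
    num_frames -- total number of frames in the meeting
    Frames not covered by any utterance keep the silence state (0).
    """
    states = [silence_state] * num_frames
    for t_start, t_end, topic in utterances:
        # Repeat the utterance's topic for every frame t with t_start <= t <= t_end.
        for t in range(max(t_start, 0), min(t_end, num_frames - 1) + 1):
            states[t] = topic
    return states


# Illustrative input: two utterances with topics 2 and 1 in a 9-frame meeting.
states = utterances_to_states([(1, 3, 2), (6, 7, 1)], num_frames=9)
```

With a single topic (K = 1) the sequence reduces to a speaking/silence stream, which matches the remark that audio-only features are a special case of language features.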

77 We can see that using audio-only features can be seen as a special case of using language features, by using only one topic in the PLSA model (i. [sent-192, score-0.221]

78 3 Results and Discussions We compare our model with a method based on the speaking length (how much time each of the participants speaks). [sent-198, score-0.199]

79 In this case, the influence value of a meeting participant is defined to be proportional to his speaking length: $P(Q = i) = L_i / \sum_{j=1}^{N} L_j$, where $L_i$ is the speaking length of participant i. [sent-199, score-0.396]
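The speaking-length baseline is a simple normalization, sketched below. The function name and the sample durations are illustrative assumptions, and the unit of speaking length (seconds here) is not specified in the extracted text.

```python
def speaking_length_influence(speaking_lengths):
    """Return P(Q = i) = L_i / sum_j L_j for each participant i."""
    total = sum(speaking_lengths)
    if total <= 0:
        raise ValueError("total speaking length must be positive")
    return [length / total for length in speaking_lengths]


# Hypothetical speaking lengths for four participants.
baseline = speaking_length_influence([120.0, 60.0, 90.0, 30.0])
```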

80 Figure 5: Influence values of the 4 participants (y-axis) in the 30 meetings (x-axis) (a) ground-truth (average of the three human annotations: A1 , A2 , A3 ). [sent-214, score-0.2]

81 (b) A1 : human annotation 1 (c) A2 : human annotation 2 (d) A3 : human annotation 3 (e) our model + language (f) our model + audio (g) speaking-length (h) randomization. [sent-215, score-0.598]

82 Methods compared: model + Language, model + Audio, Speaking length, Randomization. Measure: KL divergence 0. [sent-217, score-0.12]

83 Our model (using either audio or language features) outperforms the speaking-length based method, which suggests that the learned inﬂuence distributions are in better accordance with the inﬂuence distributions from human judgements. [sent-228, score-0.332]
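The comparison above relies on the KL divergence between two influence distributions over the participants. A minimal sketch of that measure follows. The argument order (human ground truth first, method output second), the base-2 logarithm, and the small floor `eps` are assumptions made for illustration, since the extracted text does not state them.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) in bits for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps so that a
    zero in q does not cause a division by zero (an illustrative choice).
    """
    return sum(
        p_i * math.log2(p_i / max(q_i, eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0
    )


# Hypothetical distributions over two participants.
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
different = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

A lower value means the learned distribution is closer to the annotators' distribution, and a value of 0 means the two coincide.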

84 4, using audio features can be seen as a special case of using language features. [sent-230, score-0.292]

85 We use language features to capture “topic turns” by factorizing the two states: “speaking, silence” into more states: “topic1, topic2, . [sent-231, score-0.136]

86 We can see that the result using language features is better than that using audio features. [sent-235, score-0.292]

87 In other words, compared with “speaking turns”, “topic turns” improves the performance of our model to learn the inﬂuence of participants in meetings. [sent-236, score-0.12]

88 It is interesting to look at the KL divergence between any pair of the three human annotations (Ai vs. [sent-237, score-0.155]

89 6(a) shows the histogram of KL divergence between any pair of human annotations for the 30 meetings. [sent-245, score-0.173]

90 With our model, we can calculate the cumulative inﬂuence of each meeting participant over time. [sent-253, score-0.215]

91 6(b) shows such an example using the two-level inﬂuence model with audio features. [sent-255, score-0.186]

92 We can see that the cumulative inﬂuence is related to the meeting agenda: The meeting starts with the monologue of person1 (monologue1). [sent-256, score-0.321]

93 Figure 6: (a) Histogram of KL divergence between any pair of the human annotations (Ai vs. [sent-275, score-0.155]

94 The dotted vertical lines indicate the predeﬁned meeting agenda. [sent-278, score-0.15]

95 The ﬁnal inﬂuence of participants becomes stable in the second discussion. [sent-281, score-0.09]

96 6 Conclusions We have presented a two-level inﬂuence model that learns the inﬂuence of all players within a team. [sent-282, score-0.37]

97 The individual level models the actions of individual players, and the group level models the group as a whole. [sent-284, score-0.564]

98 Experiments on synthetic multi-player games and a multi-party meeting corpus showed the effectiveness of the proposed model. [sent-285, score-0.258]

99 More generally, we anticipate that our approach to multi-level inﬂuence modeling may provide a means for analyzing a wide range of social dynamics to infer patterns of emergent group behaviors. [sent-286, score-0.103]

100 Modeling conversational dynamics as a mixed memory Markov process. [sent-326, score-0.114]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('player', 0.531), ('st', 0.352), ('uence', 0.344), ('players', 0.34), ('team', 0.299), ('audio', 0.156), ('meeting', 0.15), ('plsa', 0.139), ('language', 0.106), ('ns', 0.097), ('participants', 0.09), ('game', 0.087), ('silence', 0.084), ('actions', 0.08), ('speaking', 0.079), ('parent', 0.079), ('dsts', 0.07), ('meetings', 0.07), ('individual', 0.068), ('chains', 0.066), ('kl', 0.062), ('parents', 0.062), ('divergence', 0.06), ('games', 0.059), ('annotators', 0.056), ('emission', 0.056), ('switching', 0.055), ('annotations', 0.055), ('topic', 0.055), ('annotation', 0.052), ('group', 0.049), ('state', 0.048), ('ai', 0.045), ('influence', 0.044), ('participant', 0.044), ('conversational', 0.042), ('meticulously', 0.042), ('idiap', 0.041), ('human', 0.04), ('markov', 0.04), ('ot', 0.039), ('em', 0.039), ('states', 0.037), ('speech', 0.037), ('chmm', 0.036), ('interacting', 0.035), ('ng', 0.035), ('dbn', 0.033), ('snapshot', 0.033), ('dynamics', 0.032), ('gt', 0.032), ('topics', 0.032), ('martigny', 0.031), ('te', 0.031), ('model', 0.03), ('features', 0.03), ('moves', 0.029), ('agenda', 0.028), ('choudhury', 0.028), ('gmtk', 0.028), ('luence', 0.028), ('mmm', 0.028), ('timeline', 0.028), ('utterance', 0.028), ('markovian', 0.028), ('corpus', 0.027), ('level', 0.027), ('individuals', 0.027), ('kappa', 0.024), ('toolkit', 0.024), ('persons', 0.024), ('iterations', 0.024), ('uences', 0.024), ('child', 0.024), ('action', 0.023), ('synthetic', 0.022), ('switzerland', 0.022), ('uenced', 0.022), ('manual', 0.022), ('interact', 0.022), ('emergent', 0.022), ('stopped', 0.022), ('bilmes', 0.022), ('dominance', 0.022), ('turns', 0.022), ('arrows', 0.022), ('manually', 0.022), ('cumulative', 0.021), ('bengio', 0.021), ('randomization', 0.021), ('aj', 0.02), ('ith', 0.019), ('streams', 0.019), ('dj', 0.019), ('evolving', 0.019), ('aggregate', 0.019), ('arrow', 0.019), ('four', 0.018), ('observation', 0.018), ('histogram', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 111 nips-2005-Learning Influence among Interacting Markov Chains

Author: Dong Zhang, Daniel Gatica-perez, Samy Bengio, Deb Roy

2 0.25105038 53 nips-2005-Cyclic Equilibria in Markov Games

Author: Martin Zinkevich, Amy Greenwald, Michael L. Littman

Abstract: Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call “cyclic equilibria.” We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games.

3 0.10750131 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the number of agents n. We prove a worst-case lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n.

4 0.10153265 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

Author: Deepak Verma, Rajesh P. Rao

Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goal-based imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.

5 0.070446789 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

Author: Christopher Williams, John Quinn, Neil Mcintosh

Abstract: The observed physiological dynamics of an infant receiving intensive care are affected by many possible factors, including interventions to the baby, the operation of the monitoring equipment and the state of health. The Factorial Switching Kalman Filter can be used to infer the presence of such factors from a sequence of observations, and to estimate the true values where these observations have been corrupted. We apply this model to clinical time series data and show it to be effective in identifying a number of artifactual and physiological patterns.

6 0.070352986 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes

7 0.062663414 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

8 0.061424062 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

9 0.059611674 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

10 0.056734685 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

11 0.055814791 78 nips-2005-From Weighted Classification to Policy Search

12 0.054931082 36 nips-2005-Bayesian models of human action understanding

13 0.053003084 153 nips-2005-Policy-Gradient Methods for Planning

14 0.051575966 141 nips-2005-Norepinephrine and Neural Interrupts

15 0.051332783 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters

16 0.050410621 144 nips-2005-Off-policy Learning with Options and Recognizers

17 0.049811583 48 nips-2005-Context as Filtering

18 0.048043545 156 nips-2005-Prediction and Change Detection

19 0.047525022 52 nips-2005-Correlated Topic Models

20 0.045213163 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.123), (1, -0.024), (2, 0.188), (3, -0.0), (4, -0.001), (5, -0.078), (6, 0.018), (7, 0.011), (8, -0.05), (9, -0.045), (10, -0.072), (11, -0.058), (12, 0.009), (13, -0.08), (14, -0.062), (15, -0.004), (16, 0.013), (17, 0.127), (18, 0.006), (19, 0.022), (20, 0.028), (21, -0.221), (22, 0.074), (23, 0.071), (24, -0.158), (25, -0.247), (26, 0.085), (27, 0.132), (28, -0.135), (29, 0.017), (30, -0.187), (31, -0.072), (32, -0.001), (33, 0.005), (34, 0.308), (35, -0.136), (36, -0.002), (37, -0.102), (38, 0.078), (39, -0.078), (40, -0.001), (41, 0.072), (42, 0.01), (43, 0.084), (44, 0.012), (45, -0.091), (46, 0.027), (47, 0.014), (48, 0.033), (49, -0.095)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97150105 111 nips-2005-Learning Influence among Interacting Markov Chains

Author: Dong Zhang, Daniel Gatica-perez, Samy Bengio, Deb Roy

2 0.8656795 53 nips-2005-Cyclic Equilibria in Markov Games

Author: Martin Zinkevich, Amy Greenwald, Michael L. Littman

3 0.50993723 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

Author: Christopher Williams, John Quinn, Neil Mcintosh

4 0.47947595 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy

Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grows. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally.

5 0.3610374 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters

Author: Pieter Abbeel, Varun Ganapathi, Andrew Y. Ng

Abstract: We consider the problem of modeling a helicopter’s dynamics based on state-action trajectories collected from it. The contribution of this paper is two-fold. First, we consider the linear models such as learned by CIFER (the industry standard in helicopter identification), and show that the linear parameterization makes certain properties of dynamical systems, such as inertia, fundamentally difficult to capture. We propose an alternative, acceleration based, parameterization that does not suffer from this deficiency, and that can be learned as efficiently from data. Second, a Markov decision process model of a helicopter’s dynamics would explicitly model only the one-step transitions, but we are often interested in a model’s predictive performance over longer timescales. In this paper, we present an efficient algorithm for (approximately) minimizing the prediction error over long time scales. We present empirical results on two different helicopters. Although this work was motivated by the problem of modeling helicopters, the ideas presented here are general, and can be applied to modeling large classes of vehicular dynamics.

6 0.33538109 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes

7 0.31565195 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

8 0.30202472 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

9 0.27279133 174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling

10 0.26355383 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

11 0.22580349 62 nips-2005-Efficient Estimation of OOMs

12 0.22172487 171 nips-2005-Searching for Character Models

13 0.21421948 156 nips-2005-Prediction and Change Detection

14 0.2070158 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains

15 0.2037904 141 nips-2005-Norepinephrine and Neural Interrupts

16 0.20238206 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails

17 0.20043956 108 nips-2005-Layered Dynamic Textures

18 0.20035839 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

19 0.19963831 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

20 0.19861205 188 nips-2005-Temporally changing synaptic plasticity

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.048), (10, 0.027), (27, 0.056), (31, 0.086), (34, 0.051), (39, 0.014), (44, 0.021), (55, 0.04), (56, 0.331), (60, 0.012), (65, 0.015), (69, 0.041), (73, 0.025), (80, 0.012), (88, 0.075), (91, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75494951 111 nips-2005-Learning Influence among Interacting Markov Chains

Author: Dong Zhang, Daniel Gatica-perez, Samy Bengio, Deb Roy

2 0.73211813 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception

Author: Alan Stocker, Eero P. Simoncelli

Abstract: We extend a previously developed Bayesian framework for perception to account for sensory adaptation. We first note that the perceptual effects of adaptation seem inconsistent with an adjustment of the internally represented prior distribution. Instead, we postulate that adaptation increases the signal-to-noise ratio of the measurements by adapting the operational range of the measurement stage to the input range. We show that this changes the likelihood function in such a way that the Bayesian estimator model can account for reported perceptual behavior. In particular, we compare the model’s predictions to human motion discrimination data and demonstrate that the model accounts for the commonly observed perceptual adaptation effects of repulsion and enhanced discriminability. 1 Motivation A growing number of studies support the notion that humans are nearly optimal when performing perceptual estimation tasks that require the combination of sensory observations with a priori knowledge. The Bayesian formulation of these problems defines the optimal strategy, and provides a principled yet simple computational framework for perception that can account for a large number of known perceptual effects and illusions, as demonstrated in sensorimotor learning [1], cue combination [2], or visual motion perception [3], just to name a few of the many examples. Adaptation is a fundamental phenomenon in sensory perception that seems to occur at all processing levels and modalities. A variety of computational principles have been suggested as explanations for adaptation. Many of these are based on the concept of maximizing the sensory information an observer can obtain about a stimulus despite limited sensory resources [4, 5, 6]. More mechanistically, adaptation can be interpreted as the attempt of the sensory system to adjust its (limited) dynamic range such that it is maximally informative with respect to the statistics of the stimulus.
A typical example is observed in the retina, which manages to encode light intensities that vary over nine orders of magnitude using ganglion cells whose dynamic range covers only two orders of magnitude. This is achieved by adapting to the local mean as well as higher order statistics of the visual input over short time-scales [7]. If a Bayesian framework is to provide a valid computational explanation of perceptual processes, then it needs to account for the behavior of a perceptual system, regardless of its adaptation state. In general, adaptation in a sensory estimation task seems to have two fundamental effects on subsequent perception: • Repulsion: Estimates of the parameters of subsequent stimuli are repelled by those of the adaptor stimulus, i.e. the perceived values for the stimulus variable that is subject to the estimation task are more distant from the adaptor value after adaptation. This repulsive effect has been reported for perception of visual speed (e.g. [8, 9]), direction-of-motion [10], and orientation [11]. • Increased sensitivity: Adaptation increases the observer’s discrimination ability around the adaptor (e.g. for visual speed [12, 13]); however, it also seems to decrease it further away from the adaptor, as shown in the case of direction-of-motion discrimination [14]. In this paper, we show that these two perceptual effects can be explained within a Bayesian estimation framework of perception. Note that our description is at an abstract functional level - we do not attempt to provide a computational model for the underlying mechanisms responsible for adaptation, and this clearly separates this paper from other work which might seem at first glance similar [e.g., 15]. 2 Adaptive Bayesian estimator framework Suppose that an observer wants to estimate a property of a stimulus denoted by the variable θ, based on a measurement m.
In general, the measurement can be vector-valued, and is corrupted by both internal and external noise. Hence, combining the noisy information gained by the measurement m with a priori knowledge about θ is advantageous. According to Bayes’ rule

$p(\theta|m) = \frac{1}{\alpha}\, p(m|\theta)\, p(\theta)$ . (1)

That is, the probability of stimulus value θ given m (posterior) is the product of the likelihood p(m|θ) of the particular measurement and the prior p(θ). The normalization constant α serves to ensure that the posterior is a proper probability distribution. Under the assumption of a squared-error loss function, the optimal estimate $\hat\theta(m)$ is the mean of the posterior, thus

$\hat\theta(m) = \int_0^\infty \theta\, p(\theta|m)\, d\theta$ . (2)

Note that $\hat\theta(m)$ describes an estimate for a single measurement m. As discussed in [16], the measurement will vary stochastically over the course of many exposures to the same stimulus, and thus the estimator will also vary. We return to this issue in Section 3.2.

Figure 1a illustrates a Bayesian estimator, in which the shape of the (arbitrary) prior distribution leads on average to a shift of the estimate toward a lower value of θ than the true stimulus value $\theta_{\mathrm{stim}}$. The likelihood and the prior are the fundamental constituents of the Bayesian estimator model. Our goal is to describe how adaptation alters these constituents so as to account for the perceptual effects of repulsion and increased sensitivity.

Adaptation does not change the prior ...

An intuitively sensible hypothesis is that adaptation changes the prior distribution. Since the prior is meant to reflect the knowledge the observer has about the distribution of occurrences of the variable θ in the world, repeated viewing of stimuli with the same parameter value $\theta_{\mathrm{adapt}}$ should presumably increase the prior probability in the vicinity of $\theta_{\mathrm{adapt}}$. Figure 1b schematically illustrates the effect of such a change in the prior distribution. The estimated (perceived) value of the parameter under the adapted condition is attracted to the adapting parameter value. In order to account for observed perceptual repulsion effects, the prior would have to decrease at the location of the adapting parameter, a behavior that seems fundamentally inconsistent with the notion of a prior distribution.

Figure 1: Hypothetical model in which adaptation alters the prior distribution. a) Unadapted Bayesian estimation configuration in which the prior leads to a shift of the estimate $\hat\theta$, relative to the stimulus parameter $\theta_{\mathrm{stim}}$. Both the likelihood function and the prior distribution contribute to the exact value of the estimate $\hat\theta$ (mean of the posterior). b) Adaptation acts by increasing the prior distribution around the value, $\theta_{\mathrm{adapt}}$, of the adapting stimulus parameter. Consequently, a subsequent estimate $\hat\theta'$ of the same stimulus parameter value $\theta_{\mathrm{stim}}$ is attracted toward the adaptor. This is the opposite of observed perceptual effects, and we thus conclude that adjustments of the prior in a Bayesian model do not account for adaptation.

... but increases the reliability of the measurements

Since a change in the prior distribution is not consistent with repulsion, we are led to the conclusion that adaptation must change the likelihood function. But why, and how should this occur? In order to answer this question, we reconsider the functional purpose of adaptation. We assume that adaptation acts to allocate more resources to the representation of the parameter values in the vicinity of the adaptor [4], resulting in a local increase in the signal-to-noise ratio (SNR). This can be accomplished, for example, by dynamically adjusting the operational range to the statistics of the input. This kind of increased operational gain around the adaptor has been effectively demonstrated in the process of retinal adaptation [17].
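The argument that a change in the prior produces attraction rather than repulsion can be checked numerically. The following is a minimal sketch, not the authors' code: it discretizes Eqs. (1) and (2) on a grid and compares a flat prior with a prior that has extra mass around the adaptor. The grid, the Gaussian likelihood width, and the shape of the "adapted" prior are all assumed values chosen only for illustration.

```python
import numpy as np

def posterior_mean(theta, likelihood, prior):
    """Posterior mean on a uniform grid (Eqs. 1 and 2, discretized)."""
    posterior = likelihood * prior
    posterior /= posterior.sum()          # normalization, the role of 1/alpha
    return float((theta * posterior).sum())

# Hypothetical values, chosen only for illustration (degrees).
theta = np.linspace(-180.0, 180.0, 3601)
theta_adapt, m, sigma = 0.0, 10.0, 10.0

# Gaussian likelihood of constant width, centered on the measurement m.
likelihood = np.exp(-0.5 * ((m - theta) / sigma) ** 2)

flat_prior = np.ones_like(theta)
# Assumed form of an "adapted" prior: extra probability mass around the adaptor.
bumped_prior = 1.0 + 2.0 * np.exp(-0.5 * ((theta - theta_adapt) / 15.0) ** 2)

est_flat = posterior_mean(theta, likelihood, flat_prior)
est_bumped = posterior_mean(theta, likelihood, bumped_prior)
print(f"flat prior: {est_flat:.2f}   prior raised at adaptor: {est_bumped:.2f}")
```

Under these assumptions the estimate obtained with the raised prior lies between the adaptor and the measurement, i.e. it is pulled toward the adaptor, which is the attraction depicted in Figure 1b.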
In the context of our Bayesian estimator framework, and restricting to the simple case of a scalar-valued measurement, adaptation results in a narrower conditional probability density p(m|θ) in the immediate vicinity of the adaptor, thus an increase in the reliability of the measurement m. This is offset by a broadening of the conditional probability density p(m|θ) in the region beyond the adaptor vicinity (we assume that total resources are conserved, and thus an increase around the adaptor must necessarily lead to a decrease elsewhere). Figure 2 illustrates the effect of this local increase in signal-to-noise ratio on the likelihood function.

Figure 2: Measurement noise, conditionals and likelihoods. The two-dimensional conditional density, p(m|θ), is shown as a grayscale image for both the unadapted and adapted cases. We assume here that adaptation increases the reliability (SNR) of the measurement around the parameter value of the adaptor. This is balanced by a decrease in SNR of the measurement further away from the adaptor. Because the likelihood is a function of θ (horizontal slices, shown plotted at right), this results in an asymmetric change in the likelihood that is in agreement with a repulsive effect on the estimate.

Figure 3: Repulsion: Model predictions vs. human psychophysics. a) Difference in perceived direction in the pre- and post-adaptation condition, as predicted by the model. Post-adaptive percepts of motion direction are repelled away from the direction of the adaptor. b) Typical human subject data show a qualitatively similar repulsive effect. Data (and fit) are replotted from [10].

The two gray-scale images represent the conditional probability densities, p(m|θ), in the unadapted and the adapted state. They are formed by assuming additive noise on the measurement m of constant variance (unadapted) or with a variance that decreases symmetrically in the vicinity of the adaptor parameter value $\theta_{\mathrm{adapt}}$, and grows slightly in the region beyond. In the unadapted state, the likelihood is convolutional and the shape and variance are equivalent to the distribution of measurement noise. However, in the adapted state, because the likelihood is a function of θ (horizontal slice through the conditional surface) it is no longer convolutional around the adaptor. As a result, the mean is pushed away from the adaptor, as illustrated in the two graphs on the right. Assuming that the prior distribution is fairly smooth, this repulsion effect is transferred to the posterior distribution, and thus to the estimate.

3 Simulation Results

We have qualitatively demonstrated that an increase in the measurement reliability around the adaptor is consistent with the repulsive effects commonly seen as a result of perceptual adaptation. In this section, we simulate an adapted Bayesian observer by assuming a simple model for the changes in signal-to-noise ratio due to adaptation. We address both repulsion and changes in discrimination threshold. In particular, we compare our model predictions with previously published data from psychophysical experiments examining human perception of motion direction.

3.1 Repulsion

In the unadapted state, we assume the measurement noise to be additive and normally distributed, and constant over the whole measurement space. Thus, assuming that m and θ live in the same space, the likelihood is a Gaussian of constant width. In the adapted state, we assume a simple functional description for the variance of the measurement noise around the adaptor.
Specifically, we use a constant plus a difference of two Gaussians, each having equal area, with one twice as broad as the other (see Fig. 2). Finally, for simplicity, we assume a flat prior, but any reasonable smooth prior would lead to results that are qualitatively similar. Then, according to (2) we compute the predicted estimate of motion direction in both the unadapted and the adapted case. Figure 3a shows the predicted difference between the pre- and post-adaptive average estimate of direction, as a function of the stimulus direction, $\theta_{\mathrm{stim}}$. The adaptor is indicated with an arrow. The repulsive effect is clearly visible. For comparison, Figure 3b shows human subject data replotted from [10]. The perceived motion direction of a grating was estimated, under both adapted and unadapted conditions, using a two-alternative-forced-choice experimental paradigm. The plot shows the change in perceived direction as a function of test stimulus direction relative to that of the adaptor. Comparison of the two panels of Figure 3 indicates that despite the highly simplified construction of the model, the prediction is quite good, and even includes the small but consistent repulsive effects observed 180 degrees from the adaptor.

Figure 4: Discrimination thresholds: Model predictions vs. human psychophysics. a) The model predicts that thresholds for direction discrimination are reduced at the adaptor. It also predicts two side-lobes of increased threshold at further distance from the adaptor. b) Data of human psychophysics are in qualitative agreement with the model. Data are replotted from [14] (see also [11]).

3.2 Changes in discrimination threshold

Adaptation also changes the ability of human observers to discriminate between the direction of two different moving stimuli.
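The simulation of Section 3.1 can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: direction is treated as a point on a line rather than a circle, the prior is flat, and the baseline variance, amplitude, and Gaussian widths are hypothetical numbers. The `threshold` helper applies the variance-propagation relation that the paper states as Eq. (3) in Section 3.2, using a finite-difference derivative.

```python
import numpy as np

theta = np.linspace(-180.0, 180.0, 3601)   # direction grid in degrees
THETA_ADAPT = 0.0                          # assumed adaptor direction

def gauss_pdf(x, s):
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def noise_var(x, adapted):
    """Measurement-noise variance at parameter value(s) x (assumed numbers)."""
    base = 100.0                           # unadapted variance in deg^2
    if not adapted:
        return base + 0.0 * x
    d = x - THETA_ADAPT
    # constant plus a difference of two equal-area Gaussians, one twice as broad
    return base + 5000.0 * (gauss_pdf(d, 40.0) - gauss_pdf(d, 20.0))

def estimate(m, adapted):
    """Posterior mean under a flat prior (Eq. 2 evaluated on the grid)."""
    v = noise_var(theta, adapted)
    like = np.exp(-0.5 * (m - theta) ** 2 / v) / np.sqrt(v)  # p(m|theta) as a function of theta
    return float((theta * like).sum() / like.sum())

def threshold(stim, adapted, dm=0.5):
    """Standard deviation of the estimate: sqrt(var m) * |d estimate / d m| at m = stim."""
    slope = (estimate(stim + dm, adapted) - estimate(stim - dm, adapted)) / (2.0 * dm)
    return float(np.sqrt(noise_var(stim, adapted)) * abs(slope))

for stim in (-10.0, 0.0, 10.0):
    shift = estimate(stim, True) - estimate(stim, False)
    rel = threshold(stim, True) / threshold(stim, False)
    print(f"stim={stim:+.0f} deg  shift={shift:+.2f} deg  relative threshold={rel:.2f}")
```

With these assumed parameters the estimate for a stimulus near the adaptor shifts away from it, and the predicted threshold at the adaptor drops below its unadapted value. Reproducing the published curves would require the authors' actual parameters and a circular direction variable.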
In order to model discrimination thresholds, we need to consider a Bayesian framework that can account not only for the mean of the estimate but also its variability. We have recently developed such a framework, and used it to quantitatively constrain the likelihood and the prior from psychophysical data [16]. This framework accounts for the effect of the measurement noise on the variability of the estimate $\hat\theta$. Specifically, it provides a characterization of the distribution $p(\hat\theta|\theta_{\mathrm{stim}})$ of the estimate for a given stimulus direction in terms of its expected value and its variance as a function of the measurement noise. As in [16] we write

$\mathrm{var}\langle\hat\theta|\theta_{\mathrm{stim}}\rangle = \mathrm{var}\langle m\rangle \left(\frac{\partial\hat\theta(m)}{\partial m}\right)^{2} \Big|_{m=\theta_{\mathrm{stim}}}$ . (3)

Assuming that discrimination threshold is proportional to the standard deviation, $\sqrt{\mathrm{var}\langle\hat\theta|\theta_{\mathrm{stim}}\rangle}$, we can now predict how discrimination thresholds should change after adaptation. Figure 4a shows the predicted change in discrimination thresholds relative to the unadapted condition for the same model parameters as in the repulsion example (Figure 3a). Thresholds are slightly reduced at the adaptor, but increase symmetrically for directions further away from the adaptor. For comparison, Figure 4b shows the relative change in discrimination thresholds for a typical human subject [14]. Again, the behavior of the human observer is qualitatively well predicted.

4 Discussion

We have shown that adaptation can be incorporated into a Bayesian estimation framework for human sensory perception. Adaptation seems unlikely to manifest itself as a change in the internal representation of prior distributions, as this would lead to perceptual bias effects that are opposite to those observed in human subjects. Instead, we argue that adaptation leads to an increase in reliability of the measurement in the vicinity of the adapting stimulus parameter.
We show that this change in the measurement reliability results in changes of the likelihood function, and that an estimator that utilizes this likelihood function will exhibit the commonly-observed adaptation effects of repulsion and changes in discrimination threshold. We further confirm our model by making quantitative predictions and comparing them with known psychophysical data in the case of human perception of motion direction.

Many open questions remain. The results demonstrated here indicate that a resource allocation explanation is consistent with the functional effects of adaptation, but it seems unlikely that theory alone can lead to a unique quantitative prediction of the detailed form of these effects. Specifically, the constraints imposed by biological implementation are likely to play a role in determining the changes in measurement noise as a function of adaptor parameter value, and it will be important to characterize and interpret neural response changes in the context of our framework. Also, although we have argued that changes in the prior seem inconsistent with adaptation effects, it may be that such changes do occur but are offset by the likelihood effect, or occur only on much longer timescales.

Last, if one considers sensory perception as the result of a cascade of successive processing stages (with both feedforward and feedback connections), it becomes necessary to expand the Bayesian description to describe this cascade [e.g., 18, 19]. For example, it may be possible to interpret this cascade as a sequence of Bayesian estimators, in which the measurement of each stage consists of the estimate computed at the previous stage. Adaptation could potentially occur in each of these processing stages, and it is of fundamental interest to understand how such a cascade can perform useful stable computations despite the fact that each of its elements is constantly readjusting its response properties.

References

[1] K. Körding and D. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(15):244–247, January 2004.
[2] D.C. Knill and W. Richards, editors. Perception as Bayesian Inference. Cambridge University Press, 1996.
[3] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percept. Nature Neuroscience, 5(6):598–604, June 2002.
[4] H.B. Barlow. Vision: Coding and Efficiency, chapter A theory about the functional role and synaptic mechanism of visual after-effects, pages 363–375. Cambridge University Press, 1990.
[5] M.J. Wainwright. Visual adaptation as optimal information transmission. Vision Research, 39:3960–3974, 1999.
[6] N. Brenner, W. Bialek, and R. de Ruyter van Steveninck. Adaptive rescaling maximizes information transmission. Neuron, 26:695–702, June 2000.
[7] S.M. Smirnakis, M.J. Berry, D.K. Warland, W. Bialek, and M. Meister. Adaptation of retinal processing to image contrast and spatial scale. Nature, 386:69–73, March 1997.
[8] P. Thompson. Velocity after-effects: the effects of adaptation to moving stimuli on the perception of subsequently seen moving stimuli. Vision Research, 21:337–345, 1980.
[9] A.T. Smith. Velocity coding: evidence from perceived velocity shifts. Vision Research, 25(12):1969–1976, 1985.
[10] P. Schrater and E. Simoncelli. Local velocity representation: evidence from motion adaptation. Vision Research, 38:3899–3912, 1998.
[11] C.W. Clifford. Perceptual adaptation: motion parallels orientation. Trends in Cognitive Sciences, 6(3):136–143, March 2002.
[12] C. Clifford and P. Wenderoth. Adaptation to temporal modulation can enhance differential speed sensitivity. Vision Research, 39:4324–4332, 1999.
[13] A. Kristjansson. Increased sensitivity to speed changes during adaptation to first-order, but not to second-order motion. Vision Research, 41:1825–1832, 2001.
[14] R.E. Phinney, C. Bowd, and R. Patterson. Direction-selective coding of stereoscopic (cyclopean) motion. Vision Research, 37(7):865–869, 1997.
[15] N.M. Grzywacz and R.M. Balboa. A Bayesian framework for sensory adaptation. Neural Computation, 14:543–559, 2002.
[16] A.A. Stocker and E.P. Simoncelli. Constraining a Bayesian model of human visual speed perception. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems NIPS 17, pages 1361–1368, Cambridge, MA, 2005. MIT Press.
[17] D. Tranchina, J. Gordon, and R.M. Shapley. Retinal light adaptation – evidence for a feedback mechanism. Nature, 310:314–316, July 1984.
[18] S. Deneve. Bayesian inference in spiking neurons. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Adv. Neural Information Processing Systems (NIPS*04), vol 17, Cambridge, MA, 2005. MIT Press.
[19] R. Rao. Hierarchical Bayesian inference in networks of spiking neurons. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Adv. Neural Information Processing Systems (NIPS*04), vol 17, Cambridge, MA, 2005. MIT Press.

3 0.41259313 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

Author: Kristina Klinkner, Cosma Shalizi, Marcelo Camperi

Abstract: Most nervous systems encode information about stimuli in the responding activity of large neuronal networks. This activity often manifests itself as dynamically coordinated sequences of action potentials. Since multiple electrode recordings are now a standard tool in neuroscience research, it is important to have a measure of such network-wide behavioral coordination and information sharing, applicable to multiple neural spike train data. We propose a new statistic, informational coherence, which measures how much better one unit can be predicted by knowing the dynamical state of another. We argue informational coherence is a measure of association and shared information which is superior to traditional pairwise measures of synchronization and correlation. To ﬁnd the dynamical states, we use a recently-introduced algorithm which reconstructs effective state spaces from stochastic time series. We then extend the pairwise measure to a multivariate analysis of the network by estimating the network multi-information. We illustrate our method by testing it on a detailed model of the transition from gamma to beta rhythms. Much of the most important information in neural systems is shared over multiple neurons or cortical areas, in such forms as population codes and distributed representations [1]. On behavioral time scales, neural information is stored in temporal patterns of activity as opposed to static markers; therefore, as information is shared between neurons or brain regions, it is physically instantiated as coordination between entire sequences of neural spikes. Furthermore, neural systems and regions of the brain often require coordinated neural activity to perform important functions; acting in concert requires multiple neurons or cortical areas to share information [2]. 
Thus, if we want to measure the dynamic network-wide behavior of neurons and test hypotheses about them, we need reliable, practical methods to detect and quantify behavioral coordination and the associated information sharing across multiple neural units. These would be especially useful in testing ideas about how particular forms of coordination relate to distributed coding (e.g., that of [3]). Current techniques to analyze relations among spike trains handle only pairs of neurons, so we further need a method which is extendible to analyze the coordination in the network, system, or region as a whole. Here we propose a new measure of behavioral coordination and information sharing, informational coherence, based on the notion of dynamical state.

Section 1 argues that coordinated behavior in neural systems is often not captured by existing measures of synchronization or correlation, and that something sensitive to nonlinear, stochastic, predictive relationships is needed. Section 2 defines informational coherence as the (normalized) mutual information between the dynamical states of two systems and explains how looking at the states, rather than just observables, fulfills the needs laid out in Section 1. Since we rarely know the right states a priori, Section 2.1 briefly describes how we reconstruct effective state spaces from data. Section 2.2 gives some details about how we calculate the informational coherence and approximate the global information stored in the network. Section 3 applies our method to a model system (a biophysically detailed conductance-based model) comparing our results to those of more familiar second-order statistics. In the interest of space, we omit proofs and a full discussion of the existing literature, giving only minimal references here; proofs and references will appear in a longer paper now in preparation.

1 Synchrony or Coherence?

Most hypotheses which involve the idea that information sharing is reflected in coordinated activity across neural units invoke a very specific notion of coordinated activity, namely strict synchrony: the units should be doing exactly the same thing (e.g., spiking) at exactly the same time. Investigators then measure coordination by measuring how close the units come to being strictly synchronized (e.g., variance in spike times). From an informational point of view, there is no reason to favor strict synchrony over other kinds of coordination. One neuron consistently spiking 50 ms after another is just as informative a relationship as two simultaneously spiking, but such stable phase relations are missed by strict-synchrony approaches. Indeed, whatever the exact nature of the neural code, it uses temporally extended patterns of activity, and so information sharing should be reflected in coordination of those patterns, rather than just the instantaneous activity.

There are three common ways of going beyond strict synchrony: cross-correlation and related second-order statistics, mutual information, and topological generalized synchrony.

The cross-correlation function (the normalized covariance function; this includes, for present purposes, the joint peristimulus time histogram [2]), is one of the most widespread measures of synchronization. It can be efficiently calculated from observable series; it handles statistical as well as deterministic relationships between processes; by incorporating variable lags, it reduces the problem of phase locking. Fourier transformation of the covariance function $\gamma_{XY}(h)$ yields the cross-spectrum $F_{XY}(\nu)$, which in turn gives the spectral coherence $c_{XY}(\nu) = F_{XY}^{2}(\nu) / \big(F_X(\nu) F_Y(\nu)\big)$, a normalized correlation between the Fourier components of X and Y. Integrated over frequencies, the spectral coherence measures, essentially, the degree of linear cross-predictability of the two series.
([4] applies spectral coherence to coordinated neural activity.) However, such second-order statistics only handle linear relationships. Since neural processes are known to be strongly nonlinear, there is little reason to think these statistics adequately measure coordination and synchrony in neural systems. Mutual information is attractive because it handles both nonlinear and stochastic relationships and has a very natural and appealing interpretation. Unfortunately, it often seems to fail in practice, being disappointingly small even between signals which are known to be tightly coupled [5]. The major reason is that the neural codes use distinct patterns of activity over time, rather than many different instantaneous actions, and the usual approach misses these extended patterns. Consider two neurons, one of which drives the other to spike 50 ms after it does, the driving neuron spiking once every 500 ms. These are very tightly coordinated, but whether the ﬁrst neuron spiked at time t conveys little information about what the second neuron is doing at t — it’s not spiking, but it’s not spiking most of the time anyway. Mutual information calculated from the direct observations conﬂates the “no spike” of the second neuron preparing to ﬁre with its just-sitting-around “no spike”. Here, mutual information could ﬁnd the coordination if we used a 50 ms lag, but that won’t work in general. Take two rate-coding neurons with base-line ﬁring rates of 1 Hz, and suppose that a stimulus excites one to 10 Hz and suppresses the other to 0.1 Hz. The spiking rates thus share a lot of information, but whether the one neuron spiked at t is uninformative about what the other neuron did then, and lagging won’t help. Generalized synchrony is based on the idea of establishing relationships between the states of the various units. 
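The two-neuron example above (a follower that spikes 50 ms after a driver which fires every 500 ms) can be made concrete. The sketch below is an illustration under assumptions not in the original: time is binned at 1 ms, the spike trains are perfectly periodic, and mutual information is computed with a simple plug-in estimator over the binary "spike / no spike" observations.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)          # joint histogram of symbol pairs
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

# Assumed toy spike trains: 1 ms bins, 10 s of data.
T = 10_000
driver = np.zeros(T, dtype=int)
driver[::500] = 1                            # driver spikes every 500 ms
follower = np.zeros(T, dtype=int)
follower[50::500] = 1                        # follower spikes 50 ms later

mi_lag0 = mutual_information(driver, follower)
mi_lag50 = mutual_information(driver[:-50], follower[50:])
print(f"MI at lag 0: {mi_lag0:.2e} nats   MI at lag 50 ms: {mi_lag50:.2e} nats")
```

In this toy setting the zero-lag mutual information is tiny compared with the value at the correct lag, even though the two trains are perfectly coupled. As the text notes, choosing the right lag rescues this case but does not help for the rate-coding example.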
“State” here is taken in the sense of physics, dynamics and control theory: the state at time t is a variable which fixes the distribution of observables at all times ≥ t, rendering the past of the system irrelevant [6]. Knowing the state allows us to predict, as well as possible, how the system will evolve, and how it will respond to external forces [7]. Two coupled systems are said to exhibit generalized synchrony if the state of one system is given by a mapping from the state of the other. Applications to data employ state-space reconstruction [8]: if the state x ∈ X evolves according to smooth, d-dimensional deterministic dynamics, and we observe a generic function y = f(x), then the space Y of time-delay vectors [y(t), y(t − τ), ..., y(t − (k − 1)τ)] is diffeomorphic to X if k > 2d, for generic choices of lag τ. The various versions of generalized synchrony differ on how, precisely, to quantify the mappings between reconstructed state spaces, but they all appear to be empirically equivalent to one another and to notions of phase synchronization based on Hilbert transforms [5]. Thus all of these measures accommodate nonlinear relationships, and are potentially very flexible. Unfortunately, there is essentially no reason to believe that neural systems have deterministic dynamics at experimentally-accessible levels of detail, much less that there are deterministic relationships among such states for different units.

What we want, then, but none of these alternatives provides, is a quantity which measures predictive relationships among states, but allows those relationships to be nonlinear and stochastic. The next section introduces just such a measure, which we call “informational coherence”.

2 States and Informational Coherence

There are alternatives to calculating the “surface” mutual information between the sequences of observations themselves (which, as described, fails to capture coordination).
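The time-delay vectors used in classical state-space reconstruction (Section 1) are simple to construct. The helper below is a hypothetical illustration, not part of any cited package: it builds the matrix whose rows are [y(t), y(t − τ), ..., y(t − (k − 1)τ)] from a one-dimensional series, with the lag τ given in samples.

```python
import numpy as np

def delay_embed(y, k, tau):
    """Rows are [y(t), y(t - tau), ..., y(t - (k - 1) * tau)] for each valid t."""
    y = np.asarray(y)
    start = (k - 1) * tau                   # first t with a full history
    if start >= len(y):
        raise ValueError("series too short for this k and tau")
    cols = [y[start - j * tau : len(y) - j * tau] for j in range(k)]
    return np.stack(cols, axis=1)

# Small usage example on an index series, so each row is easy to read.
embedded = delay_embed(np.arange(10), k=3, tau=2)
print(embedded.shape)
print(embedded[0], embedded[-1])
```

Choosing k and τ for real data is a separate problem that this sketch does not address, and, as the text argues, the deterministic assumption behind the embedding is doubtful for neural recordings.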
If we know that the units are phase oscillators, or rate coders, we can estimate their instantaneous phase or rate and, by calculating the mutual information between those variables, see how coordinated the units’ patterns of activity are. However, phases and rates do not exhaust the repertoire of neural patterns and a more general, common scheme is desirable. The most general notion of “pattern of activity” is simply that of the dynamical state of the system, in the sense mentioned above. We now formalize this.

Assuming the usual notation for Shannon information [9], the information content of a state variable X is H[X] and the mutual information between X and Y is I[X; Y]. As is well-known, I[X; Y] ≤ min(H[X], H[Y]). We use this to normalize the mutual state information to a 0–1 scale, and this is the informational coherence (IC):

$\psi(X, Y) = \frac{I[X; Y]}{\min(H[X], H[Y])}$ , with 0/0 = 0 . (1)

ψ can be interpreted as follows. I[X; Y] is the Kullback-Leibler divergence between the joint distribution of X and Y, and the product of their marginal distributions [9], indicating the error involved in ignoring the dependence between X and Y. The mutual information between predictive, dynamical states thus gauges the error involved in assuming the two systems are independent, i.e., how much predictions could improve by taking into account the dependence. Hence it measures the amount of dynamically-relevant information shared between the two systems. ψ simply normalizes this value, and indicates the degree to which two systems have coordinated patterns of behavior (cf. [10], although this only uses directly observable quantities).

2.1 Reconstruction and Estimation of Effective State Spaces

As mentioned, the state space of a deterministic dynamical system can be reconstructed from a sequence of observations. This is the main tool of experimental nonlinear dynamics [8]; but the assumption of determinism is crucial and false, for almost any interesting neural system.
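Eq. (1) of this section translates directly into code once each unit's state sequence is available. The sketch below is a minimal illustration with plug-in entropy estimates; the function names are hypothetical, and the state sequences are assumed to be already reconstructed, aligned in time, and discrete.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def informational_coherence(x_states, y_states):
    """psi = I[X;Y] / min(H[X], H[Y]), with 0/0 taken to be 0 (Eq. 1)."""
    _, xi = np.unique(np.asarray(x_states), return_inverse=True)
    _, yi = np.unique(np.asarray(y_states), return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi.ravel(), yi.ravel()), 1.0)   # joint histogram of states
    joint /= joint.sum()
    hx = entropy(joint.sum(axis=1))
    hy = entropy(joint.sum(axis=0))
    mi = hx + hy - entropy(joint.ravel())             # I[X;Y] = H[X] + H[Y] - H[X,Y]
    denom = min(hx, hy)
    return 0.0 if denom == 0.0 else max(mi, 0.0) / denom

a = [0, 1, 2] * 4
b = ["c", "a", "b"] * 4                # a one-to-one relabeling of a
print(informational_coherence(a, b))           # states determine each other
print(informational_coherence(a, [7] * 12))    # constant partner: 0/0 case
```

The plug-in estimate is biased upward for short series, which is why the text later calls for a significance test against a null model of independent systems.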
While classical state-space reconstruction won’t work on stochastic processes, such processes do have state-space representations [11], and, in the special case of discrete-valued, discrete-time series, there are ways to reconstruct the state space. Here we use the CSSR algorithm, introduced in [12] (code available at http://bactra.org/CSSR). This produces causal state models, which are stochastic automata capable of statistically-optimal nonlinear prediction; the state of the machine is a minimal sufficient statistic for the future of the observable process [13].¹ The basic idea is to form a set of states which should be (1) Markovian, (2) sufficient statistics for the next observable, and (3) have deterministic transitions (in the automata-theory sense). The algorithm begins with a minimal, one-state, IID model, and checks whether these properties hold, by means of hypothesis tests. If they fail, the model is modified, generally but not always by adding more states, and the new model is checked again. Each state of the model corresponds to a distinct distribution over future events, i.e., to a statistical pattern of behavior. Under mild conditions, which do not involve prior knowledge of the state space, CSSR converges in probability to the unique causal state model of the data-generating process [12]. In practice, CSSR is quite fast (linear in the data size), and generalizes at least as well as training hidden Markov models with the EM algorithm and using cross-validation for selection, the standard heuristic [12].

One advantage of the causal state approach (which it shares with classical state-space reconstruction) is that state estimation is greatly simplified. In the general case of nonlinear state estimation, it is necessary to know not just the form of the stochastic dynamics in the state space and the observation function, but also their precise parametric values and the distribution of observation and driving noises.
Estimating the state from the observable time series then becomes a computationally-intensive application of Bayes’s Rule [17].

Due to the way causal states are built as statistics of the data, with probability 1 there is a finite time, t, at which the causal state at time t is certain. This is not just with some degree of belief or confidence: because of the way the states are constructed, it is impossible for the process to be in any other state at that time. Once the causal state has been established, it can be updated recursively, i.e., the causal state at time t + 1 is an explicit function of the causal state at time t and the observation at t + 1. The causal state model can be automatically converted, therefore, into a finite-state transducer which reads in an observation time series and outputs the corresponding series of states [18, 13]. (Our implementation of CSSR filters its training data automatically.) The result is a new time series of states, from which all non-predictive components have been filtered out.

2.2 Estimating the Coherence

Our algorithm for estimating the matrix of informational coherences is as follows. For each unit, we reconstruct the causal state model, and filter the observable time series to produce a series of causal states. Then, for each pair of neurons, we construct a joint histogram of the state distribution, estimate the mutual information between the states, and normalize by the single-unit state informations. This gives a symmetric matrix of ψ values.

¹ Causal state models have the same expressive power as observable operator models [14] or predictive state representations [7], and greater power than variable-length Markov models [15, 16].

Figure 1: Rastergrams of neuronal spike-times in the network. Excitatory, pyramidal neurons (numbers 1 to 1000) are shown in green, inhibitory interneurons (numbers 1001 to 1300) in red. During the first 10 seconds (a), the current connections among the pyramidal cells are suppressed and a gamma rhythm emerges (left). At t = 10s, those connections become active, leading to a beta rhythm (b, right).

Even if two systems are independent, their estimated IC will, on average, be positive, because, while they should have zero mutual information, the empirical estimate of mutual information is non-negative. Thus, the significance of IC values must be assessed against the null hypothesis of system independence. The easiest way to do so is to take the reconstructed state models for the two systems and run them forward, independently of one another, to generate a large number of simulated state sequences; from these calculate values of the IC. This procedure will approximate the sampling distribution of the IC under a null model which preserves the dynamics of each system, but not their interaction. We can then find p-values as usual. We omit them here to save space.

2.3 Approximating the Network Multi-Information

There is broad agreement [2] that analyses of networks should not just be an analysis of pairs of neurons, averaged over pairs. Ideally, an analysis of information sharing in a network would look at the over-all structure of statistical dependence between the various units, reflected in the complete joint probability distribution P of the states. This would then allow us, for instance, to calculate the n-fold multi-information, $I[X_1, X_2, \ldots X_n] \equiv D(P\|Q)$, the Kullback-Leibler divergence between the joint distribution P and the product of marginal distributions Q, analogous to the pairwise mutual information [19]. Calculated over the predictive states, the multi-information would give the total amount of shared dynamical information in the system. Just as we normalized the mutual information $I[X_1, X_2]$ by its maximum possible value, $\min(H[X_1], H[X_2])$, we normalize the multi-information by its maximum, which is the smallest sum of n − 1 marginal entropies:

$I[X_1; X_2; \ldots X_n] \le \min_k \sum_{i \ne k} H[X_i]$

Unfortunately, P is a distribution over a very high dimensional space and so is hard to estimate well without strong parametric constraints. We thus consider approximations. The lowest-order approximation treats all the units as independent; this is the distribution Q. One step up are tree distributions, where the global distribution is a function of the joint distributions of pairs of units. Not every pair of units needs to enter into such a distribution, though every unit must be part of some pair. Graphically, a tree distribution corresponds to a spanning tree, with edges linking units whose interactions enter into the global probability, and conversely spanning trees determine tree distributions. Writing $E_T$ for the set of pairs (i, j) and abbreviating $X_1 = x_1, X_2 = x_2, \ldots X_n = x_n$ by X = x, one has

$T(X = x) = \prod_{(i,j) \in E_T} \frac{T(X_i = x_i, X_j = x_j)}{T(X_i = x_i)\, T(X_j = x_j)} \prod_{i=1}^{n} T(X_i = x_i)$ (2)

where the marginal distributions $T(X_i)$ and the pair distributions $T(X_i, X_j)$ are estimated by the empirical marginal and pair distributions. We must now pick edges $E_T$ so that T best approximates the true global distribution P. A natural approach is to minimize D(P||T), the divergence between P and its tree approximation. Chow and Liu [20] showed that the maximum-weight spanning tree gives the divergence-minimizing distribution, taking an edge’s weight to be the mutual information between the variables it links.

There are three advantages to using the Chow-Liu approximation. (1) Estimating T from empirical probabilities gives a consistent maximum likelihood estimator of the ideal Chow-Liu tree [20], with reasonable rates of convergence, so T can be reliably known even if P cannot. (2) There are efficient algorithms for constructing maximum-weight spanning trees, such as Prim’s algorithm [21, sec. 23.2], which runs in time $O(n^2 + n \log n)$. Thus, the approximation is computationally tractable.
(3) The KL divergence of the Chow-Liu distribution from Q gives a lower bound on the network multi-information; that bound is just the sum of the mutual informations along the edges in the tree:

$$I[X_1; X_2; \ldots X_n] \ge D(T\|Q) = \sum_{(i,j) \in E_T} I[X_i; X_j] \qquad (3)$$

Even if we knew P exactly, Eq. 3 would be useful as an alternative to calculating D(P||Q) directly, evaluating log P(x)/Q(x) for all the exponentially-many configurations x. It is natural to seek higher-order approximations to P, e.g., using three-way interactions not decomposable into pairwise interactions [22, 19]. But it is hard to do so effectively, because finding the optimal approximation to P when such interactions are allowed is NP-hard [23], and analytical formulas like Eq. 3 generally do not exist [19]. We therefore confine ourselves to the Chow-Liu approximation here.

3 Example: A Model of Gamma and Beta Rhythms

We use simulated data as a test case, instead of empirical multiple electrode recordings, which allows us to try the method on a system of over 1000 neurons and compare the measure against expected results. The model, taken from [24], was originally designed to study episodes of gamma (30–80Hz) and beta (12–30Hz) oscillations in the mammalian nervous system, which often occur successively with a spontaneous transition between them. More concretely, the rhythms studied were those displayed by in vitro hippocampal (CA1) slice preparations and by in vivo neocortical EEGs. The model contains two neuron populations: excitatory (AMPA) pyramidal neurons and inhibitory (GABA_A) interneurons, defined by conductance-based Hodgkin-Huxley-style equations. Simulations were carried out in a network of 1000 pyramidal cells and 300 interneurons. Each cell was modeled as a one-compartment neuron with all-to-all coupling, endowed with the basic sodium and potassium spiking currents, an external applied current, and some Gaussian input noise.
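The Chow-Liu construction of Section 2.3 can be sketched in a few lines. The code below is only an illustration under stated assumptions, not the authors' implementation: it uses a plug-in histogram estimate of mutual information, a dense O(n^2) array version of Prim's algorithm, and hypothetical function names (`mutual_information`, `chow_liu_tree`). The input is assumed to be one sequence of discrete state labels per unit, such as the filtered causal-state series.

```python
import numpy as np


def mutual_information(x, y):
    """Plug-in estimate, in bits, of I[X;Y] from two discrete sequences."""
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)          # joint histogram of state pairs
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)    # marginal of X, shape (a, 1)
    py = joint.sum(axis=0, keepdims=True)    # marginal of Y, shape (1, b)
    nz = joint > 0                           # skip empty cells (0 log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))


def chow_liu_tree(states):
    """Maximum-weight spanning tree with mutual information as edge weight.

    states: array of shape (n_units, n_timesteps) holding discrete labels.
    Returns (edges, bound), where bound is the sum of mutual informations
    along the tree edges, i.e. the lower bound of Eq. 3.
    """
    states = np.asarray(states)
    n = states.shape[0]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(states[i], states[j])

    # Prim's algorithm on the complete graph, started from unit 0.
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_w = mi[0].copy()                    # best weight linking each unit to the tree
    best_from = np.zeros(n, dtype=int)       # tree unit that achieves best_w
    edges = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, -np.inf, best_w)
        j = int(np.argmax(candidates))
        edges.append((int(best_from[j]), j))
        in_tree[j] = True
        better = (mi[j] > best_w) & ~in_tree
        best_w[better] = mi[j][better]
        best_from[better] = j
    bound = float(sum(mi[i, j] for i, j in edges))
    return edges, bound
```

Dividing `bound` by the normalizer from Section 2.3 (the smallest sum of n − 1 marginal entropies) would give a global coherence value; that step is omitted here to keep the sketch short.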
The first 10 seconds of the simulation correspond to the gamma rhythm, in which only a group of neurons is made to spike via a linearly increasing applied current. The beta rhythm (subsequent 10 seconds) is obtained by activating pyramidal-pyramidal recurrent connections (potentiated by Hebbian preprocessing as a result of synchrony during the gamma rhythm) and a slow outward after-hyper-polarization (AHP) current (the M-current), suppressed during gamma due to the metabotropic activation used in the generation of the rhythm. During the beta rhythm, pyramidal cells, silent during gamma rhythm, fire on a subset of interneuron cycles (Fig. 1).

[Figure 2 (panels a to d): Heat-maps of coordination for the network, as measured by zero-lag cross-correlation (top row) and informational coherence (bottom), contrasting the gamma rhythm (left column) with the beta (right). Colors run from red (no coordination) through yellow to pale cream (maximum).]

Fig. 2 compares zero-lag cross-correlation, a second-order method of quantifying coordination, with the informational coherence calculated from the reconstructed states. (In this simulation, we could have calculated the actual states of the model neurons directly, rather than reconstructing them, but for purposes of testing our method we did not.) Cross-correlation finds some of the relationships visible in Fig. 1, but is confused by, for instance, the phase shifts between pyramidal cells. (Surface mutual information, not shown, gives similar results.) Informational coherence, however, has no trouble recognizing the two populations as effectively coordinated blocks. The presence of dynamical noise, problematic for ordinary state reconstruction, is not an issue. The average IC is 0.411 (or 0.797 if the inactive, low-numbered neurons are excluded). The tree estimate of the global informational multi-information is 3243.7 bits, with a global coherence of 0.777. The right half of Fig.
2 repeats this analysis for the beta rhythm; in this stage, the average IC is 0.614, and the tree estimate of the global multi-information is 7377.7 bits, though the estimated global coherence falls very slightly to 0.742. This is because low-numbered neurons which were quiescent before are now active, contributing to the global information, but the over-all pattern is somewhat weaker and more noisy (as can be seen from Fig. 1b.) So, as expected, the total information content is higher, but the overall coordination across the network is lower. 4 Conclusion Informational coherence provides a measure of neural information sharing and coordinated activity which accommodates nonlinear, stochastic relationships between extended patterns of spiking. It is robust to dynamical noise and leads to a genuinely multivariate measure of global coordination across networks or regions. Applied to data from multi-electrode recordings, it should be a valuable tool in evaluating hypotheses about distributed neural representation and function. Acknowledgments Thanks to R. Haslinger, E. Ionides and S. Page; and for support to the Santa Fe Institute (under grants from Intel, the NSF and the MacArthur Foundation, and DARPA agreement F30602-00-2-0583), the Clare Booth Luce Foundation (KLK) and the James S. McDonnell Foundation (CRS). References [1] L. F. Abbott and T. J. Sejnowski, eds. Neural Codes and Distributed Representations. MIT Press, 1998. [2] E. N. Brown, R. E. Kass, and P. P. Mitra. Nature Neuroscience, 7:456–461, 2004. [3] D. H. Ballard, Z. Zhang, and R. P. N. Rao. In R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki, eds., Probabilistic Models of the Brain, pp. 273–284, MIT Press, 2002. [4] D. R. Brillinger and A. E. P. Villa. In D. R. Brillinger, L. T. Fernholz, and S. Morgenthaler, eds., The Practice of Data Analysis, pp. 77–92. Princeton U.P., 1997. [5] R. Quian Quiroga et al. Physical Review E, 65:041903, 2002. [6] R. F. Streater. Statistical Dynamics. 
Imperial College Press, London. [7] M. L. Littman, R. S. Sutton, and S. Singh. In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Information Processing Systems 14, pp. 1555–1561. MIT Press, 2002. [8] H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge U.P., 1997. [9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991. [10] M. Palus et al. Physical Review E, 63:046211, 2001. [11] F. B. Knight. Annals of Probability, 3:573–596, 1975. [12] C. R. Shalizi and K. L. Shalizi. In M. Chickering and J. Halpern, eds., Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference, pp. 504–511. AUAI Press, 2004. [13] C. R. Shalizi and J. P. Crutchfield. Journal of Statistical Physics, 104:817–819, 2001. [14] H. Jaeger. Neural Computation, 12:1371–1398, 2000. [15] D. Ron, Y. Singer, and N. Tishby. Machine Learning, 25:117–149, 1996. [16] P. Bühlmann and A. J. Wyner. Annals of Statistics, 27:480–513, 1999. [17] N. U. Ahmed. Linear and Nonlinear Filtering for Scientists and Engineers. World Scientific, 1998. [18] D. R. Upper. PhD thesis, University of California, Berkeley, 1997. [19] E. Schneidman, S. Still, M. J. Berry, and W. Bialek. Physical Review Letters, 91:238701, 2003. [20] C. K. Chow and C. N. Liu. IEEE Transactions on Information Theory, IT-14:462–467, 1968. [21] T. H. Cormen et al. Introduction to Algorithms. 2nd ed. MIT Press, 2001. [22] S. Amari. IEEE Transactions on Information Theory, 47:1701–1711, 2001. [23] S. Kirshner, P. Smyth, and A. Robertson. Tech. Rep. 04-04, UC Irvine, Information and Computer Science, 2004. [24] M. S. Olufsen et al. Journal of Computational Neuroscience, 14:33–54, 2003.

4 0.41019252 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence

Author: Odelia Schwartz, Peter Dayan, Terrence J. Sejnowski

Abstract: The misjudgement of tilt in images lies at the heart of entertaining visual illusions and rigorous perceptual psychophysics. A wealth of ﬁndings has attracted many mechanistic models, but few clear computational principles. We adopt a Bayesian approach to perceptual tilt estimation, showing how a smoothness prior offers a powerful way of addressing much confusing data. In particular, we faithfully model recent results showing that conﬁdence in estimation can be systematically affected by the same aspects of images that affect bias. Conﬁdence is central to Bayesian modeling approaches, and is applicable in many other perceptual domains. Perceptual anomalies and illusions, such as the misjudgements of motion and tilt evident in so many psychophysical experiments, have intrigued researchers for decades.1–3 A Bayesian view4–8 has been particularly inﬂuential in models of motion processing, treating such anomalies as the normative product of prior information (often statistically codifying Gestalt laws) with likelihood information from the actual scenes presented. Here, we expand the range of statistically normative accounts to tilt estimation, for which there are classes of results (on estimation conﬁdence) that are so far not available for motion. The tilt illusion arises when the perceived tilt of a center target is misjudged (ie bias) in the presence of ﬂankers. Another phenomenon, called Crowding, refers to a loss in the conﬁdence (ie sensitivity) of perceived target tilt in the presence of ﬂankers. Attempts have been made to formalize these phenomena quantitatively. Crowding has been modeled as compulsory feature pooling (ie averaging of orientations), ignoring spatial positions.9, 10 The tilt illusion has been explained by lateral interactions11, 12 in populations of orientationtuned units; and by calibration.13 However, most models of this form cannot explain a number of crucial aspects of the data. 
First, the geometry of the positional arrangement of the stimuli affects attraction versus repulsion in bias, as emphasized by Kapadia et al14 (figure 1A), and others.15, 16 Second, Solomon et al. recently measured bias and sensitivity simultaneously.11 The rich and surprising range of sensitivities, far from flat as a function of flanker angles (figure 1B), are outside the reach of standard models. Moreover, current explanations do not offer a computational account of tilt perception as the outcome of a normative inference process.

[Figure 1: Tilt biases and sensitivities in visual perception. Panel (A) plots bias (deg), marking attraction and repulsion; panel (B) plots bias (deg) and sensitivity (1/deg) against flanker tilt (deg). (A) Kapadia et al demonstrated the importance of geometry on tilt bias, with bar stimuli in the fovea (and similar results in the periphery). When 5 degrees clockwise flankers are arranged colinearly, the center target appears attracted in the direction of the flankers; when flankers are lateral, the target appears repulsed. Data are an average of 5 subjects.14 (B) Solomon et al measured both biases and sensitivities for gratings in the visual periphery.11 On the top are example stimuli, with flankers tilted 22.5 degrees clockwise. This constitutes the classic tilt illusion, with a repulsive bias percept. In addition, sensitivities vary as a function of flanker angles, in a systematic way (even in cases when there are no biases at all). Sensitivities are given in units of the inverse of standard deviation of the tilt estimate. More detailed data for both experiments are shown in the results section.]

Here, we demonstrate that a Bayesian framework for orientation estimation, with a prior favoring smoothness, can naturally explain a range of seemingly puzzling tilt data. We explicitly consider both the geometry of the stimuli, and the issue of confidence in the estimation. Bayesian analyses have most frequently been applied to bias.
Much less attention has been paid to the equally important phenomenon of sensitivity. This aspect of our model should be applicable to other perceptual domains. In section 1 we formulate the Bayesian model. The prior is determined by the principle of creating a smooth contour between the target and ﬂankers. We describe how to extract the bias and sensitivity. In section 2 we show experimental data of Kapadia et al and Solomon et al, alongside the model simulations, and demonstrate that the model can account for both geometry, and bias and sensitivity measurements in the data. Our results suggest a more uniﬁed, rational, approach to understanding tilt perception. 1 Bayesian model Under our Bayesian model, inference is controlled by the posterior distribution over the tilt of the target element. This comes from the combination of a prior favoring smooth conﬁgurations of the ﬂankers and target, and the likelihood associated with the actual scene. A complete distribution would consider all possible angles and relative spatial positions of the bars, and marginalize the posterior over all but the tilt of the central element. For simplicity, we make two benign approximations: conditionalizing over (ie clamping) the angles of the ﬂankers, and exploring only a small neighborhood of their positions. We now describe the steps of inference. Smoothness prior: Under these approximations, we consider a given actual conﬁguration (see ﬁg 2A) of ﬂankers f1 = (φ1 , x1 ), f2 = (φ2 , x2 ) and center target c = (φc , xc ), arranged from top to bottom. We have to generate a prior over φc and δ1 = x1 − xc and δ2 = x2 − xc based on the principle of smoothness. As a less benign approximation, we do this in two stages: articulating a principle that determines a single optimal conﬁguration; and generating a prior as a mixture of a Gaussian about this optimum and a uniform distribution, with the mixing proportion of the latter being determined by the smoothness of the optimum. 
[Figure 2: Geometry and smoothness for flankers, f1 and f2, and center target, c. (A) Example actual configuration of flankers and target, aligned along the y axis from top to bottom. (B) The elastica procedure can rotate the target angle (to Φc) and shift the relative flanker and target positions on the x axis (to δ1 and δ2) in its search for the maximally smooth solution. Small spatial shifts (up to 1/15 the size of R) of positions are allowed, but positional shift is overemphasized in the figure for visibility. (C) Top: center tilt that results in maximal smoothness, as a function of flanker tilt. Boxed cartoons show examples for given flanker tilts, of the optimally smooth configuration. Note attraction of target towards flankers for small flanker angles; here flankers and target are positioned in a nearly colinear arrangement. Note also repulsion of target away from flankers for intermediate flanker angles. Bottom: P[c, f1, f2] for center tilt that yields maximal smoothness. The y axis is normalized between 0 and 1.]

Smoothness has been extensively studied in the computer vision literature.17–20 One widely used principle, elastica, known even to Euler, has been applied to contour completion21 and other computer vision applications.17 The basic idea is to find the curve with minimum energy (ie, square of curvature). Sharon et al19 showed that the elastica function can be well approximated by a number of simpler forms.
We adopt a version that Leung and Malik18 adopted from Sharon et al.19 We assume that the probability for completing a smooth curve can be factorized into two terms:

$$P[c, f_1, f_2] = G(c, f_1)\, G(c, f_2) \qquad (1)$$

with the term G(c, f1) (and similarly, G(c, f2)) written as:

$$G(c, f_1) = \exp\left(-\frac{R}{\sigma_R} - \frac{D_\beta}{\sigma_\beta}\right), \qquad D_\beta = \beta_1^2 + \beta_c^2 - \beta_1 \beta_c \qquad (2)$$

where β1 (and similarly, βc) is the angle between the orientation at f1 and the line joining f1 and c. The distance between the centers of f1 and c is given by R. The two constants, σβ and σR, control the relative contribution to smoothness of the angle versus the spatial distance. Here, we set σβ = 1, and σR = 1.5. Figure 2B illustrates an example geometry, in which φc, δ1, and δ2 have been shifted from the actual scene (of figure 2A).

We now estimate the smoothest solution for given configurations. Figure 2C shows for given flanker tilts, the center tilt that yields maximal smoothness, and the corresponding probability of smoothness. For near vertical flankers, the spatial lability leads to very weak attraction and high probability of smoothness. As the flanker angle deviates farther from vertical, there is a large repulsion, but also lower probability of smoothness. These observations are key to our model: the maximally smooth center tilt will influence attractive and repulsive interactions of tilt estimation; the probability of smoothness will influence the relative weighting of the prior versus the likelihood.

[Figure 3: Bayes model for example flankers and target. Panels (A) to (C) plot angle against position; panel (D) plots probability against angle; panel (E) plots probability clockwise against target angle (deg). (A) Prior 2D distribution for flankers set at 22.5 degrees (note repulsive preference for -5.5 degrees). (B) Likelihood 2D distribution for a target tilt of 3 degrees; (C) Posterior 2D distribution. All 2D distributions are drawn on the same grayscale range, and the presence of a larger baseline in the prior causes it to appear more dimmed. (D) Marginalized posterior, resulting in 1D distribution over tilt. Dashed line represents the mean, with slight preference for negative angle. (E) For this target tilt, we calculate probability clockwise, and obtain one point on psychometric curve.]

From the smoothness principle, we construct a two dimensional prior (figure 3A). One dimension represents tilt, the other dimension, the overall positional shift between target and flankers (called 'position'). The prior is a 2D Gaussian distribution, sitting on a constant baseline.22 The Gaussian is centered at the estimated smoothest target angle and relative position, and the baseline is determined by the probability of smoothness. The baseline, and its dependence on the flanker orientation, is a key difference from Weiss et al's Gaussian prior for smooth, slow motion. It can be seen as a mechanism to allow segmentation (see Posterior description below). The standard deviation of the Gaussian is a free parameter.

Likelihood: The likelihood over tilt and position (figure 3B) is determined by a 2D Gaussian distribution with an added baseline.22 The Gaussian is centered at the actual target tilt; and at a position taken as zero, since this is the actual position, to which the prior is compared. The standard deviation and baseline constant are free parameters.

Posterior and marginalization: The posterior comes from multiplying likelihood and prior (figure 3C) and then marginalizing over position to obtain a 1D distribution over tilt. Figure 3D shows an example in which this distribution is bimodal.
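The prior, likelihood, and marginalization steps can be sketched numerically on a grid. This is a minimal illustration under stated assumptions, not the authors' code: the baseline is treated as the mixing weight of a uniform component, the function names are hypothetical, and the widths and baseline values are made up. Only the prior peak of -5.5 degrees and the target tilt of 3 degrees are taken from the Figure 3 caption.

```python
import numpy as np


def gaussian_on_baseline(angles, positions, mu_angle, mu_pos,
                         sd_angle, sd_pos, baseline):
    """2D Gaussian over (angle, position) mixed with a uniform baseline.

    `baseline` is the mixing weight of the uniform part (0 = pure Gaussian).
    The returned grid sums to 1.
    """
    a = angles[:, None]      # angle varies along rows
    p = positions[None, :]   # position varies along columns
    g = np.exp(-0.5 * ((a - mu_angle) / sd_angle) ** 2
               - 0.5 * ((p - mu_pos) / sd_pos) ** 2)
    g /= g.sum()
    uniform = np.full_like(g, 1.0 / g.size)
    return (1.0 - baseline) * g + baseline * uniform


def marginal_tilt_posterior(prior, likelihood):
    """Multiply prior and likelihood, normalise, and sum out position."""
    post = prior * likelihood
    post /= post.sum()
    return post.sum(axis=1)


def summarise(angles, marginal):
    """Return the posterior mean angle and the posterior height at that angle."""
    mean_angle = float(np.sum(angles * marginal))
    height = float(np.interp(mean_angle, angles, marginal))
    return mean_angle, height


# Illustrative values only: widths and baselines below are assumptions.
angles = np.linspace(-20.0, 20.0, 161)
positions = np.linspace(-0.2, 0.2, 41)
prior = gaussian_on_baseline(angles, positions, -5.5, 0.05, 4.0, 0.05, 0.5)
likelihood = gaussian_on_baseline(angles, positions, 3.0, 0.0, 3.0, 0.05, 0.2)
marginal = marginal_tilt_posterior(prior, likelihood)
mean_angle, height = summarise(angles, marginal)
```

With both baselines set to 0 the product of the two Gaussians is again Gaussian, so the marginal has a single peak; a second mode can only appear once a baseline is present. Note that on a discrete grid the height at the mean depends on the bin width, so its scale here is specific to this sketch.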
Other likelihoods, with closer agreement between target and smooth prior, give unimodal distributions. Note that the bimodality is a direct consequence of having an added baseline to the prior and likelihood (if these were Gaussian without a baseline, the posterior would always be Gaussian). The viewer is effectively assessing whether the target is associated with the same object as the flankers, and this is reflected in the baseline, and consequently, in the bimodality, and confidence estimate. We define α as the mean angle of the 1D posterior distribution (eg, value of dashed line on the x axis), and β as the height of the probability distribution at that mean angle (eg, height of dashed line). The term β is an indication of confidence in the angle estimate, where for larger values we are more certain of the estimate.

Decision of probability clockwise: The probability of a clockwise tilt is estimated from the marginalized posterior:

$$P = \frac{1}{1 + \exp\left(\dfrac{-\alpha k}{-\log(\beta + \eta)}\right)} \qquad (3)$$

where α and β are defined as above, k is a free parameter and η a small constant. Free parameters are set to a single constant value for all flanker and center configurations. Weiss et al use a similar compressive nonlinearity, but without the term β. We also tried a decision function that integrates the posterior, but the resulting curves were far from the sigmoidal nature of the data.

Bias and sensitivity: For one target tilt, we generate a single probability and therefore a single point on the psychometric function relating tilt to the probability of choosing clockwise.
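The decision stage can be sketched as a single function. The sketch reads Eq. 3 as a logistic function of α whose slope is k divided by −log(β + η); that reading, the function name `prob_clockwise`, and the default value of η are assumptions, and the rule only behaves as intended when β + η lies strictly between 0 and 1.

```python
import math


def prob_clockwise(alpha, beta, k, eta=1e-6):
    """Logistic decision rule: P = 1 / (1 + exp(-alpha * k / -log(beta + eta))).

    alpha: mean angle of the marginal posterior (positive = clockwise).
    beta:  height of the marginal posterior at that mean (confidence).
    k:     free slope parameter; eta: small constant (value assumed here).
    Assumes 0 < beta + eta < 1, so that -log(beta + eta) is positive.
    """
    slope = k / (-math.log(beta + eta))
    return 1.0 / (1.0 + math.exp(-alpha * slope))
```

Under this reading a larger β (a taller posterior at its mean) steepens the logistic, so the same α gives a response further from 0.5. Evaluating the function over a range of target tilts would trace out one psychometric curve.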
We generate the full psychometric curve from all target tilts and fit to it a cumulative Gaussian distribution N(µ, σ) (figure 3E). The mean µ of the fit corresponds to the bias, and 1/σ to the sensitivity, or confidence in the bias. The fit to a cumulative Gaussian and extraction of these parameters exactly mimic psychophysical procedures.11

[Figure 4: Kapadia et al data,14 versus Bayesian model. Each panel plots frequency responding clockwise against target tilt (deg). Solid lines are fits to a cumulative Gaussian distribution. (A) Flankers are tilted 5 degrees clockwise (black curve) or anti-clockwise (gray) of vertical, and positioned spatially in a colinear arrangement. The center bar appears tilted in the direction of the flankers (attraction), as can be seen by the attractive shift of the psychometric curve. The boxed stimuli cartoon illustrates a vertical target amidst the flankers. (B) Model for colinear bars also produces attraction. (C) Data and (D) model for lateral flankers results in repulsion. All data are collected in the fovea for bars.]

2 Results: data versus model

We first consider the geometry of the center and flanker configurations, modeling the full psychometric curve for colinear and parallel flanks (recall that figure 1A showed summary biases). Figure 4A;B demonstrates attraction in the data and model; that is, the psychometric curve is shifted towards the flanker, because of the nature of smooth completions for colinear flankers. Figure 4C;D shows repulsion in the data and model. In this case, the flankers are arranged laterally instead of colinearly. The smoothest solution in the model arises by shifting the target estimate away from the flankers.
This shift is rather minor, because the conﬁguration has a low probability of smoothness (similar to ﬁgure 2C), and thus the prior exerts only a weak effect. The above results show examples of changes in the psychometric curve, but do not address both bias and, particularly, sensitivity, across a whole range of ﬂanker conﬁgurations. Figure 5 depicts biases and sensitivity from Solomon et al, versus the Bayes model. The data are shown for a representative subject, but the qualitative behavior is consistent across all subjects tested. In ﬁgure 5A, bias is shown, for the condition that both ﬂankers are tilted at the same angle. The data exhibit small attraction at near vertical ﬂanker angles (this arrangement is close to colinear); large repulsion at intermediate ﬂanker angles of 22.5 and 45 degrees from vertical; and minimal repulsion at large angles from vertical. This behavior is also exhibited in the Bayes model (Figure 5B). For intermediate ﬂanker angles, the smoothest solution in the model is repulsive, and the effect of the prior is strong enough to induce a signiﬁcant repulsion. For large angles, the prior exerts almost no effect. Interestingly, sensitivity is far from ﬂat in both data and model. In the data (Figure 5C), there is most loss in sensitivity at intermediate ﬂanker angles of 22.5 and 45 degrees (ie, the subject is less certain); and sensitivity is higher for near vertical or near horizontal ﬂankers. The model shows the same qualitative behavior (Figure 5D). In the model, there are two factors driving sensitivity: one is the probability of completing a smooth curvature for a given ﬂanker conﬁguration, as in Figure 2B; this determines the strength of the prior. 
The other factor is certainty in a particular center estimation; this is determined by β, derived from the posterior distribution, and incorporated into the decision stage of the model (equation 3). For flankers that are far from vertical, the prior has minimal effect because one cannot find a smooth solution (eg, the likelihood dominates), and thus sensitivity is higher.

[Figure 5: Solomon et al data11 (subject FF), versus Bayesian model. Panels plot bias (deg) or sensitivity (1/deg) against flanker tilt (deg), with data in the left column and model in the right. (A) Data and (B) model biases with same-tilted flankers; (C) Data and (D) model sensitivities with same-tilted flankers; (E;G) data and (F;H) model as above, but for opposite-tilted flankers (note that opposite-tilted data was collected for fewer flanker angles). Each point in the figure is derived by fitting a cumulative Gaussian distribution N(µ, σ) to the corresponding psychometric curve, and setting bias equal to µ and sensitivity to 1/σ. In all experiments, flanker and target gratings are presented in the visual periphery. Both data and model stimuli are averages of two configurations, on the left hand side (9 O'clock position) and right hand side (3 O'clock position). The configurations are similar to Figure 1 (B), but slightly shifted according to an iso-eccentric circle, so that all stimuli are similarly visible in the periphery.]
The low sensitivity at intermediate angles arises because the prior has considerable effect; and there is conflict between the prior (tilt, position), and likelihood (tilt, position). This leads to uncertainty in the target angle estimation. For flankers near vertical, the prior exerts a strong effect; but there is less conflict between the likelihood and prior estimates (tilt, position) for a vertical target. This leads to more confidence in the posterior estimate, and therefore, higher sensitivity. The only aspect that our model does not reproduce is the (more subtle) sensitivity difference between 0 and +/- 5 degree flankers. Figure 5E-H depict data and model for opposite tilted flankers. The bias is now close to zero in the data (Figure 5E) and model (Figure 5F), as would be expected (since the maximally smooth angle is now always roughly vertical). Perhaps more surprisingly, the sensitivities continue to be non-flat in the data (Figure 5G) and model (Figure 5H). This behavior arises in the model due to the strength of prior, and positional uncertainty. As before, there is most loss in sensitivity at intermediate angles. Note that to fit Kapadia et al, simulations used a constant parameter of k = 9 in equation 3, whereas for the Solomon et al. simulations, k = 2.5. This indicates that, in our model, there was higher confidence in the foveal experiments than in the peripheral ones.

3 Discussion

We applied a Bayesian framework to the widely studied tilt illusion, and demonstrated the model on examples from two different data sets involving foveal and peripheral estimation. Our results support the appealing hypothesis that perceptual misjudgements are not a consequence of poor system design, but rather can be described as optimal inference.4–8 Our model accounts correctly for both attraction and repulsion, determined by the smoothness prior and the geometry of the scene. We emphasized the issue of estimation confidence.
The dataset showing how conﬁdence is affected by the same issues that affect bias,11 was exactly appropriate for a Bayesian formulation; other models in the literature typically do not incorporate conﬁdence in a thoroughly probabilistic manner. In fact, our model ﬁts the conﬁdence (and bias) data more proﬁciently than an account based on lateral interactions among a population of orientationtuned cells.11 Other Bayesian work, by Stocker et al,6 utilized the full slope of the psychometric curve in ﬁtting a prior and likelihood to motion data, but did not examine the issue of conﬁdence. Estimation conﬁdence plays a central role in Bayesian formulations as a whole. Understanding how priors affect conﬁdence should have direct bearing on many other Bayesian calculations such as multimodal integration.23 Our model is obviously over-simpliﬁed in a number of ways. First, we described it in terms of tilts and spatial positions; a more complete version should work in the pixel/ﬁltering domain.18, 19 We have also only considered two ﬂanking elements; the model is extendible to a full-ﬁeld surround, whereby smoothness operates along a range of geometric directions, and some directions are more (smoothly) dominant than others. Second, the prior is constructed by summarizing the maximal smoothness information; a more probabilistically correct version should capture the full probability of smoothness in its prior. Third, our model does not incorporate a formal noise representation; however, sensitivities could be inﬂuenced both by stimulus-driven noise and conﬁdence. Fourth, our model does not address attraction in the so-called indirect tilt illusion, thought to be mediated by a different mechanism. Finally, we have yet to account for neurophysiological data within this framework, and incorporate constraints at the neural implementation level. 
However, versions of our computations are oft suggested for intra-areal and feedback cortical circuits; and smoothness principles form a key part of the association field connection scheme in Li's [24] dynamical model of contour integration in V1.

Our model is connected to a wealth of literature in computer vision and perception. Notably, occlusion and contour completion might be seen as the extreme example in which there is no likelihood information at all for the center target; a host of papers have shown that under these circumstances, smoothness principles such as elastica and variants explain many aspects of perception. The model is also associated with many studies on contour integration motivated by Gestalt principles [25, 26], and exploration of natural scene statistics and Gestalt [27, 28], including the relation to contour grouping within a Bayesian framework [29, 30]. Indeed, our model could be modified to include a prior from natural scenes.

There are various directions for the experimental test and refinement of our model. Most pressing is to determine bias and sensitivity for different center and flanker contrasts. As in the case of motion, our model predicts that when there is more uncertainty in the center element, prior information is more dominant. Another interesting test would be to design a task such that the center element is actually part of a different figure and unrelated to the flankers; our framework predicts that there would be minimal bias, because of segmentation. Our model should also be applied to other tilt-based illusions such as the Fraser spiral and Zöllner. Finally, our model can be applied to other perceptual domains [31]; and given the apparent similarities between the tilt illusion and the tilt after-effect, we plan to extend the model to adaptation, by considering smoothness in time as well as space.

Acknowledgements

This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD).
We are very grateful to Serge Belongie, Leanne Chukoskie, Philip Meier and Joshua Solomon for helpful discussions.

References

[1] J J Gibson. Adaptation, after-effect, and contrast in the perception of tilted lines. Journal of Experimental Psychology, 20:553–569, 1937.
[2] C Blakemore, R H S Carpenter, and M A Georgeson. Lateral inhibition between orientation detectors in the human visual system. Nature, 228:37–39, 1970.
[3] J A Stuart and H M Burian. A study of separation difficulty: Its relationship to visual acuity in normal and amblyopic eyes. American Journal of Ophthalmology, 53:471–477, 1962.
[4] A Yuille and H H Bulthoff. Perception as bayesian inference. In Knill and Whitman, editors, Bayesian decision theory and psychophysics, pages 123–161. Cambridge University Press, 1996.
[5] Y Weiss, E P Simoncelli, and E H Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5:598–604, 2002.
[6] A Stocker and E P Simoncelli. Constraining a bayesian model of human visual speed perception. Adv in Neural Info Processing Systems, 17, 2004.
[7] D Kersten, P Mamassian, and A Yuille. Object perception as bayesian inference. Annual Review of Psychology, 55:271–304, 2004.
[8] K Kording and D Wolpert. Bayesian integration in sensorimotor learning. Nature, 427:244–247, 2004.
[9] L Parkes, J Lund, A Angelucci, J Solomon, and M Morgan. Compulsory averaging of crowded orientation signals in human vision. Nature Neuroscience, 4:739–744, 2001.
[10] D G Pelli, M Palomares, and N J Majaj. Crowding is unlike ordinary masking: Distinguishing feature integration from detection. Journal of Vision, 4:1136–1169, 2002.
[11] J Solomon, F M Felisberti, and M Morgan. Crowding and the tilt illusion: Toward a unified account. Journal of Vision, 4:500–508, 2004.
[12] J A Bednar and R Miikkulainen. Tilt aftereffects in a self-organizing model of the primary visual cortex. Neural Computation, 12:1721–1740, 2000.
[13] C W Clifford, P Wenderoth, and B Spehar. A functional angle on some after-effects in cortical vision. Proc Biol Sci, 1454:1705–1710, 2000.
[14] M K Kapadia, G Westheimer, and C D Gilbert. Spatial distribution of contextual interactions in primary visual cortex and in visual perception. J Neurophysiology, 4:2048–2062, 2000.
[15] C C Chen and C W Tyler. Lateral modulation of contrast discrimination: Flanker orientation effects. Journal of Vision, 2:520–530, 2002.
[16] I Mareschal, M P Sceniak, and R M Shapley. Contextual influences on orientation discrimination: binding local and global cues. Vision Research, 41:1915–1930, 2001.
[17] D Mumford. Elastica and computer vision. In Chandrajit Bajaj, editor, Algebraic geometry and its applications. Springer Verlag, 1994.
[18] T K Leung and J Malik. Contour continuity in region based image segmentation. In Proc. ECCV, pages 544–559, 1998.
[19] E Sharon, A Brandt, and R Basri. Completion energies and scale. IEEE Pat. Anal. Mach. Intell., 22(10), 1997.
[20] S W Zucker, C David, A Dobbins, and L Iverson. The organization of curve detection: coarse tangent fields. Computer Graphics and Image Processing, 9(3):213–234, 1988.
[21] S Ullman. Filling in the gaps: the shape of subjective contours and a model for their generation. Biological Cybernetics, 25:1–6, 1976.
[22] G E Hinton and A D Brown. Spiking boltzmann machines. Adv in Neural Info Processing Systems, 12, 1998.
[23] R A Jacobs. What determines visual cue reliability? Trends in Cognitive Sciences, 6:345–350, 2002.
[24] Z Li. A saliency map in primary visual cortex. Trends in Cognitive Science, 6:9–16, 2002.
[25] D J Field, A Hayes, and R F Hess. Contour integration by the human visual system: evidence for a local “association field”. Vision Research, 33:173–193, 1993.
[26] J Beck, A Rosenfeld, and R Ivry. Line segregation. Spatial Vision, 4:75–101, 1989.
[27] M Sigman, G A Cecchi, C D Gilbert, and M O Magnasco. On a common circle: Natural scenes and gestalt rules. PNAS, 98(4):1935–1940, 2001.
[28] S Mahamud, L R Williams, K K Thornber, and K Xu. Segmentation of multiple salient closed contours from real images. IEEE Pat. Anal. Mach. Intell., 25(4):433–444, 1997.
[29] W S Geisler, J S Perry, B J Super, and D P Gallogly. Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 6:711–724, 2001.
[30] J H Elder and R M Goldberg. Ecological statistics of gestalt laws for the perceptual organization of contours. Journal of Vision, 4:324–353, 2002.
[31] S R Lehky and T J Sejnowski. Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. Journal of Neuroscience, 10:2281–2299, 1990.

5 0.40957624 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T-stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-shelf methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem.

6 0.40719607 153 nips-2005-Policy-Gradient Methods for Planning

7 0.40200397 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

8 0.39786997 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

9 0.39464402 144 nips-2005-Off-policy Learning with Options and Recognizers

10 0.39439842 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

11 0.39420101 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

12 0.39213687 36 nips-2005-Bayesian models of human action understanding

13 0.39125299 100 nips-2005-Interpolating between types and tokens by estimating power-law generators

14 0.39091617 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

15 0.38900641 108 nips-2005-Layered Dynamic Textures

16 0.38720629 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

17 0.38685998 45 nips-2005-Conditional Visual Tracking in Kernel Space

18 0.38525817 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

19 0.38500723 30 nips-2005-Assessing Approximations for Gaussian Process Classification

20 0.38445213 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery