nips nips2004 nips2004-155 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fredrik Bissmarck, Hiroyuki Nakahara, Kenji Doya, Okihide Hikosaka
Abstract: Motor control depends on sensory feedback in multiple modalities with different latencies. In this paper we consider within the framework of reinforcement learning how different sensory modalities can be combined and selected for real-time, optimal movement control. We propose an actor-critic architecture with multiple modules, whose outputs are combined using a softmax function. We tested our architecture in a simulation of a sequential reaching task. Reaching was initially guided by visual feedback with a long latency. Our learning scheme allowed the agent to utilize the somatosensory feedback with shorter latency when the hand is near the experienced trajectory. In simulations with different latencies for visual and somatosensory feedback, we found that the agent depended more on feedback with shorter latency.
Reference: text
sentIndex sentText sentNum sentScore
1 Responding to modalities with different latencies Fredrik Bissmarck Computational Neuroscience Labs ATR International Hikari-dai 2-2-2, Seika, Soraku Kyoto 619-0288 JAPAN xfredrik@atr. [sent-1, score-0.149]
2 gov Abstract Motor control depends on sensory feedback in multiple modalities with different latencies. [sent-9, score-0.362]
3 In this paper we consider within the framework of reinforcement learning how different sensory modalities can be combined and selected for real-time, optimal movement control. [sent-10, score-0.266]
4 We propose an actor-critic architecture with multiple modules, whose outputs are combined using a softmax function. [sent-11, score-0.14]
5 We tested our architecture in a simulation of a sequential reaching task. [sent-12, score-0.143]
6 Reaching was initially guided by visual feedback with a long latency. [sent-13, score-0.398]
7 Our learning scheme allowed the agent to utilize the somatosensory feedback with shorter latency when the hand is near the experienced trajectory. [sent-14, score-0.792]
8 In simulations with different latencies for visual and somatosensory feedback, we found that the agent depended more on feedback with shorter latency. [sent-15, score-0.691]
9 1 Introduction For motor response, the brain relies on several modalities. [sent-16, score-0.39]
10 For example, vision keeps us updated on external world events, while somatosensation gives us detailed information about the state of the motor system. [sent-18, score-0.39]
11 For example, information may be perceived faster by the somatosensory pathway than the visual. [sent-21, score-0.147]
12 For quick responses it would be reasonable to expect that the modality with the shorter latency is more important. [sent-22, score-0.346]
13 The slower modality would be useful if it carries additional information, for example when we have to attend to a visual cue. [sent-23, score-0.322]
14 There has been a lot of research on modular organisation where each module is an expert of a particular part of the state space (e. [sent-24, score-0.595]
15 We address questions concerning modules with different feedback delays, and how they are used for real-time motor control. [sent-27, score-0.78]
16 How does the latency affect the influence of a modality over action? [sent-28, score-0.224]
17 Here, we propose an actor-critic framework, where modules compete for influence over action by reinforcement. [sent-30, score-0.225]
18 2 General framework This section describes the generic concepts of our model: a set of modules with delayed feedback, a function for combining them and a learning algorithm. [sent-33, score-0.211]
19 1 Network architecture Consider M modules, where each module has its own feedback signal y^m(x(t − τ^m)) (m = 1, 2, . . . , M). [sent-35, score-0.933]
20 Each module has a corresponding time delay τ m (see figure 1). [sent-38, score-0.622]
21 (The same feedback signals are used to compute the critic, see the next subsection). [sent-39, score-0.22]
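For concreteness, a minimal Python sketch (not from the paper) of how per-module feedback delays can be implemented: past states are kept in a ring buffer and each module reads the state from τ^m seconds ago. The 10 ms step matches the 100 Hz action sampling mentioned later in the text; the state dimension and the class interface are illustrative assumptions.

```python
from collections import deque

import numpy as np


class DelayedFeedback:
    """Serves each module the state observed tau_m seconds ago."""

    def __init__(self, delays_s, dt=0.01, state_dim=4):
        self.dt = dt
        self.delays = [int(round(d / dt)) for d in delays_s]   # delay in steps per module
        maxlen = max(self.delays) + 1
        # Pre-fill the buffer so early queries return the initial state.
        self.buffer = deque([np.zeros(state_dim)] * maxlen, maxlen=maxlen)

    def push(self, state):
        self.buffer.append(np.asarray(state, dtype=float))

    def observe(self, m):
        """Return x(t - tau_m) for module m (indexed from the end of the buffer)."""
        return self.buffer[-1 - self.delays[m]]


# Example: visual delay 100 ms, motor delay 0 ms, 10 ms control step (100 Hz).
fb = DelayedFeedback(delays_s=[0.10, 0.0], dt=0.01)
for t in range(20):
    fb.push(np.array([t, t, 0.0, 0.0]))   # stand-in for the true arm state
print(fb.observe(0), fb.observe(1))        # the visual module sees an older state than the motor module
```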
22 Each module outputs a population-coded output a^m(t), where each element a^m_j (j = 1, 2, . . . , J) corresponds to the motor output vector u_j, which represents, for example, joint torques. [sent-40, score-0.566] [sent-42, score-0.128]
24 The output of an actor is given by a function approximator am (t) = f(ym (t − τ m ); wm ) with parameters wm . [sent-43, score-0.258]
25 The actual motor command u ∈ R^D is given by a combination of the population vector outputs a^m of the modules. [sent-45, score-0.502]
26 The probability of taking the j-th motor output vector is given by π_j(t) = exp(β Σ_{m=1..M} a^m_j) / Σ_{j'=1..J} exp(β Σ_{m=1..M} a^m_{j'}), where β is the inverse temperature, controlling the stochasticity. [sent-47, score-0.431]
27 At each moment, one of the motor command vectors is selected as p(u(t) = uj ) = πj (t). [sent-48, score-0.513]
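A minimal sketch of this combination and selection rule, assuming each module's output a^m is already a length-J vector. The values J = 24 and β = 10 follow the setup described later in the text; the random module outputs are purely illustrative.

```python
import numpy as np


def combined_policy(module_outputs, beta=10.0):
    """Softmax over the summed population-coded outputs of all modules.

    module_outputs: array of shape (M, J), one row per module.
    Returns pi of shape (J,), the probability of each motor command vector u_j.
    """
    logits = beta * module_outputs.sum(axis=0)   # beta * sum_m a^m_j
    logits -= logits.max()                       # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()


def sample_action(pi, rng):
    """Pick j with p(u(t) = u_j) = pi_j(t)."""
    return rng.choice(len(pi), p=pi)


rng = np.random.default_rng(0)
a = rng.random((2, 24))            # two modules, 24 candidate torque vectors (5x5 grid minus centre)
pi = combined_policy(a, beta=10.0)
j = sample_action(pi, rng)
print(j, pi[j])
```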
28 There is no explicit mechanism in the architecture that favours a module with a shorter latency. [sent-50, score-0.768]
29 Instead, we test whether a reinforcement learning algorithm can learn to select the modules which are more useful to the agent. [sent-51, score-0.214]
30 The critic output is computed from the same delayed feedback signals of all modules, V(t) = V(y^1(t − τ^1), . . . , y^M(t − τ^M); w^c), where w^c is a set of trainable parameters. [sent-58, score-0.149]
31 Learning of each actor module is guided by the action deviation signal E_j(t) = (q_j(t) − π_j(t))^2 / 2, which is the squared difference between its output and the action that was actually selected. [sent-62, score-0.839]
32 Parameters of the critic and actors are updated using eligibility traces, ė^c_k(t) = −(1/κ) e^c_k + ∂V/∂w^c_k and ė^m_kj(t) = −(1/κ) e^m_kj + ∂E_j(t)/∂w^m_kj, where k is the index of parameters and κ is a time constant. [sent-63, score-0.383]
33 The trace for the m-th actor is given by ∂E_j(t)/∂w^m_kj = (q_j(t) − π_j(t)) ∂π_j(t)/∂w^m_kj. The parameters are updated by gradient descent as ẇ^c_k = α δ^TD(t) e^c_k(t) and ẇ^m_kj = α δ^TD(t) e^m_kj(t), where α denotes the learning rate. [sent-64, score-0.261]
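The trace and weight updates can be sketched as below for one actor module, assuming a linear actor feeding the softmax (so ∂π_j/∂w^m_kj = β π_j(1 − π_j) y^m_k), a one-hot selection vector q, and a simple Euler discretisation. The sign convention follows the gradient expression as written above, κ = 200 ms follows the simulation settings later in the text, and the learning rate is illustrative since its exact value is truncated in this excerpt; the critic update follows the same pattern with ∂V/∂w^c_k.

```python
import numpy as np


def actor_trace_step(e, w, y, pi, q, delta_td,
                     beta=10.0, kappa=0.2, alpha=1.0, dt=0.01):
    """One Euler step of the eligibility-trace and weight updates for one actor module.

    e, w : arrays of shape (K, J) -- traces and weights for this module
    y    : shape (K,)             -- delayed feedback features y^m(t - tau_m)
    pi   : shape (J,)             -- combined policy probabilities
    q    : shape (J,)             -- one-hot vector of the action actually selected
    delta_td : scalar TD error from the critic
    """
    # dpi_j/dw^m_kj for a linear actor feeding the softmax: beta * pi_j * (1 - pi_j) * y_k
    dpi_dw = beta * (pi * (1.0 - pi))[None, :] * y[:, None]   # shape (K, J)
    grad_E = (q - pi)[None, :] * dpi_dw                       # dE_j/dw^m_kj as written in the text
    e = e + dt * (-e / kappa + grad_E)                        # leaky eligibility trace
    w = w + dt * alpha * delta_td * e                         # gradient step scaled by the TD error
    return e, w


K, J = 8, 24
rng = np.random.default_rng(1)
e = np.zeros((K, J))
w = rng.normal(scale=0.01, size=(K, J))
y = rng.random(K)
pi = np.full(J, 1.0 / J)
q = np.eye(J)[3]                      # action 3 was selected
e, w = actor_trace_step(e, w, y, pi, q, delta_td=0.5)
```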
34 3 Neuroanatomical correlates Our network architecture is modeled to resemble the function of the basal ganglia-thalamocortical (BG-TC) system to select and learn actions for goal-directed movement. [sent-66, score-0.132]
35 Actor-critic models of the basal ganglia have been proposed by many (e. [sent-67, score-0.154]
36 The modular organisation of the BG-TC loop circuits ([5], [6]), where modules depend on different sensory feedback, implies that the actor-critic depends on several modules. [sent-70, score-0.24]
37 The results from these experiments suggested that the influence of the motor BG-TC loop on motor execution is relatively stronger for learned sequences than for new ones, compared to the prefrontal BG-TC loop. [sent-72, score-0.919]
38 In our model implementation, we want to investigate how the feedback delay affects the influence of visual and sensorimotor modalities when learning a stereotyped real-time motor sequence. [sent-73, score-0.978]
39 In our implementation (see figure 2), we use two modules, one “visual”, and one “motor”, corresponding to visual and somatosensory feedback respectively. [sent-74, score-0.438]
40 The visual module represents a preknown, visually guided reaching policy for arbitrary start and endpoints within reach. [sent-75, score-0.829]
41 The motor module represents the motor skill memory to be learned. [sent-77, score-1.305]
42 The controlled object is a 2DOF arm, for which the agent gives a joint torque motor command, with action selection sampled at 100 Hz. [sent-79, score-0.65]
43 If the hand of the arm satisfies a proximity condition (|ξ^target − ξ^hand| < ξ^prox and |ξ̇^hand| < v^prox), a key (target) is considered pressed, and the next target appears immediately. [sent-87, score-0.69]
44 To allow a larger possibility of modifying the movement, we have a very loose velocity constraint v^prox (for all simulations, ξ^prox = 0. [sent-88, score-0.222]
45 For each successful key press, the agent is rewarded instantaneously, with an increasing amount of reward for later keys in the sequence (50, 100, 150 respectively). [sent-92, score-0.186]
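A sketch of the key-press and reward logic as described above. The thresholds ξ^prox and v^prox and the target coordinates are placeholders, since their exact values are truncated in this excerpt; the three-key reward schedule (50, 100, 150) is taken from the text.

```python
import numpy as np

# Reward per key in the three-key sequence (from the text).
KEY_REWARDS = [50.0, 100.0, 150.0]


def key_pressed(hand_pos, hand_vel, target_pos, xi_prox=0.02, v_prox=1.0):
    """Proximity condition: hand close to the target and moving slowly enough.

    xi_prox and v_prox are placeholder values; the paper's exact numbers are not
    recoverable from this excerpt.
    """
    close = np.linalg.norm(target_pos - hand_pos) < xi_prox
    slow = np.linalg.norm(hand_vel) < v_prox
    return close and slow


def step_task(hand_pos, hand_vel, targets, key_index):
    """Return (reward, new_key_index); the next target appears immediately on a press."""
    if key_index < len(targets) and key_pressed(hand_pos, hand_vel, targets[key_index]):
        return KEY_REWARDS[key_index], key_index + 1
    return 0.0, key_index


targets = [np.array([0.10, 0.30]), np.array([0.20, 0.35]), np.array([0.15, 0.40])]
r, k = step_task(np.array([0.10, 0.30]), np.zeros(2), targets, key_index=0)
print(r, k)   # 50.0 reward, now aiming at the second key
```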
46 2 The visual module The visual module is designed as a computed torque feedback controller for simplicity. [sent-95, score-1.653]
47 It was designed to give an output as similar as possible to biological reaching movements, but we did not attempt to design the controller itself in a biologically plausible way. [sent-96, score-0.191]
48 The visual module is fed back the hand position {ξ^hand_1, ξ^hand_2} and the position of the active target {ξ^target_1, ξ^target_2}, while the motor module is fed back a population code representing the joint angles {θ1, θ2}. [sent-98, score-2.099]
49 The feedback signal y^v to the visual module consists of the hand kinematics ξ^hand, ξ̇^hand and the target position ξ^target. [sent-101, score-1.243]
50 Using a computed torque feedback control law, the visual module uses these signals to generate a reaching movement, representing the preknown motor behaviour of the agent. [sent-102, score-1.55]
51 As such a control law does not have measures to deal with delayed signals, we make the assumption that the control law relies on ξ̃^hand(t) = ξ^hand(t), i.e., the controller can predict ahead for the delay in the arm movement (the target signal is still delayed by τ^v). [sent-103, score-0.325] [sent-105, score-0.499]
53 With proper control gains K1 and K2 , the filter helps to give bell-shaped velocity profiles for the reaching movement, desirable for its resemblance to biological motion. [sent-108, score-0.125]
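A simplified stand-in for the visual module's controller: a damped spring toward the target in hand space, mapped to joint torques through the arm Jacobian. The paper's actual computed-torque law (and the smoothing filter that produces bell-shaped velocity profiles) is not fully specified in this excerpt, so the gains, link lengths and the Jacobian-transpose mapping here are assumptions, not the authors' values.

```python
import numpy as np


def jacobian_2dof(theta, l1=0.3, l2=0.3):
    """Jacobian of the hand position w.r.t. joint angles for a planar 2-link arm."""
    t1, t12 = theta[0], theta[0] + theta[1]
    return np.array([[-l1 * np.sin(t1) - l2 * np.sin(t12), -l2 * np.sin(t12)],
                     [ l1 * np.cos(t1) + l2 * np.cos(t12),  l2 * np.cos(t12)]])


def visual_module_torque(theta, hand_pos, hand_vel, target_pos, K1=50.0, K2=10.0):
    """Reach toward the target with a hand-space PD law mapped to joint torques.

    The hand state is used undelayed, following the paper's assumption that the
    controller predicts its own movement; only the target signal remains delayed.
    """
    desired_force = K1 * (target_pos - hand_pos) - K2 * hand_vel   # spring toward target, damped
    J = jacobian_2dof(theta)
    return J.T @ desired_force                                     # joint torque command u^visual


tau = visual_module_torque(np.array([0.5, 0.8]), np.array([0.20, 0.40]),
                           np.zeros(2), np.array([0.25, 0.45]))
print(tau)
```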
54 The output u^visual is then expanded to a population vector a^v_j(t) = (1/Z) exp(−(1/2) Σ_d ((u^visual_d(t) − ū_jd) / σ_jd)^2), where Z is the normalisation term, ū_jd is a preferable joint torque for Cartesian dimension d for vector element j, and σ_jd the corresponding variance. [sent-109, score-0.587]
55 The preferred joint torques ū_j corresponding to action j were distributed symmetrically over the origin in a 5x5 grid, in the range (−100:100, −100:100), with the middle (0,0) unit removed. [sent-111, score-0.186]
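A sketch of this expansion: a 5x5 grid of preferred torques with the centre removed (24 units) and a Gaussian population code around them. The single shared width σ is an illustrative simplification of the per-unit variances σ_jd described above.

```python
import numpy as np


def build_preferred_torques(limit=100.0):
    """5x5 grid of preferred joint torques over the origin, with the centre (0, 0) removed."""
    axis = np.linspace(-limit, limit, 5)
    grid = np.array([[u1, u2] for u1 in axis for u2 in axis])
    return grid[np.any(grid != 0.0, axis=1)]            # drop the (0, 0) unit -> 24 vectors


def expand_to_population(u_visual, u_pref, sigma=50.0):
    """Gaussian population code a^v_j of a torque command around the preferred torques."""
    sq = ((u_visual[None, :] - u_pref) / sigma) ** 2     # ((u_d - u_jd) / sigma)^2 per dimension
    a = np.exp(-0.5 * sq.sum(axis=1))
    return a / a.sum()                                   # normalisation term Z


u_pref = build_preferred_torques()
a_v = expand_to_population(np.array([30.0, -60.0]), u_pref)
print(len(u_pref), a_v.argmax())
```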
56 3 The motor module The motor module relies on information about the motor state of the arm. [sent-114, score-2.22]
57 In the vicinity of a target, it may be difficult to determine from the immediate motor state alone whether the hand should move towards or away from the target position. [sent-115, score-0.561]
58 The feedback elements k = 1, 2, . . . , K are partitioned by K0: the first part (k ≤ K0) represents the motor state, and the second part (k > K0) represents the context. [sent-120, score-0.39]
59 The context is represented by n = 1, . . . , N tapped delay lines (where N corresponds to the number of keys in the sequence), where each delay line has Q units. [sent-124, score-0.229]
60 For (k > K0, k ≠ K0 + Q(n − 1) + 1): ẏ^m_k(t) = −(1/τ_C) y^m_k(t) + y^m_{k−1}(t). Each delay line is initiated by the input at (k = K0 + Q(n − 1) + 1): y^m_k(t) = δ(t − τ^keypress_n), where τ^keypress_n is the instant the nth key was pressed. [sent-125, score-0.279]
61 The corresponding variances σ kd and σkd were half the distance to the closest node in each direction. [sent-132, score-0.186]
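A sketch of the context representation: one leaky tapped delay line per key, with an impulse injected at the corresponding key press, so that the activity pattern across units encodes the time elapsed since each press. Q, τ_C and the Euler time stepping are illustrative assumptions (their values are not given in this excerpt), and letting the first unit decay after the impulse is an implementation choice.

```python
import numpy as np


class ContextDelayLines:
    """N leaky tapped delay lines, one per key in the sequence, each with Q units."""

    def __init__(self, n_keys=3, q_units=5, tau_c=0.2, dt=0.01):
        self.y = np.zeros((n_keys, q_units))
        self.tau_c, self.dt = tau_c, dt

    def key_press(self, n):
        self.y[n, 0] = 1.0                      # impulse into the first unit of line n

    def step(self):
        y = self.y
        leak = -y / self.tau_c                  # -(1/tau_C) * y_k
        drive = np.zeros_like(y)
        drive[:, 1:] = y[:, :-1]                # each unit is driven by the previous one in its line
        self.y = y + self.dt * (leak + drive)
        return self.y.ravel()                   # context part of the motor module's feedback


ctx = ContextDelayLines()
ctx.key_press(0)                                # the first key was just pressed
for _ in range(30):
    context_vector = ctx.step()
print(context_vector.round(3))
```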
62 4 Simulation results We trained the model for four different feedback delay pairs (τ^v / τ^m, in ms): 100/0, 100/50, 100/100, 0/100 (β = 10, τ^TD = 200 ms, κ = 200 ms, α = 0. [sent-135, score-0.317]
63 Two properties are essential for our argument: the shortest feedback delay τ min = min(τ v , τ m ) and the relative latency ∆τ = (τ v − τ m ). [sent-138, score-0.513]
64 Figure 3: (Left) Change in performance time (running averages) across trials for different feedback delays (displayed in ms as visual/motor). [sent-139, score-0.336]
65 (Right) Example hand trajectories for the initial (gray lines) and learned (black lines) behaviour for the run with 100 ms/0 ms delay. [sent-140, score-0.261]
66 1 Final performance time depends on the shortest latency Figure 3 shows that the performance time (PT, the time it takes to complete one trial) was improved for all four simulations. [sent-142, score-0.196]
67 The final PT relates to the shortest latency τ^min: the shorter τ^min, the better the final performance. [sent-143, score-0.318]
68 However, there are three possible reasons for speedup: 1) a more deterministic (greedy) policy π, 2) a change in trajectory and 3) faster reaction by utilization of faster feedback. [sent-144, score-0.177]
69 25 s for τ v = 100 ms and τ v = 0 ms respectively, while the corresponding greedy policy PTs are 1. [sent-148, score-0.204]
70 We see that while the initial movement was directed target-by-target, the learned behaviour displays a smoothly curved movement, optimized to perform the entire sequence. [sent-153, score-0.117]
71 This is expected, as the discounted reward (determined by τ^TD) and the time cost favour fast movements over slow ones. [sent-154, score-0.179]
72 We also see that the shorter τ^min, the shorter the final PT. [sent-157, score-0.244]
73 Reason 3) is also significant: the possibility to speed up the movement is limited by τ min . [sent-158, score-0.117]
74 Figure 4: Performance after learning with typical examples of hand trajectories in a normal condition, and a condition with the visual module turned off, for agents with different feedback delay. [sent-159, score-1.036]
75 When the visual module was turned off, the agent often failed to complete the sequence in 5 s. [sent-161, score-0.734]
76 The solid lines highlight the trajectory while execution is stable, while the dashed lines show the parts when the agent is out of control. [sent-163, score-0.263]
77 2 The module with shorter latency is more influential over motor control Figure 4 shows the performance of sufficiently learned behaviour (after 125,000 trials) for two conditions: one normal (“condition 1”) and one with the visual module turned off (“condition 2”). [sent-165, score-1.926]
78 The differences in trajectories in condition 1 are marginal, but execution tends to destabilize with longer τ^min. [sent-167, score-0.238]
79 Condition 2 reveals the dependence on the visual module. [sent-168, score-0.122]
80 In the 100/0 case, the correct spatial trajectory is generated each time, but the movement is sometimes too fast, leading to overshoots at the 2nd and 3rd keys. [sent-169, score-0.193]
81 For smaller ∆τ (rightwards in figure 4) the execution becomes unstable, and in the 0/100 case the agent could never execute the movement. [sent-170, score-0.172]
82 Thus, we conclude that the faster module is more influential over motor control. [sent-172, score-0.966]
83 The adaptiveness of the motor loop also offers the motor module an advantage over the visual one. [sent-173, score-1.305]
84 5 Conclusion Our framework offers a natural way to combine modules with different feedback latencies. [sent-174, score-0.39]
85 In any particular situation, the learning algorithm will reinforce the module that is better to use. [sent-175, score-0.525]
86 When execution is fast, the module with shorter latency may be favourable, and when slow, the one with more information. [sent-176, score-0.944]
87 For example, in the vicinity of the experienced sequence, our agent utilized the somatosensory feedback to execute the movement more quickly, but once it lost control the visual feedback was needed to put the arm back on track again. [sent-177, score-1.088]
88 By using the softmax function it is possible to flexibly gate or combine module outputs. [sent-178, score-0.569]
89 Sometimes the asynchrony of modules can cause the visual and motor modules to be directed towards different targets. [sent-179, score-0.852]
90 Then it is desirable to suppress the slower module to favour the faster one, which also occurred in our example by reinforcing the motor module enough to suppress the visual one. [sent-180, score-1.613]
91 In other situations the reliability of one module may be insufficient for robust execution, making it necessary to combine modules. [sent-181, score-0.525]
92 In our 100/0 example, the slower visual module was used to assist the faster motor module to learn a skill. [sent-182, score-1.658]
93 Once the skill was acquired, the visual module was no longer necessary for skillful execution, unless something went wrong. [sent-183, score-0.786]
94 Thus, the visual module is more free to attend to other tasks. [sent-184, score-0.736]
95 When we learn to ride a bicycle, for example, we first need to attend to what we do, but once we have learned, we can attend to other things, like the surrounding traffic or a conversation. [sent-185, score-0.178]
96 Our result suggests that a longer relative latency helps to make the faster modality independent, so the slower one can be decoupled from execution after learning. [sent-186, score-0.459]
97 In the human brain, forward models are likely to have access to an efference copy of the motor command, which may be more important than the incoming feedback for fast movements [1]. [sent-187, score-0.659]
98 What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? [sent-205, score-0.154]
99 Functional architecture of basal ganglia circuits: neural substrates of parallel processing. [sent-216, score-0.209]
100 Parallel cortico-basal ganglia mechanisms for acquisition and execution of visuomotor sequences - a computational approach. [sent-222, score-0.251]
2 0.13472962 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
Author: Lavi Shpigelman, Koby Crammer, Rony Paz, Eilon Vaadia, Yoram Singer
Abstract: We devise and experiment with a dynamical kernel-based system for tracking hand movements from neural activity. The state of the system corresponds to the hand location, velocity, and acceleration, while the system’s input are the instantaneous spike rates. The system’s state dynamics is defined as a combination of a linear mapping from the previous estimated state and a kernel-based mapping tailored for modeling neural activities. In contrast to generative models, the activity-to-state mapping is learned using discriminative methods by minimizing a noise-robust loss function. We use this approach to predict hand trajectories on the basis of neural activity in motor cortex of behaving monkeys and find that the proposed approach is more accurate than both a static approach based on support vector regression and the Kalman filter. 1
3 0.13407625 33 nips-2004-Brain Inspired Reinforcement Learning
Author: Françcois Rivest, Yoshua Bengio, John Kalaska
Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task. 1
4 0.091176666 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
5 0.090040632 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
6 0.081388094 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
7 0.07773561 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
8 0.074135937 117 nips-2004-Methods Towards Invasive Human Brain Computer Interfaces
9 0.070405081 39 nips-2004-Coarticulation in Markov Decision Processes
10 0.065795735 183 nips-2004-Temporal-Difference Networks
11 0.062546365 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
12 0.062034551 24 nips-2004-Approximately Efficient Online Mechanism Design
13 0.061324451 193 nips-2004-Theories of Access Consciousness
14 0.05922642 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
15 0.055399962 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
16 0.053841319 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
17 0.053081825 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning
18 0.05294067 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture
19 0.051871877 140 nips-2004-Optimal Information Decoding from Neuronal Populations with Specific Stimulus Selectivity
20 0.050950769 123 nips-2004-Multi-agent Cooperation in Diverse Population Games
topicId topicWeight
[(0, -0.134), (1, -0.088), (2, 0.125), (3, -0.06), (4, -0.086), (5, 0.099), (6, 0.124), (7, -0.08), (8, -0.002), (9, -0.03), (10, -0.03), (11, 0.039), (12, -0.013), (13, -0.088), (14, 0.193), (15, 0.008), (16, 0.019), (17, -0.085), (18, -0.029), (19, -0.008), (20, 0.047), (21, -0.057), (22, -0.05), (23, -0.032), (24, 0.008), (25, 0.006), (26, 0.003), (27, 0.083), (28, 0.06), (29, -0.022), (30, -0.048), (31, 0.011), (32, -0.088), (33, 0.1), (34, 0.142), (35, -0.101), (36, 0.096), (37, -0.074), (38, 0.067), (39, -0.164), (40, 0.135), (41, -0.008), (42, 0.084), (43, 0.081), (44, 0.117), (45, 0.016), (46, 0.03), (47, 0.07), (48, -0.002), (49, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.97176886 155 nips-2004-Responding to Modalities with Different Latencies
Author: Fredrik Bissmarck, Hiroyuki Nakahara, Kenji Doya, Okihide Hikosaka
Abstract: Motor control depends on sensory feedback in multiple modalities with different latencies. In this paper we consider within the framework of reinforcement learning how different sensory modalities can be combined and selected for real-time, optimal movement control. We propose an actor-critic architecture with multiple modules, whose output are combined using a softmax function. We tested our architecture in a simulation of a sequential reaching task. Reaching was initially guided by visual feedback with a long latency. Our learning scheme allowed the agent to utilize the somatosensory feedback with shorter latency when the hand is near the experienced trajectory. In simulations with different latencies for visual and somatosensory feedback, we found that the agent depended more on feedback with shorter latency. 1
2 0.5956369 33 nips-2004-Brain Inspired Reinforcement Learning
Author: Françcois Rivest, Yoshua Bengio, John Kalaska
Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task. 1
3 0.53422827 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
Author: Lavi Shpigelman, Koby Crammer, Rony Paz, Eilon Vaadia, Yoram Singer
Abstract: We devise and experiment with a dynamical kernel-based system for tracking hand movements from neural activity. The state of the system corresponds to the hand location, velocity, and acceleration, while the system’s input are the instantaneous spike rates. The system’s state dynamics is defined as a combination of a linear mapping from the previous estimated state and a kernel-based mapping tailored for modeling neural activities. In contrast to generative models, the activity-to-state mapping is learned using discriminative methods by minimizing a noise-robust loss function. We use this approach to predict hand trajectories on the basis of neural activity in motor cortex of behaving monkeys and find that the proposed approach is more accurate than both a static approach based on support vector regression and the Kalman filter. 1
4 0.48956645 193 nips-2004-Theories of Access Consciousness
Author: Michael D. Colagrosso, Michael C. Mozer
Abstract: Theories of access consciousness address how it is that some mental states but not others are available for evaluation, choice behavior, and verbal report. Farah, O’Reilly, and Vecera (1994) argue that quality of representation is critical; Dehaene, Sergent, and Changeux (2003) argue that the ability to communicate representations is critical. We present a probabilistic information transmission or PIT model that suggests both of these conditions are essential for access consciousness. Having successfully modeled data from the repetition priming literature in the past, we use the PIT model to account for data from two experiments on subliminal priming, showing that the model produces priming even in the absence of accessibility and reportability of internal states. The model provides a mechanistic basis for understanding the dissociation of priming and awareness. Philosophy has made many attempts to identify distinct aspects of consciousness. Perhaps the most famous effort is Block’s (1995) delineation of phenomenal and access consciousness. Phenomenal consciousness has to do with “what it is like” to experience chocolate or a pin prick. Access consciousness refers to internal states whose content is “(1) inferentially promiscuous, i.e., poised to be used as a premise in reasoning, (2) poised for control of action, and (3) poised for rational control of speech.” (p. 230) The scientific study of consciousness has exploded in the past six years, and an important catalyst for this explosion has been the decision to focus on the problem of access consciousness: how is it that some mental states but not others become available for evaluation, choice behavior, verbal report, and storage in working memory. Another reason for the recent explosion of consciousness research is the availability of functional imaging techniques to explore differences in brain activation between conscious and unconscious states, as well as the development of clever psychological experiments that show that a stimulus that is not consciously perceived can nonetheless influence cognition, which we describe shortly. 1 Subliminal Priming The phenomena we address utilize an experimental paradigm known as repetition priming. Priming refers to an improvement in efficiency in processing a stimulus item as a result of previous exposure to the item. Efficiency is defined in terms of shorter response times, lower error rates, or both. A typical long-term perceptual priming experiment consists of a study phase during which participants are asked to read aloud a list of words, and a test phase during which participants must name or categorize a series of words, presented one at a time. Reaction time is lower and/or accuracy is higher for test words that were also on the study list. Repetition priming occurs without strategic effort on the part of participants, and therefore appears to be a low level mechanism of learning, which likely serves as the mechanism underlying the refinement of cognitive skills with practice. In traditional studies, priming is supraliminal—the prime is consciously perceived. In the studies we model here, primes are subliminal. Subliminal priming addresses fundamental issues concerning conscious access: How is it that a word or image that cannot be identified, detected, or even discriminated in forced choice can nonetheless influence the processing of a subsequent stimulus word? 
Answering this question in a computational framework would be a significant advance toward understanding the nature of access consciousness. 2 Models of Conscious and Unconscious Processing In contrast to the wealth of experimental data, and the large number of speculative and philosophical papers on consciousness, concrete computational models are rare. The domain of consciousness is particularly ripe for theoretical perspectives, because it is a significant contribution to simply provide an existence proof of a mechanism that can explain specific experimental data. Ordinarily, a theorist faces skepticism when presenting a model; it often seems that hundreds of alternative, equally plausible accounts must exist. However, when addressing data deemed central to issues of consciousness, simply providing a concrete handle on the phenomena serves to demystify consciousness and bring it into the realm of scientific understanding. We are familiar with only three computational models that address specific experimental data in the domain of consciousness. We summarize these models, and then present a novel model and describe its relationship to the previous efforts. Farah, O’Reilly, and Vecera (1994) were the first to model specific phenomena pertaining to consciousness in a computational framework. The phenomena involve prosopagnosia, a deficit of overt face recognition following brain damage. Nonetheless, prosopagnosia patients exhibit residual covert recognition by a variety of tests. For example, when patients are asked to categorize names as famous or nonfamous, their response times are faster to a famous name when the name is primed by a picture of a semantically related face (e.g., the name “Bill Clinton” when preceded by a photograph of Hillary), despite the fact that they could not identify the related face. Farah et al. model face recognition in a neural network, and show that when the network is damaged, it loses the ability to perform tasks requiring high fidelity representations (e.g., identification) but not tasks requiring only coarse information (e.g., semantic priming). They argue that conscious perception is associated with a certain minimal quality of representation. Dehaene and Naccache (2001) outline a framework based on Baars’ (1989) notion of conscious states as residing in a global workspace. They describe the workspace as a “distributed neural system...with long-distance connectivity that can potentially interconnect multiple specialized brain areas in a coordinated, though variable manner.” (p. 13) Dehaene, Sergent, and Changeaux (2003) implement this framework in a complicated architecture of integrate-and-fire neurons and show that the model can qualitatively account for the attentional blink phenomenon. The attentional blink is observed in experiments where participants are shown a rapid series of stimuli, which includes two targets (T1 and T2). If T2 appears shortly after T1, the ability to report T2 drops, as if attention is distracted. Dehane et al. explain this phenomenon as follows. When T1 is presented, its activation propagates to frontal cortical areas (the global workspace). Feedback connections lead to a resonance between frontal and posterior areas, which strengthen T1 but block T2 from entering the workspace. If the T1-T2 lag is sufficiently great, habituation of T1 sufficiently weakens the representation such that T2 can enter the workspace and suppress T1. In this account, conscious access is achieved via resonance between posterior and frontal areas. 
Although the Farah et al. and Dehaene et al. models might not seem to have much in common, they both make claims concerning what is required to achieve functional connectivity between perceptual and response systems. Farah et al. focus on aspects of the representation; Dehaene et al. focus on a pathway through which representations can be communicated. These two aspects are not incompatible, and in fact, a third model incorporates both. Mathis and Mozer (1996) describe an architecture with processing modules for perceptual and response processes, implemented as attractor neural nets. They argue that in order for a representation in some perceptual module to be assured of influencing a response module, (a) it must have certain characteristics–temporal persistence and well-formedness– which is quite similar to Farah et al.’s notion of quality, and (b) the two modules must be interconnected—which is the purpose of Dehaene et al.’s global workspace. The model has two limitations that restrict its value as a contemporary account of conscious access. First, it addressed classical subliminal priming data, but more reliable data has recently been reported. Second, like the other two models, Mathis and Mozer used a complex neural network architecture with arbitrary assumptions built in, and the sensitivity of the model’s behavior to these assumptions is far from clear. In this paper, we present a model that embodies the same assumptions as Mathis and Mozer, but overcomes its two limitations, and explains subliminal-priming data that has yet to be interpreted via a computational model. 3 The Probabilistic Information Transmission (PIT) Framework Our model is based on the probabilistic information transmission or PIT framework of Mozer, Colagrosso, and Huber (2002, 2003). The framework characterizes the transmission of information from perceptual to response systems, and how the time course of information transmission changes with experience (i.e., priming). Mozer et al. used this framework to account for a variety of facilitation effects from supraliminal repetition priming. The framework describes cognition in terms of a collection of information-processing pathways, and supposes that any act of cognition involves coordination among multiple pathways. For example, to model a letter-naming task where a letter printed in upper or lower case is presented visually and the letter must be named, the framework would assume a perceptual pathway that maps the visual input to an identity representation, and a response pathway that maps a identity representation to a naming response. The framework is formalized as a probabilistic model: the pathway input and output are random variables and microinference in a pathway is carried out by Bayesian belief revision. The framework captures the time course of information processing for a single experimental trial. To elaborate, consider a pathway whose input at time t is a discrete random variable, denoted X(t), which can assume values x1 , x2 , x3 , . . . , xnx corresponding to alternative input states. Similarly, the output of the pathway at time t is a discrete random variable, denoted Y (t), which can assume values y1 , y2 , y3 , . . . , yny . For example, in the letter-naming task, the input to the perceptual pathway would be one of nx = 52 visual patterns corresponding to the upper- and lower-case letters of the alphabet, and the output is one of ny = 26 letter identities. 
To present a particular input alternative, say xi , to the model for T time steps, we specify X(t) = xi for t = 1 . . . T , and allow the model to compute P(Y (t) | X(1) . . . X(t)). A pathway is modeled as a dynamic Bayes network; the minimal version of the model used in the present simulations is simply a hidden Markov model, where the X(t) are observations and the Y (t) are inferred state (see Figure 1a). In typical usage, an HMM is presented with a sequence of distinct inputs, whereas we maintain the same input for many successive time steps; and an HMM transitions through a sequence of distinct hidden states, whereas we attempt to converge with increasing confidence on a single state. Figure 1b illustrates the time course of inference in a single pathway with 52 input and 26 output alternatives and two-to-one associations. The solid line in the Figure shows, as a function of time t, P(Y (t) = yi | X(1) = x2i . . . X(t) = x2i ), i.e., the probability that input i (say, the visual pattern of an upper case O) will produce its target output (the letter identity). Evidence for the target output accumulates gradually over time, yielding a speed-accuracy curve that relates the number of iterations to the accuracy of identification. Y0 Y1 X1 Y2 X2 P(Output) 1 Y3 X3 (a) (b) 0.8 0.6 O 0.4 0.2 0 Q Time Figure 1: (a) basic pathway architecture—a hidden Markov model; (b) time course of inference in a pathway when the letter O is presented, causing activation of both O and the visually similar Q. The exact shape of the speed-accuracy curve—the pathway dynamics—are determined by three probability distributions, which embody the knowledge and past experience of the model. First, P(Y (0)) is the prior distribution over outputs in the absence of any information about the input. Second, P(Y (t) | Y (t − 1)) characterizes how the pathway output evolves over time. We assume the transition probability matrix serves as a memory with diffusion, i.e., P(Y (t) = yi |Y (t − 1) = yj ) = (1 − β)δij + βP(Y (0) = yi ), where β is the diffusion constant and δij is the Kronecker delta. Third, P(X(t) | Y (t)) characterizes the strength of association between inputs and outputs. The greater the association strength, the more rapidly that information about X will be communicated to Y . We parameterize this distribution as P(X(t) = xi |Y (t) = yj ) ∼ 1 + k γik αkj , where αij indicates the frequency of experience with the association between states xi and yj , and γik specifies the similarity between states xi and xk . (Although the representation of states is localist, the γ terms allow us to design in the similarity structure inherent in a distributed representation.) These association strengths are highly constrained by the task structure and the similarity structure and familiarity of the inputs. Fundamental to the framework is the assumption that with each experience, a pathway becomes more efficient at processing an input. Efficiency is reflected by a shift in the speedaccuracy curve to the left. In Mozer, Colagrosso, and Huber (2002, 2003), we propose two distinct mechanisms to model phenomena of supraliminal priming. First, the association frequencies, αij , are increased following a trial in which xi leads to activation of yj , resulting in more efficient transmission of information, corresponding to an increased slope of the solid line in Figure 1b. The increase is Hebbian, based on the maximum activation achieved by xi and yj : ∆αij = η maxt P(X(t) = xi )P(Y (t) = yj ), where η is a step size. 
Second, the priors, which serve as a model of the environment, are increased to indicate a greater likelihood of the same output occurring again in the future. In modeling data from supraliminal priming, we found that the increases to association frequencies are long lasting, but the increases to the priors decay over the course of a few minutes or a few trials. As a result, the prior updating does not play into the simulation we report here; we refer the reader to Mozer, Colagrosso, and Huber (2003) for details. 4 Access Consciousness and PIT We have described the operation of a single pathway, but to model any cognitive task, we require a series of pathways in cascade. For a simple choice task, we use a percpetual pathway cascaded to a response pathway. The interconnection between the pathways is achieved by copying the output of the perceptual pathway, Y p (t), to the input of the response pathway, X r (t), at each time t. This multiple-pathway architecture allows us to characterize the notion of access consciousness. Considering the output of the perceptual pathway, access is achieved when: (1) the output representation is sufficient to trigger the correct behavior in the response pathway, and (2) the perceptual and response pathways are functionally interconnected. In more general terms, access for a perceptual pathway output requires that these two condi- tions be met not just for a specific response pathway, but for arbitrary response pathways (e.g., pathways for naming, choice, evaluation, working memory, etc.). In Mozer and Colagrosso (in preparation) we characterize the sufficiency requirements of condition 1; they involve a representation of low entropy that stays active for long enough that the representation can propagate to the next pathway. As we will show, a briefly presented stimulus fails to achieve a representation that supports choice and naming responses. Nonetheless, the stimulus evokes activity in the perceptual pathway. Because perceptual priming depends on the magnitude of the activation in the perceptual pathway, not on the activation being communicated to response pathways, the framework is consistent with the notion of priming occurring in the absence of awareness. 4.1 Simulation of Bar and Biederman (1998) Bar and Biederman (1998) presented a sequence of masked line drawings of objects and asked participants to name the objects, even if they had to guess. If the guess was incorrect, participants were required to choose the object name from a set of four alternatives. Unbeknownst to the participant, some of the drawings in the series were repeated, and Bar and Biederman were interested in whether participants would benefit from the first presentation even if it could not be identified. The repeated objects could be the same or a different exemplar of the object, and it could appear in either the same or a different display position. Participants were able to name 13.5% of drawings on presentation 1, but accuracy jumped to 34.5% on presentation 2. Accuracy did improve, though not as much, if the same shape was presented in a different position, but not if a different drawing of the same object was presented, suggesting a locus of priming early in the visual stream. The improvement in accuracy is not due to practice in general, because accuracy rose only 4.0% for novel control objects over the course of the experiment. 
The priming is firmly subliminal, because participants were not only unable to name objects on the first presentation, but their fouralternative forced choice (4AFC) performance was not much above chance (28.5%). To model these phenomena, we created a response pathway with fifty states representing names of objects that are used in the experiment, e.g., chair and lamp. We also created a perceptual pathway with states representing visual patterns that correspond to the names in the response pathway. Following the experimental design, every object identity was instantiated in two distinct shapes, and every shape could be in one of nine different visualfield positions, leading to 900 distinct states in the perceptual pathway to model the possible visual stimuli. The following parameters were fit to the data. If two perceptual states, xi and xk are the same shape in different positions, they are assigned a similarity coefficient γik = 0.95; all other similarity coefficients are zero. The association frequency, α, for valid associations in the perceptual pathway was 22, and the response pathway 18. Other parameters were β p = .05, β r = .01, and η = 1.0. The PIT model achieves a good fit to the human experimental data (Figure 2). Specifically, priming is greatest for the same shape in the same position, some priming occurs for the same shape in a different position, and no substantial priming occurs for the different shape. Figure 3a shows the time course of activation of a stimulus representation in the perceptual pathway when the stimulus is presented for 50 iterations, on both the first and third presentations. The third presentation was chosen instead of the second to make the effect of priming clearer. Even though a shape cannot be named on the first presentation, partial information about the shape may nonetheless be available for report. The 4AFC test of Bar and Biederman provides a more sensitive measure of residual stimulus information. In past work, we modeled forced-choice tasks using a response pathway with only the alternatives under consideration. However, in this experiment, forced-choice performance must be estimated conditional on incorrect naming. In PIT framework, we achieve this using naming and 40 40 First Block Second Block 30 25 20 15 10 5 First Block 35 Percent Correct Naming Percent Correct Naming 35 Second Block 30 25 20 15 10 5 0 0 Control Objects Prime SHAPE: Same Objects POSITION: Same Same Different Different Different Second Same Different Control Control Objects Prime SHAPE: Same Objects POSITION: Same Same Different Different Different Second Same Different Control Figure 2: (left panel) Data from Bar and Biederman (1998) (right panel) Simulation of PIT. White bar: accuracy on first presentation of a prime object. Black bars: the accuracy when the object is repeated, either with the same or different shape, and in the same or different position. Grey bars: accuracy for control objects at the beginning and the end of the experiment. forced-choice output pathways having output distributions N (t) and F (t), which are linked via the perceptual state, Y p (t). F (t) must be reestimated with the evidence that N (t) is not the target state. This inference problem is intractable. We therefore used a shortcut in which a single response pathway is used, augmented with a simple three-node belief net (Figure 3b) to capture the dependence between naming and forced choice. 
The belief net has a response pathway node Y r (t) connected to F (t) and N (t), with conditional distribution P (N (t) = ni |Y r (t) = yj ) = θδij + (1 − θ)/|Y r |, and an analogous distribution for P (F (t) = fi |Y r (t) = yj ). The free parameter θ determines how veridically naming and forced-choice actions reflect response-pathway output. Over a range of θ, θ < 1, the model obtains forced-choice performance near chance on the first presentation when the naming response is incorrect. For example, with θ = 0.72, the model produces a forced-choice accuracy on presentation 1 of 26.1%. (Interestingly, the model also produces below chance performance on presentation 2 if the object is not named correctly—23.5%—which is also found in the human data—20.0%.) Thus, by the stringent criterion of 4AFC, the model shows no access consciousness, and therefore illustrates a dissociation between priming and access consciousness. In our simulation, we followed the procedure of Bar and Biederman by including distractor alternatives with visual and semantic similarity to the target. These distractors are critical: with unrelated distractors, the model’s 4AFC performance is significantly above chance, illustrating that a perceptual representation can be adequate to support some responses but not others, as Farah et al. (1994) also argued. 4.2 Simulation of Abrams and Greenwald (2000) During an initial phase of the experiment, participants categorized 24 clearly visible target words as pleasant (e.g., HUMOR) or unpleasant (e.g., SMUT). They became quite familiar with the task by categorizing each word a total of eight times. In a second phase, participants were asked to classify the same targets and were given a response deadline to induce errors. The targets were preceded by masked primes that could not be identified. Of interest is the effective valence (or EV) of the target for different prime types, defined as the error rate difference between unpleasant and pleasant targets. A positive (negative) EV indicates that responses are biased toward a pleasant (unpleasant) interpretation by the prime. As one would expect, pleasant primes resulted in a positive EV, unpleasant primes in a negative EV. Of critical interest is the finding that a nonword prime formed by recombining two pleasant targets (e.g., HULIP from HUMOR and TULIP) or unpleasant targets (e.g., BIUT from BILE and SMUT ) also served to bias the targets. More surprising, a positive EV resulted from unpleasant prime words formed by recombining two pleasant targets (TUMOR from TULIP and HUMOR ), indicating that subliminal priming arises from word fragments, not words as unitary entities, and providing further evidence for an early locus of subliminal priming. Note that the results depend critically on the first phase of the experiment, which gave participants extensive practice on a relatively small set of words that were then used as and recombined to form primes. Words not studied in the first phase (orphans) provided Probability 0.6 0.5 0.4 0.3 0.2 0.1 0 object, first presentation object, third presentation different object N(t) F(t) Yr(t) 1 50 1000 (a) (b) Time (msec) Figure 3: (a) Activation of the perceptual representation in PIT as a function of processing iterations Effective Valence on the first (thin solid line) and third (thick solid line) presentations of target. (b) Bayes net for performing 4AFC conditional on incorrect naming response. 
0.4 Experiment Model 0.3 0.2 0.1 0 targets hulip-type tumor-type orphans Figure 4: Effective valence of primes in the Abrams and Greenwald (2000) experiment for human subjects (black bars) and PIT model (grey bars). HULIP-type primes are almost as strong as target repetitions, and TUMOR-type primes have a positive valence, contrary to the meaning of the word. no significant EV effect when used as primes. In this simulation, we used a three pathway model: a perceptual pathway that maps visual patterns to orthography with 200 input states corresponding both to words, nonwords, and nonword recombinations of words; a semantic pathway that maps to 100 distinct lexical/semantic states; and a judgement pathway that maps to two responses, pleasant and unpleasant. In the perceptual pathway, similarity structure was based on letter overlap, so that HULIP was similar to both TULIP and HUMOR, with γ = 0.837. No similarity was assumed in the semantic state representation; consistent with the previous simulation, β p = .05, β s = .01, β j = .01, and η = .01. At the outset of the simulation, α frequencies for correct associations were 15, 19, and 25 in the perceptual, semantic, and judgement pathways. The initial phase of the experiment was simulated by repeated supraliminal presentation of words, which increased the association frequencies in all three pathways through the ∆αij learning rule. Long-term supraliminal priming is essential in establishing the association strengths, as we’ll explain. Short-term subliminal priming also plays a key role in the experiment. During the second phase of the experiment, residual activity from the prime—primarily in the judgement pathway—biases the response to the target. Residual activation of the prime is present even if the representation of the prime does not reach sufficient strength that it could be named or otherwise reported. The outcome of the simulation is consistent with the human data (Figure 4). When a HULIP -type prime is presented, HUMOR and TULIP become active in the semantic pathway because of their visual similarity to HULIP. Partial activation of these two practiced words pushes the judgement pathway toward a pleasant response, resulting in a positive EV. When a TUMOR-type prime is presented, three different words become active in the semantic pathway: HUMOR, TULIP, and TUMOR itself. Although TUMOR is more active, it was not one of the words studied during the initial phase of the experiment, and as a result, it has a relatively weak association to the unpleasant judgement, in contrast to the other two words which have strong associations to the pleasant judgement. Orphan primes have little effect because they were not studied during the initial phase of the experiment, and consequently their association to pleasant and unpleasant judgements is also weak. In summary, activation of the prime along a critical, well-practiced pathway may not be sufficient to support an overt naming response, yet it may be sufficient to bias the processing of the immediately following target. 5 Discussion An important contribution of this work has been to demonstrate that specific experimental results relating to access consciousness and subliminal priming can be interpreted in a concrete computational framework. 
By necessity, the PIT framework, which we previously used to model supraliminal priming data, predicts the existence of subliminal priming, because the mechanisms giving rise to priming depend on degree of activation of a representation, whereas the processes giving rise to access consciousness also depend on the temporal persistence of a representation. Another contribution of this work has been to argue that two previous computational models each tell only part of the story. Farah et al. argue that quality of representation is critical; Dehaene et al. argue that pathways to communicate representations is critical. The PIT framework argues that both of these features are necessary for access consciousness. Although the PIT framework is not completely developed, it nonetheless makes a clear prediction: that subliminal priming is can never be stronger than supraliminal priming, because the maximal activation of subliminal primes is never greater than that of supraliminal primes. One might argue that many theoretical frameworks might predict the same, but no other computational model is sufficiently well developed—in terms of addressing both priming and access consciousness—to make this prediction. In its current stage of development, a weakness of the PIT framework is that it is silent as to how perceptual and response pathways become flexibly interconnected based on task demands. However, the PIT framework is not alone in failing to address this critical issue: The Dehaene et al. model suggests that once a representation enters the global workspace, all response modules can access it, but the model does not specify how the appropriate perceptual module wins the competition to enter the global workspace, or how the appropriate response module is activated. Clearly, flexible cognitive control structures that perform these functions are intricately related to mechanisms of consciousness. Acknowledgments This research was supported by NIH/IFOPAL R01 MH61549–01A1. References Abrams, R. L., & Greenwald, A. G. (2000). Parts outweigh the whole (word) in unconscious analysis of meaning. Psychological Science, 11(2), 118–124. Baars, B. (1989). A cognitive theory of consciousness. Cambridge: Cambridge University Press. Bar, M., & Biederman, I. (1998). Subliminal visual priming. Psychological Science, 9(6), 464–468. Block, N. (1995). On a confusion about a function of consciousness. Brain and Behavioral Sciences, 18(2), 227–247. Dehaene, S., & Naccache, L. (2001). Towards a cognitive neuroscience of consciousness: basic evidence and a workspace framework. Cognition, 79, 1–37. Dehaene, S., Sergent, C., & Changeux, J.-P. (2003). A neuronal network model linking subjective reports and objective physiological data during conscious perception. Proceedings of the National Academy of Sciences, 100, 8520–8525. Farah, M. J., O’Reilly, R. C., & Vecera, S. P. (1994). Dissociated overt and covert recognition as an emergent property of a lesioned neural network. Psychological Review, 100, 571–588. Mathis, D. W., & Mozer, M. C. (1996). Conscious and unconscious perception: a computational theory. In G. Cottrell (Ed.), Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society (pp. 324–328). Hillsdale, NJ: Erlbaum & Associates. Mozer, M. C., Colagrosso, M. D., & Huber, D. E. (2002). A rational analysis of cognitive control in a speeded discrimination task. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press. 
Mozer, M. C., Colagrosso, M. D., & Huber, D. E. (2003). Mechanisms of long-term repetition priming and skill refinement: A probabilistic pathway model. In Proceedings of the Twenty-Fifth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum Associates.
5 0.43200698 183 nips-2004-Temporal-Difference Networks
Author: Richard S. Sutton, Brian Tanner
Abstract: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as “learning a guess from a guess”. In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first. To be more precise, we must make a distinction between the extensive definition of a prediction, expressing its desired relationship to measurable data, and its TD definition, expressing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future rewards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It’s the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. 
1 The Learning-to-predict Problem The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation o_t ∈ O, and the agent takes an action a_t ∈ A. Whereas A is an arbitrary discrete set, we assume without loss of generality that o_t can be represented as a vector of bits. The action and observation events occur in sequence, o_1, a_1, o_2, a_2, o_3, ..., with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state. (The original figure depicts the seven states, numbered 1 to 7, in a row; the special observation bit shown above each state is 1 at the two end states and 0 at the five interior states.) The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some number of discrete steps in advance, first unconditionally and then conditioned on action sequences. 2 TD Networks A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. These links determine the extensive semantics of each prediction: its desired or target relationship to the data. They represent what we seek to predict about the data as opposed to how we try to predict it. We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. A separate set of interconnections determines the actual computational process: the updating of the predictions at each node from their previous values and the current action and observation. We think of this process as providing the answers to the questions, and accordingly we call them the answer network. The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. It is natural to consider changing the question network, but in this paper we take it as fixed and given. Figure 1a shows a suggestive example of a question network. The three squares across the top represent three observation bits. The node labeled 1 is directly connected to the first observation bit and represents a prediction that that bit will be 1 on the next time step. The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step.
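The random-walk environment just described is small enough to sketch directly. The snippet below is an illustrative Python rendering, not the authors' code: the function names are ours, and the treatment of moves past an end state (here, the agent simply stays put) is our assumption, since the text only says that the task continues.

```python
import random

N_STATES = 7  # states 1..7; the special observation bit is 1 only at the two ends

def observe(state, include_state_bits=True):
    """Observation vector: the special end-of-walk bit, optionally followed by a
    one-hot encoding of the state number (as in Experiments 1 and 2)."""
    special = 1 if state in (1, N_STATES) else 0
    if not include_state_bits:                    # Experiment 3: non-Markov version
        return [special]
    return [special] + [1 if s == state else 0 for s in range(1, N_STATES + 1)]

def random_walk(n_steps, seed=0, include_state_bits=True):
    """Yield (observation, action) pairs; actions are chosen with equal probability.
    Reaching an end state does not interrupt the (continuing) task; what happens on
    a move past an end state is an assumption made for this sketch."""
    rng = random.Random(seed)
    state = (N_STATES + 1) // 2                   # start at the center state
    for _ in range(n_steps):
        obs = observe(state, include_state_bits)
        action = rng.choice(['L', 'R'])
        yield obs, action
        state = max(1, state - 1) if action == 'L' else min(N_STATES, state + 1)
```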
Thus the extensive definition of Node 2’s prediction is the probability that the first observation bit will be 1 two time steps from now. Node 3 similarly predicts the first observation bit three time steps in the future. Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter γ. Its target is the familiar TD target, the data bit plus the node’s own prediction on the next time step (with weightings 1 − γ and γ respectively). Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions a or b are taken respectively. Node 7 is a prediction of the average of the first observation bit and Node 4’s prediction, both on the next step. This is the first case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. Node 8 predicts another average, this time of nodes 4 and 5, and the question it asks is even harder to express extensively. One could continue in this way, adding more and more nodes whose extensive definitions are difficult to express but which would nevertheless be completely defined as long as these local TD relationships are clear. The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. In this paper we consider only simple question networks such as the left column of Figure 1a and of the action-conditional tree form shown in Figure 1b. Figure 1: The question networks of two TD networks. (a) a question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. Observation bits are represented as squares across the top while actual nodes of the TD network, corresponding each to a separate prediction, are below. The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown). Note that all of these nodes, arrows, and numbers are completely different and separate from those representing the random-walk problem on the preceding page. More formally and generally, let y_t^i ∈ [0, 1], i = 1, ..., n, denote the prediction of the ith node at time step t. The column vector of predictions y_t = (y_t^1, ..., y_t^n)^T is updated according to a vector-valued function u with modifiable parameter W: y_t = u(y_{t−1}, a_{t−1}, o_t, W_t) ∈ ℝ^n. (1) The update function u corresponds to the answer network, with W being the weights on its links. Before detailing that process, we turn to the question network, the defining TD relationships between nodes. The TD target z_t^i for y_t^i is an arbitrary function z^i of the successive predictions and observations. In vector form we have z_t = z(o_{t+1}, ỹ_{t+1}) ∈ ℝ^n, (2) where ỹ_{t+1} is just like y_{t+1}, as in (1), except calculated with the old weights before they are updated on the basis of z_t: ỹ_t = u(y_{t−1}, a_{t−1}, o_t, W_{t−1}) ∈ ℝ^n. (3) (This temporal subtlety also arises in conventional TD learning.) For example, for the nodes in Figure 1a we have z_t^1 = o_{t+1}^1, z_t^2 = y_{t+1}^1, z_t^3 = y_{t+1}^2, z_t^4 = (1 − γ)o_{t+1}^2 + γy_{t+1}^4, z_t^5 = z_t^6 = o_{t+1}^3, z_t^7 = ½o_{t+1}^1 + ½y_{t+1}^4, and z_t^8 = ½y_{t+1}^4 + ½y_{t+1}^5. The target functions z^i are only part of specifying the question network. The other part has to do with making them potentially conditional on action and observation.
For example, Node 5 in Figure 1a predicts what the third observation bit will be if action a is taken. To arrange for such semantics we introduce a new vector c_t of conditions, c_t^i, indicating the extent to which y_t^i is held responsible for matching z_t^i, thus making the ith prediction conditional on c_t^i. Each c_t^i is determined as an arbitrary function c^i of a_t and y_t. In vector form we have: c_t = c(a_t, y_t) ∈ [0, 1]^n. (4) For example, for Node 5 in Figure 1a, c_t^5 = 1 if a_t = a, otherwise c_t^5 = 0. Equations (2–4) correspond to the question network. Let us now turn to defining u, the update function for y_t mentioned earlier and which corresponds to the answer network. In general u is an arbitrary function approximator, but for concreteness we define it to be of a linear form y_t = σ(W_t x_t), (5) where x_t ∈ ℝ^m is a feature vector, W_t is an n × m matrix, and σ is the n-vector form of the identity function (Experiments 1 and 2) or the S-shaped logistic function σ(s) = 1/(1 + e^{−s}) (Experiment 3). The feature vector is an arbitrary function of the preceding action, observation, and node values: x_t = x(a_{t−1}, o_t, y_{t−1}) ∈ ℝ^m. (6) For example, x_t might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and n more for the previous node values y_{t−1}. The learning algorithm for each component w_t^{ij} of W_t is w_{t+1}^{ij} − w_t^{ij} = α(z_t^i − y_t^i)c_t^i ∂y_t^i/∂w_t^{ij}, (7) where α is a step-size parameter. The timing details may be clarified by writing the sequence of quantities in the order in which they are computed: y_t, a_t, c_t, o_{t+1}, x_{t+1}, ỹ_{t+1}, z_t, W_{t+1}, y_{t+1}. (8) Finally, the target in the extensive sense for y_t is y_t^* = E_{t,π}{(1 − c_t) · y_{t+1}^* + c_t · z(o_{t+1}, y_{t+1}^*)}, (9) where · represents component-wise multiplication and π is the policy being followed, which is assumed fixed. (In general, z is a function of all the future predictions and observations, but in this paper we treat only the one-step case.) 3 Experiment 1: n-step Unconditional Prediction In this experiment we sought to predict the observation bit precisely n steps in advance, for n = 1, 2, 5, 10, and 25. In order to predict n steps in advance, of course, we also have to predict n − 1 steps in advance, n − 2 steps in advance, etc., all the way down to predicting one step ahead. This is specified by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3. Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each). We applied a TD network and a corresponding Monte Carlo method to this data. The Monte Carlo method learned the same predictions, but learned them by comparing them to the actual outcomes in the sequence (instead of z_t^i in (7)). This involved significant additional complexity to store the predictions until their corresponding targets were available. Both algorithms used feature vectors of 7 binary components, one for each of the seven states, all of which were zero except for the one corresponding to the current state. Both algorithms formed their predictions linearly (σ(·) was the identity) and unconditionally (c_t^i = 1 ∀ i, t). In an initial set of experiments, both algorithms were applied online with a variety of values for their step-size parameter α. Under these conditions we did not find that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets.
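To make equations (1)–(8) concrete, here is a minimal Python/NumPy sketch of a TD network with a linear answer network and the semi-gradient rule (7). The class and helper names are ours, the question network is passed in as plain functions, and the chain wiring at the end is one way to express the Experiment 1 question network; none of this is the authors' code, and the step size is a placeholder.

```python
import numpy as np

class TDNetwork:
    """Sketch of equations (1)-(8): linear answer network y_t = sigma(W_t x_t),
    trained with the semi-gradient rule (7). z_fn(o_next, y_tilde_next) gives the
    targets z_t, c_fn(a, y) the conditions c_t, and x_fn(a_prev, o, y_prev) the
    feature vector of equation (6). Names and structure are ours."""

    def __init__(self, n_nodes, n_features, z_fn, c_fn, x_fn, alpha=0.1, logistic=False):
        self.W = np.zeros((n_nodes, n_features))      # answer-network weights
        self.y = np.zeros(n_nodes)                    # current predictions y_t
        self.x = np.zeros(n_features)                 # current features x_t
        self.z_fn, self.c_fn, self.x_fn = z_fn, c_fn, x_fn
        self.alpha, self.logistic = alpha, logistic

    def _sigma(self, s):
        return 1.0 / (1.0 + np.exp(-s)) if self.logistic else s

    def start(self, o, a_prev=None):
        """Compute the initial x_t and y_t from the first observation."""
        self.x = np.asarray(self.x_fn(a_prev, o, self.y), dtype=float)
        self.y = self._sigma(self.W @ self.x)
        return self.y

    def step(self, a, o_next):
        """One transition, following the computation order of sequence (8)."""
        c = np.asarray(self.c_fn(a, self.y), dtype=float)                # c_t   (4)
        x_next = np.asarray(self.x_fn(a, o_next, self.y), dtype=float)   # x_{t+1} (6)
        y_tilde = self._sigma(self.W @ x_next)                           # y~_{t+1} (3)
        z = np.asarray(self.z_fn(o_next, y_tilde), dtype=float)          # z_t   (2)
        for i in range(len(self.y)):                                     # rule  (7)
            grad = self.x if not self.logistic else self.y[i] * (1 - self.y[i]) * self.x
            self.W[i] += self.alpha * (z[i] - self.y[i]) * c[i] * grad
        self.x = x_next
        self.y = self._sigma(self.W @ self.x)                            # y_{t+1} with W_{t+1}
        return self.y

# Experiment-1 style wiring: a length-N chain in which node 1 predicts the special
# bit one step ahead and node i predicts node i-1 on the next step.
N = 25
def z_chain(o_next, y_tilde_next):
    return np.concatenate(([o_next[0]], y_tilde_next[:-1]))
def c_uncond(a, y):
    return np.ones(N)                                  # c_t^i = 1 for all i, t
def x_state(a_prev, o, y_prev):
    return np.array(o[1:], dtype=float)                # the seven state-indicator bits

net = TDNetwork(n_nodes=N, n_features=7, z_fn=z_chain, c_fn=c_uncond, x_fn=x_state)
```

Feeding this network the (observation, action) stream from the random_walk sketch above, via net.start followed by repeated net.step calls, reproduces the structure of the Experiment 1 setup.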
We found a clearer result when both algorithms were trained using batch updating, in which weight changes are collected “on the side” over an experience sequence and then made all at once at the end, and the whole process is repeated until convergence. Under batch updating, convergence is to the same predictions regardless of initial conditions or α value (as long as α is sufficiently small), which greatly simplifies comparison of algorithms. The predictions learned under batch updating are also the same as would be computed by least squares algorithms such as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003). The errors in the final predictions are shown in Table 1. For 1-step predictions, the Monte-Carlo and TD methods performed identically of course, but for longer predictions a significant difference was observed. The RMSE of the Monte Carlo method increased with prediction length whereas for the TD network it decreased. The largest standard error in any of the numbers shown in the table is 0.008, so almost all of the differences are statistically significant. TD methods appear to have a significant data-efficiency advantage over non-TD methods in this prediction-by-n context (and this task) just as they do in conventional multi-step prediction (Sutton, 1988).

Time Steps   1-step MC/TD   2-step MC   2-step TD   5-step MC   5-step TD   10-step MC   10-step TD   25-step MC   25-step TD
50           0.205          0.219       0.172       0.234       0.159       0.249        0.139        0.297        0.129
100          0.124          0.133       0.100       0.160       0.098       0.168        0.079        0.187        0.068
150          0.089          0.103       0.073       0.121       0.076       0.130        0.063        0.153        0.054
200          0.076          0.084       0.060       0.109       0.065       0.112        0.056        0.118        0.049

Table 1: RMSE of Monte-Carlo and TD-network predictions of various lengths and for increasing amounts of training data on the random-walk example with batch updating.

4 Experiment 2: Action-conditional Prediction The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions is made. In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. The question network for learning all 2-step-ahead predictions is shown in Figure 1b. The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). The lower four nodes correspond to the two-step predictions, e.g., the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action. These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2003). In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the figure. The feature vectors were as in the previous experiment. Now that we are conditioning on action, the problem is deterministic and α can be set uniformly to 1. A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. The TD network, on the other hand, can learn from incomplete sequences but must propagate them back one level at a time. First the one-step predictions must be learned, then the two-step predictions from them, and so on. The results for online and batch training are shown in Tables 2 and 3.
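The depth-d fully action-conditional question network of Figure 1b can be expressed in the same style as the chain above. The following sketch is our reading of the description, not the authors' code: nodes are indexed by action sequences, a node labelled (a1, ..., ak) is conditioned on its first action a1, and for k > 1 it targets the node for (a2, ..., ak) on the next step, while a length-1 node targets the special observation bit.

```python
from itertools import product
import numpy as np

ACTIONS = ['L', 'R']

def make_action_tree(depth):
    """Depth-d fully action-conditional question network: 2 + 4 + ... + 2^d nodes
    (30 nodes for depth 4, as in Experiment 2). Node ordering is our choice."""
    nodes = [seq for k in range(1, depth + 1) for seq in product(ACTIONS, repeat=k)]
    index = {seq: i for i, seq in enumerate(nodes)}
    n = len(nodes)

    def z_fn(o_next, y_tilde_next):
        z = np.empty(n)
        for seq, i in index.items():
            # length-1 sequences target the special bit; longer sequences target the
            # prediction for their one-step-shorter suffix on the next step
            z[i] = o_next[0] if len(seq) == 1 else y_tilde_next[index[seq[1:]]]
        return z

    def c_fn(a, y):
        # a node is held responsible only when the action taken matches the
        # first action of its sequence
        return np.array([1.0 if seq[0] == a else 0.0 for seq in nodes])

    return n, z_fn, c_fn
```

With the state-indicator features and α set to 1, this mirrors the Experiment 2 setup; the corresponding Monte Carlo learner would instead wait until a node's full action sequence has actually occurred before comparing its prediction with the observed bit.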
As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. Because the TD network learns its n-step predictions based on its (n − 1)-step predictions, it has a clear advantage for this task. Once the TD network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future. Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed.

Time Step   1-Step MC/TD   2-Step MC   2-Step TD   3-Step MC   3-Step TD   4-Step MC   4-Step TD
100         0.153          0.222       0.182       0.253       0.195       0.285       0.185
200         0.019          0.092       0.044       0.142       0.054       0.196       0.062
300         0.000          0.040       0.000       0.089       0.013       0.139       0.017
400         0.000          0.019       0.000       0.055       0.000       0.093       0.000
500         0.000          0.019       0.000       0.038       0.000       0.062       0.000

Table 2: RMSE of the action-conditional predictions of various lengths for Monte-Carlo and TD-network methods on the random-walk problem with online updating.

Time Steps   MC       TD
50           53.48%   17.21%
100          30.81%   4.50%
150          19.26%   1.57%
200          11.69%   0.14%

Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of Monte-Carlo and TD-network methods, for various amounts of data, on the random-walk task. All differences are statistically significant.

5 Experiment 3: Learning a Predictive State Representation Experiments 1 and 2 showed advantages for TD learning methods in Markov problems. The feature vectors in both experiments provided complete information about the nominal state of the random walk. In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in particular, in which only the special observation bit was visible and not the state number. In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step’s predictions must be used as well. As in the previous experiment, we sought to learn n-step predictions using action-conditional question networks of depths 2, 3, and 4. The feature vector x_t consisted of three parts: a constant 1, four binary features to represent the pair of action a_{t−1} and observation bit o_t, and n more features corresponding to the components of y_{t−1}. The feature vectors were thus of length m = 11, 19, and 35 for the three depths. In this experiment, σ(·) was the S-shaped logistic function. The initial weights W_0 and predictions y_0 were both 0. Fifty random-walk sequences were constructed, each of 250,000 time steps, and presented to TD networks of the three depths, with a range of step-size parameters α. We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the “empirical RMSE,” the error in the one-step prediction for the action actually taken on each step. We found that in all cases the errors approached zero over time, showing that the problem was completely solved. Figure 2 shows some representative learning curves for the depth-2 and depth-4 TD networks. Figure 2 (axes: empirical RMS error versus time steps; curves labelled α = .1, .25, .5, and .75, with one curve marked “depth 2”): Prediction performance on the non-Markov random walk with depth-4 TD networks (and one depth-2 network) with various step-size parameters, averaged over 50 runs and 1000 time-step bins. The “bump” most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times.
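The Experiment 3 feature vector can be stated in the same terms. The sketch below (the exact binary encoding of the action-observation pair is our assumption) produces vectors of length m = 5 + n, i.e. 11, 19, and 35 for the depth-2, -3, and -4 question networks built by make_action_tree above; the step size shown is a placeholder, not one of the values studied in the paper.

```python
import numpy as np

ACTIONS = ['L', 'R']

def x_nonmarkov(a_prev, o, y_prev):
    """Constant 1, four bits for the (previous action, current special bit) pair,
    and the previous predictions y_{t-1}; only the special bit is observed."""
    pair = np.zeros(4)
    if a_prev is not None:
        pair[ACTIONS.index(a_prev) * 2 + int(o[0])] = 1.0
    return np.concatenate(([1.0], pair, np.asarray(y_prev, dtype=float)))

# Putting the pieces together for the depth-4, non-Markov case:
n, z_fn, c_fn = make_action_tree(4)
net = TDNetwork(n_nodes=n, n_features=5 + n, z_fn=z_fn, c_fn=c_fn,
                x_fn=x_nonmarkov, alpha=0.25, logistic=True)
```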
In ongoing experiments on other non-Markov problems we have found that TD networks do not always find such complete solutions. Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al., 2002). Previous algorithms have been found to be effective on some tasks but not on others (e.g., Singh et al., 2003; Rudary & Singh, 2004; James & Singh, 2004). More work is needed to assess the range of effectiveness and learning rate of TD methods vis-a-vis previous methods, and to explore their combination with history information. 6 Conclusion TD networks suggest a large set of possibilities for learning to predict, and in this paper we have begun exploring the first few. Our results show that even in a fully observable setting there may be significant advantages to TD methods when learning TD-defined predictions. Our action-conditional results show that TD methods can learn dramatically faster than other methods. TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. It may be fruitful to further explore the expressive potential of TD-defined predictions. Although most of our experiments have concerned the representational expressiveness and efficiency of TD-defined predictions, it is also natural to consider using them as state, as in predictive state representations. Our experiments suggest that this is a promising direction and that TD learning algorithms may have advantages over previous learning methods. Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Satinder Singh, Doina Precup, Michael Littman, Mark Ring, Vadim Bulitko, Eddie Rafols, Anna Koop, Tao Wang, and all the members of the rlai.net group. References Boyan, J. A. (2000). Technical update: Least-squares temporal difference learning. Machine Learning 49:233–246. Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1/2/3):33–57. Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation 5(4):613–624. James, M. and Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 417–424. Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167–173. Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107–1149. Littman, M. L., Sutton, R. S. and Singh, S. (2002). Predictive representations of state. In Advances in Neural Information Processing Systems 14:1555–1561. Rudary, M. R. and Singh, S. (2004). A nonlinear predictive state representation. In Advances in Neural Information Processing Systems 16:855–862. Singh, S., Littman, M. L., Jong, N. K., Pardoe, D. and Stone, P.
(2003). Learning predictive state representations. In Proceedings of the Twentieth International Conference on Machine Learning, pp. 712–719. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9–44. Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 531–539. Morgan Kaufmann, San Francisco. Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181–211.
6 0.40649635 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
7 0.3791841 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
8 0.3758415 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
9 0.37580061 88 nips-2004-Intrinsically Motivated Reinforcement Learning
10 0.36945471 117 nips-2004-Methods Towards Invasive Human Brain Computer Interfaces
11 0.3618671 109 nips-2004-Mass Meta-analysis in Talairach Space
12 0.35595712 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning
13 0.34566227 39 nips-2004-Coarticulation in Markov Decision Processes
14 0.32252657 24 nips-2004-Approximately Efficient Online Mechanism Design
15 0.31935164 21 nips-2004-An Information Maximization Model of Eye Movements
16 0.31597942 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
17 0.30649707 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
18 0.29993036 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture
19 0.2990135 64 nips-2004-Experts in a Markov Decision Process
20 0.29319265 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
topicId topicWeight
[(13, 0.087), (15, 0.094), (19, 0.026), (26, 0.043), (31, 0.018), (33, 0.151), (35, 0.031), (39, 0.014), (50, 0.029), (54, 0.366), (58, 0.011), (76, 0.013), (89, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.81063378 155 nips-2004-Responding to Modalities with Different Latencies
Author: Fredrik Bissmarck, Hiroyuki Nakahara, Kenji Doya, Okihide Hikosaka
Abstract: Motor control depends on sensory feedback in multiple modalities with different latencies. In this paper we consider within the framework of reinforcement learning how different sensory modalities can be combined and selected for real-time, optimal movement control. We propose an actor-critic architecture with multiple modules, whose output are combined using a softmax function. We tested our architecture in a simulation of a sequential reaching task. Reaching was initially guided by visual feedback with a long latency. Our learning scheme allowed the agent to utilize the somatosensory feedback with shorter latency when the hand is near the experienced trajectory. In simulations with different latencies for visual and somatosensory feedback, we found that the agent depended more on feedback with shorter latency. 1
2 0.73529899 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
Author: Suvrit Sra, Joel Tropp, Inderjit S. Dhillon
Abstract: Various problems in machine learning, databases, and statistics involve pairwise distances among a set of objects. It is often desirable for these distances to satisfy the properties of a metric, especially the triangle inequality. Applications where metric data is useful include clustering, classification, metric-based indexing, and approximation algorithms for various graph problems. This paper presents the Metric Nearness Problem: Given a dissimilarity matrix, find the “nearest” matrix of distances that satisfy the triangle inequalities. For p nearness measures, this paper develops efficient triangle fixing algorithms that compute globally optimal solutions by exploiting the inherent structure of the problem. Empirically, the algorithms have time and storage costs that are linear in the number of triangle constraints. The methods can also be easily parallelized for additional speed. 1
3 0.62539047 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
4 0.48583555 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
5 0.48460093 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the first-order conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
6 0.48455635 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
7 0.48322934 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.48244536 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
9 0.48177683 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
10 0.4816542 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
11 0.48070017 163 nips-2004-Semi-parametric Exponential Family PCA
12 0.48048255 127 nips-2004-Neighbourhood Components Analysis
13 0.48017845 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
14 0.47988442 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
15 0.47977129 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
16 0.47974664 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
17 0.47973293 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
18 0.47963527 116 nips-2004-Message Errors in Belief Propagation
19 0.47881284 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
20 0.47863385 70 nips-2004-Following Curved Regularized Optimization Solution Paths