nips nips2007 nips2007-191 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marcus Hutter, Shane Legg
Abstract: We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We derive an equation for temporal difference learning from statistical principles. [sent-7, score-0.197]
2 Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. [sent-8, score-0.299]
3 The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. [sent-9, score-0.329]
4 In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. [sent-10, score-0.237]
5 We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. [sent-11, score-0.086]
6 Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. [sent-12, score-0.184]
7 To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. [sent-13, score-0.182]
8 1 Introduction In the field of reinforcement learning, perhaps the most popular way to estimate the future discounted reward of states is the method of temporal difference learning. [sent-14, score-0.449]
9 It is unclear who exactly introduced this first; however, the first explicit version of temporal difference as a learning rule appears to be that of Witten [9]. [sent-15, score-0.15]
10 The idea is as follows: The expected future discounted reward of a state s is V_s := E[ r_k + γ r_{k+1} + γ^2 r_{k+2} + · · · | s_k = s ], (1) where the rewards r_k, r_{k+1}, . . . [sent-16, score-0.463]
11 are geometrically discounted into the future by γ < 1. [sent-19, score-0.165]
12 Our task, at time t, is to compute an estimate V^t_s of V_s for each state s. [sent-21, score-0.103]
13 The only information we have to base this estimate on is the current history of state transitions, s_1, s_2, . . . , s_t, and the current history of observed rewards, r_1, r_2, . . . [sent-22, score-0.119] [sent-25, score-0.203]
15 Equation (1) suggests that at time t + 1 the value of r_t + γ V^t_{s_{t+1}} provides us with information on what V_{s_t} should be: if it is higher than V^t_{s_t} then perhaps this estimate should be increased, and vice versa. [sent-29, score-0.165]
16 This intuition gives us the following estimation heuristic for state s_t: V^{t+1}_{s_t} := V^t_{s_t} + α ( r_t + γ V^t_{s_{t+1}} − V^t_{s_t} ), where α is a parameter that controls the rate of learning. [sent-30, score-0.47]
17 This type of temporal difference learning is known as TD(0). [sent-31, score-0.127]
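To make the TD(0) rule above concrete, here is a minimal tabular sketch in Python; the function name, the array layout, and the toy numbers are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

def td0_update(V, s, s_next, r, alpha, gamma):
    """Apply one TD(0) update to the value table V after observing the
    transition s -> s_next with reward r."""
    td_error = r + gamma * V[s_next] - V[s]   # r_t + gamma*V(s_{t+1}) - V(s_t)
    V[s] += alpha * td_error                  # move V(s_t) towards the target
    return V

# Toy usage: 5 states, a single observed transition.
V = np.zeros(5)
V = td0_update(V, s=2, s_next=3, r=1.0, alpha=0.1, gamma=0.9)
```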
18 One shortcoming of this method is that at each time step the value of only the last state s_t is updated. [sent-32, score-0.293]
19 States before the last state are also affected by changes in the last state’s value and thus these could be updated too. [sent-33, score-0.106]
20 This is what happens with so-called temporal difference learning with eligibility traces, where a history, or trace, is kept of which states have been recently visited. [sent-34, score-0.22]
21 Under this method, when we update the value of a state we also go back through the trace updating the earlier states as well. [sent-35, score-0.218]
22 Formally, for any state s its eligibility trace is computed by E^t_s := γλ E^{t−1}_s if s ≠ s_t, and E^t_s := γλ E^{t−1}_s + 1 if s = s_t, where λ is used to control the rate at which the eligibility trace is discounted. [sent-36, score-0.719]
23 The temporal difference update is then, for all states s, V^{t+1}_s := V^t_s + α E^t_s ( r_t + γ V^t_{s_{t+1}} − V^t_{s_t} ). (2) [sent-37, score-0.151]
24 This more powerful version of temporal difference learning is known as TD(λ) [7]. [sent-38, score-0.096]
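As a companion to Equation (2), the sketch below shows one tabular TD(λ) step with accumulating eligibility traces; the names and the toy usage are again our own illustrative choices.

```python
import numpy as np

def td_lambda_update(V, E, s, s_next, r, alpha, gamma, lam):
    """One accumulating-trace TD(lambda) step applied to all states."""
    E *= gamma * lam              # decay every trace: E_s <- gamma*lambda*E_s
    E[s] += 1.0                   # bump the trace of the state just left
    td_error = r + gamma * V[s_next] - V[s]
    V += alpha * E * td_error     # update each state in proportion to its trace
    return V, E

V, E = np.zeros(5), np.zeros(5)
V, E = td_lambda_update(V, E, s=2, s_next=3, r=1.0, alpha=0.1, gamma=0.9, lam=0.8)
```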
25 The main idea of this paper is to derive a temporal difference rule from statistical principles and compare it to the standard heuristic described above. [sent-39, score-0.159]
26 On the other hand, our algorithm “exactly” coincides with TD/Q/Sarsa(λ) for finite state spaces, but with a novel learning rate derived from statistical principles. [sent-42, score-0.168]
27 We then test it on a random Markov chain in Section 4 and a non-stationary environment in Section 5. [sent-48, score-0.094]
28 In Section 6 we derive two new methods for policy learning based on HL(λ), and compare them to Sarsa(λ) and Watkins’ Q(λ) on a simple reinforcement learning problem. [sent-49, score-0.129]
29 2 Derivation The empirical future discounted reward of a state s_k is the sum of actual rewards following from state s_k in time steps k, k + 1, . . . , where the rewards are discounted as they go into the future. [sent-51, score-0.68] [sent-54, score-0.167]
31 Formally, the empirical value of state s_k at time k, for k = 1, . . . , t, is v_k := Σ_{u=k}^{∞} γ^{u−k} r_u, (3) where the future rewards r_u are geometrically discounted by γ < 1. [sent-55, score-0.216] [sent-58, score-0.412]
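A small sketch of Equation (3) on a finite reward sequence; truncating the infinite tail is an approximation, and the returns are computed backwards so that each v_k reuses v_{k+1}.

```python
def empirical_returns(rewards, gamma):
    """Discounted returns v_k = sum_{u >= k} gamma^(u-k) * r_u over a finite
    reward sequence (the infinite tail is simply truncated)."""
    v = [0.0] * len(rewards)
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        v[k] = running
    return v

print(empirical_returns([1.0, 0.0, -1.0], gamma=0.9))  # approx. [0.19, -0.9, -1.0]
```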
33 In practice the exact value of v_k is always unknown to us, as it depends not only on rewards that have already been observed, but also on unknown future rewards. [sent-59, score-0.174]
34 Note that if s_m = s_n for m ≠ n, that is, we have visited the same state twice at different times m and n, this does not imply that v_n = v_m, as the observed rewards following the state visit may be different each time. [sent-60, score-0.289]
35 Our goal is that for each state s the estimate V^t_s should be as close as possible to the true expected future discounted reward V_s. [sent-61, score-0.334]
36 Thus, for each state s we would like V_s to be close to v_k for all k such that s = s_k. [sent-62, score-0.277]
37 Formally, we want to minimise the loss function L := (1/2) Σ_{k=1}^{t} λ^{t−k} ( v_k − V^t_{s_k} )^2. (4) [sent-64, score-0.1]
38 By defining a discounted state visit counter N^t_s := Σ_{k=1}^{t} λ^{t−k} δ_{s_k s} we get V^t_s N^t_s = Σ_{k=1}^{t} λ^{t−k} δ_{s_k s} v_k. (5) [sent-67, score-0.336]
39 Since v_k depends on the future rewards r_k, r_{k+1}, . . . , Equation (5) cannot be used in its current form. [sent-68, score-0.203]
40 Specifically, the tail of the future discounted reward sum for each state depends on the empirical value at time t in the following way: v_k = Σ_{u=k}^{t−1} γ^{u−k} r_u + γ^{t−k} v_t. (6) [sent-70, score-0.427]
41 For state s = s_t, this simplifies to V^t_{s_t} = R^t_{s_t} / ( N^t_{s_t} − E^t_{s_t} ). [sent-76, score-0.276]
42 Substituting this back into Equation (6) we obtain V^t_s N^t_s = R^t_s + E^t_s R^t_{s_t} / ( N^t_{s_t} − E^t_{s_t} ). (7) [sent-77, score-0.213]
43 To derive this we make use of the relations N^{t+1}_s = λ N^t_s + δ_{s_{t+1} s}, E^{t+1}_s = λγ E^t_s + δ_{s_{t+1} s}, and R^{t+1}_s = λ R^t_s + λ E^t_s r_t, with N^0_s = E^0_s = R^0_s = 0. [sent-80, score-0.149]
44 Inserting these into Equation (7) with t replaced by t + 1, V^{t+1}_s N^{t+1}_s = R^{t+1}_s + E^{t+1}_s R^{t+1}_{s_{t+1}} / ( N^{t+1}_{s_{t+1}} − E^{t+1}_{s_{t+1}} ) = λ R^t_s + λ E^t_s r_t + E^{t+1}_s ( R^t_{s_{t+1}} + E^t_{s_{t+1}} r_t ) / ( N^t_{s_{t+1}} − γ E^t_{s_{t+1}} ). [sent-81, score-0.268]
45 + ( λγ E^t_s + δ_{s_{t+1} s} ) ( N^t_{s_{t+1}} V^t_{s_{t+1}} − E^t_{s_{t+1}} V^t_{s_t} + E^t_{s_{t+1}} r_t ) / ( ( N^t_{s_{t+1}} − γ E^t_{s_{t+1}} ) ( λ N^t_s + δ_{s_{t+1} s} ) ). [sent-83, score-0.134]
46 β_t(s, s_{t+1}) := ( λ N^t_{s_{t+1}} + δ_{s_{t+1} s} ) / ( ( N^t_{s_{t+1}} − γ E^t_{s_{t+1}} ) N^{t+1}_s ). (9) Examining Equation (8), we find the usual update equation for temporal difference learning with eligibility traces (see Equation (2)); however, the learning rate α has now been replaced by β_t(s, s_{t+1}). [sent-86, score-0.381]
47 This learning rate was derived from statistical principles by minimising the squared loss between the estimated and true state value. [sent-87, score-0.181]
48 This gives us an equation for the learning rate for each state transition at time t, as opposed to the standard temporal difference learning where the learning rate α is either a fixed free parameter for all transitions, or is decreased over time by some monotonically decreasing function. [sent-89, score-0.475]
49 In either case, the learning rate is not automatic and must be experimentally tuned for good performance. [sent-90, score-0.098]
50 The meaning of the second term, however, can be understood as follows: N^t_s measures how often we have visited state s in the recent past. [sent-93, score-0.115]
51 Therefore, if N^t_s ≪ N^t_{s_{t+1}} then state s has a value estimate based on relatively few samples, while state s_{t+1} has a value estimate based on relatively many samples. [sent-94, score-0.24]
52 In such a situation, the second term in β_t boosts the learning rate so that V_s moves more aggressively towards the presumably more accurate r_t + γ V^t_{s_{t+1}}. [sent-95, score-0.213]
53 In the opposite situation, when s_{t+1} is a less visited state, the reverse occurs and the learning rate is reduced in order to maintain the existing value of V_s. [sent-96, score-0.136]
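Putting the pieces together, one per-step HL(λ) update might be sketched as follows. It assumes the update has the TD(λ) form of Equation (2) with α replaced by the β_t reconstructed as Equation (9) above; that reconstruction, the guard against division by zero, the initialisation of the start state, and all names are our assumptions rather than the paper's own pseudocode.

```python
import numpy as np

def hl_lambda_step(V, N, E, s, s_next, r, gamma, lam):
    """One HL(lambda) step for the observed transition s -> s_next with
    reward r.  N and E hold the per-state discounted visit counters and
    eligibility traces; beta follows the reconstructed Equation (9)."""
    delta = r + gamma * V[s_next] - V[s]          # usual TD error

    N_next = lam * N                              # N^{t+1}_x = lam*N^t_x + [x == s_next]
    N_next[s_next] += 1.0

    indicator = np.zeros_like(N)                  # Kronecker delta on s_next
    indicator[s_next] = 1.0
    denom = (N[s_next] - gamma * E[s_next]) * N_next
    beta = np.divide(lam * N[s_next] + indicator, denom,
                     out=np.zeros_like(denom), where=denom > 1e-12)

    V = V + beta * E * delta                      # Equation (8), applied to all states

    E = lam * gamma * E                           # roll the traces forward to t+1
    E[s_next] += 1.0
    return V, N_next, E

# Toy usage on a 5-state problem; the start state counts as visited once.
V, N, E = np.zeros(5), np.zeros(5), np.zeros(5)
N[0] += 1.0; E[0] += 1.0
V, N, E = hl_lambda_step(V, N, E, s=0, s_next=1, r=0.5, gamma=0.99, lam=1.0)
```

Note that no step-size parameter appears anywhere: the effective learning rate is built entirely from the counters N and E.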
54 In each step the state number is either incremented or decremented by one with equal probability, unless the system is in state 0 or 50 in which case it always transitions to state 25 in the following step. [sent-98, score-0.296]
55 When the state transitions from 0 to 25 a reward of 1.0 is generated, and for a transition from 50 to 25 a reward of -1.0 is generated. [sent-99, score-0.205] [sent-100, score-0.119]
57 We used a discount value of 0.99 and then computed the true discounted value of each state by running a brute force Monte Carlo simulation. [sent-104, score-0.244]
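For reference, the 51-state chain just described can be written as a tiny step function; the reward timing (paid on the transition out of an end state) follows the description above, while everything else, including the names, is an illustrative assumption.

```python
import random

def step_51_chain(state):
    """One transition of the 51-state chain: states 0..50, a random walk in
    the middle, and a jump from either end back to state 25 with reward
    +1.0 (from state 0) or -1.0 (from state 50)."""
    if state == 0:
        return 25, 1.0
    if state == 50:
        return 25, -1.0
    return state + random.choice((-1, 1)), 0.0

# Roll out a short trajectory starting from the middle state.
s, trajectory = 25, []
for _ in range(10):
    s_next, r = step_51_chain(s)
    trajectory.append((s, s_next, r))
    s = s_next
```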
58 We ran our algorithm 10 times on the above Markov chain and computed the root mean squared error in the value estimates across the states at each time step, averaged over the runs. [sent-105, score-0.124]
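The error measure is the usual root mean squared error over states; a small helper (the name is ours) might look like this:

```python
import numpy as np

def rmse(V_est, V_true):
    """Root mean squared error of the value estimates across all states."""
    return float(np.sqrt(np.mean((np.asarray(V_est) - np.asarray(V_true)) ** 2)))
```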
59 Figure 1: 51 state Markov process averaged over 10 runs. [sent-138, score-0.114]
60 Figure 2: 51 state Markov process averaged over 300 runs. [sent-144, score-0.114]
61 0, which was to be expected given that the environment is stationary and thus discounting old experience is not helpful. [sent-146, score-0.11]
62 To test this we explored the performance of a number of different learning rate functions on the 51 state Markov chain described above. [sent-155, score-0.197]
63 With a variable learning rate TD(λ) performs much better; however, we were still unable to find an equation that reduced the learning rate in such a way that TD(λ) would outperform HL(λ). [sent-159, score-0.237]
64 This is evidence that HL(λ) is adapting the learning rate optimally without the need for manual equation tuning. [sent-160, score-0.148]
65 4 Random Markov process To test on a Markov process with a more complex transition structure, we created a random 50 state Markov process. [sent-161, score-0.121]
66 Then to transition between states we interpreted the ith row as a probability distribution over which state follows state i. [sent-165, score-0.246]
67 To compute the reward associated with each transition we created a random matrix as above, but without normalising. [sent-166, score-0.119]
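A sketch of this construction is below; the uniformly random matrix entries and the fixed seed are assumptions made for the illustration, not details stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Transition structure: each row of a random matrix is normalised into a
# probability distribution over successor states.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

# Rewards: a second random matrix, left unnormalised, indexed by transition.
R = rng.random((n, n))

def sample_transition(state):
    """Sample a successor according to row `state` of P and return the
    associated transition reward."""
    s_next = rng.choice(n, p=P[state])
    return int(s_next), float(R[state, s_next])
```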
68 We used a discount value of 0.9 and then ran a brute force Monte Carlo simulation to compute the true discounted value of each state. [sent-168, score-0.174]
69 For TD we experimented with a range of parameter settings and learning rate decrease functions. [sent-171, score-0.094]
70 Furthermore, stability in the error towards the end of the run is better with HL(λ), and no manual learning rate tuning was required for these performance gains. [sent-177, score-0.086]
71 Figure 3: Random 50 state Markov process. [sent-194, score-0.089]
72 Figure 4: 21 state non-stationary Markov process. [sent-210, score-0.089]
73 5 Non-stationary Markov process The λ parameter in HL(λ), introduced in Equation (4), reduces the importance of old observations when computing the state value estimates. [sent-211, score-0.131]
74 When the environment is stationary this is not useful and so we can set λ = 1.0; however, in a non-stationary environment we need to reduce this value so that the state values adapt properly to changes in the environment. [sent-212, score-0.085] [sent-213, score-0.171]
76 The more rapidly the environment is changing, the lower we need to make λ in order to more rapidly forget old observations. [sent-214, score-0.09]
77 At that point, we changed the reward received when transitioning from the last state to the middle state, which had previously been -1.0. [sent-217, score-0.285]
78 At time 10,000 we then switched back to the original Markov chain, and so on alternating between the models of the environment every 5,000 steps. [sent-220, score-0.091]
79 At each switch, we also changed the target state values that the algorithm was trying to estimate to match the current configuration of the environment. [sent-221, score-0.123]
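A rough sketch of such a switching environment is given below; the chain structure mirroring the 51-state example and the choice to swap both end-state rewards at every switch are assumptions based on our reading of the description, not details given in the text.

```python
import random

SWITCH_PERIOD = 5_000   # steps between environment switches, as described above

def step_nonstationary(state, t, n_states=21):
    """One transition of a 21-state random-walk chain whose end-state rewards
    are swapped every SWITCH_PERIOD steps (sign convention assumed)."""
    middle, last = n_states // 2, n_states - 1
    flipped = (t // SWITCH_PERIOD) % 2 == 1     # which reward configuration is active
    if state == 0:
        return middle, -1.0 if flipped else 1.0
    if state == last:
        return middle, 1.0 if flipped else -1.0
    return state + random.choice((-1, 1)), 0.0
```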
80 As we would expect, for both algorithms, pushing λ above its optimal value caused poor performance in the periods following each switch in the environment (these bad parameter settings are not shown in the results). [sent-230, score-0.12]
81 On the other hand, setting λ too low produced initially fast adaptation to each environment switch, but poor performance after that until the next environment change. [sent-231, score-0.13]
82 6 Windy Gridworld Reinforcement learning algorithms such as Watkins’ Q(λ) [8] and Sarsa(λ) [5, 4] are based on temporal difference updates. [sent-240, score-0.127]
83 This suggests that new reinforcement learning algorithms based on HL(λ) should be possible. [sent-241, score-0.095]
84 For our first experiment we took the standard Sarsa(λ) algorithm and modified it in the obvious way to use an HL temporal difference update. [sent-242, score-0.108]
85 In the presentation of this algorithm we have changed notation slightly to make things more consistent with what is typical in reinforcement learning. [sent-243, score-0.112]
86 Our new reinforcement learning algorithm, which we call HLS(λ) is given in Algorithm 1. [sent-245, score-0.095]
87 Essentially the only changes to the standard Sarsa(λ) algorithm have been to add code to compute the visit counter N (s, a), add a loop to compute the β values, and replace α with β in the temporal difference update. [sent-246, score-0.161]
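Following that description, the modified inner-loop update might be sketched as below. It reuses the β expression reconstructed earlier, applied to state-action pairs, so both the formula and the names are assumptions to be checked against the paper's Algorithm 1 rather than a verbatim copy of it.

```python
import numpy as np

def hls_step(Q, N, E, s, a, r, s_next, a_next, gamma, lam):
    """One tabular HLS(lambda) step: the usual Sarsa(lambda) update with
    alpha replaced by a per-entry beta built from visit counters N(s, a)."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]

    N_next = lam * N                                  # counters rolled to the next step
    N_next[s_next, a_next] += 1.0

    indicator = np.zeros_like(N)                      # Kronecker delta on (s_next, a_next)
    indicator[s_next, a_next] = 1.0
    denom = (N[s_next, a_next] - gamma * E[s_next, a_next]) * N_next
    beta = np.divide(lam * N[s_next, a_next] + indicator, denom,
                     out=np.zeros_like(denom), where=denom > 1e-12)

    Q = Q + beta * E * delta                          # beta and E are (n_states, n_actions) tables

    E = lam * gamma * E
    E[s_next, a_next] += 1.0
    return Q, N_next, E
```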
88 The agent starts in the 4th row of the 1st column and receives a reward of 1 when it finds its way to the 4th row of the 8th column. [sent-250, score-0.151]
89 To make things more difficult, there is a “wind” blowing the agent up 1 row in columns 4, 5, 6, and 9, and a strong wind of 2 in columns 7 and 8. [sent-251, score-0.092]
90 Unlike in the original version, we have set up this problem to be a continuing discounted task with an automatic transition from the goal state back to the start state. [sent-253, score-0.279]
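A compact sketch of such a continuing windy gridworld is given below; the 7×10 grid size, the 0-based indexing, and the convention that the wind decreases the row index are assumptions, while the wind strengths, start, goal, and reward follow the description above.

```python
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # upward push per column (1-based columns 4-9 in the text)
N_ROWS, N_COLS = 7, 10
START, GOAL = (3, 0), (3, 7)            # "4th row of the 1st column" / "4th row of the 8th column"

def gridworld_step(pos, action):
    """Move by `action` (drow, dcol), apply the wind of the current column,
    clip to the grid, and jump back to START with reward 1 on reaching GOAL."""
    row, col = pos
    drow, dcol = action
    new_row = min(max(row + drow - WIND[col], 0), N_ROWS - 1)
    new_col = min(max(col + dcol, 0), N_COLS - 1)
    if (new_row, new_col) == GOAL:
        return START, 1.0
    return (new_row, new_col), 0.0
```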
91 We used a discount of 0.99 and in each run computed the empirical future discounted reward at each point in time. [sent-255, score-0.249]
92 Despite putting considerable effort into tuning the parameters of Sarsa(λ), we were unable to achieve a final future discounted reward above 5. [sent-259, score-0.271]
93 This combination of superior performance and fewer parameters to tune suggests that the benefits of HL(λ) carry over into the reinforcement learning setting. [sent-265, score-0.135]
94 Similar to Sarsa(λ) above, we simply inserted the HL(λ) temporal difference update into the usual Q(λ) algorithm in the obvious way. [sent-267, score-0.131]
95 Figure 5: [Windy Gridworld] S marks the start state and G the goal state, at which the agent jumps back to S with a reward of 1. [sent-278, score-0.25]
96 7 Conclusions We have derived a new equation for setting the learning rate in temporal difference learning with eligibility traces. [sent-283, score-0.334]
97 The equation replaces the free learning rate parameter α, which is normally experimentally tuned by hand. [sent-284, score-0.167]
98 In every setting tested, be it stationary Markov chains, non-stationary Markov chains or reinforcement learning, our new method produced superior results. [sent-285, score-0.144]
99 In terms of experimental results, it would be interesting to try different types of reinforcement learning problems and to more clearly identify where the ability to set the learning rate differently for different state transition pairs helps performance. [sent-289, score-0.295]
100 Finally, just as we have successfully merged HL(λ) with Sarsa(λ) and Watkins’ Q(λ), we would also like to see if the same can be done with Peng’s Q(λ) [3], and perhaps other reinforcement learning algorithms. [sent-291, score-0.095]
wordName wordTfidf (topN-words)
[('vst', 0.451), ('vstt', 0.423), ('nst', 0.352), ('hl', 0.314), ('td', 0.237), ('est', 0.235), ('ns', 0.192), ('st', 0.187), ('sarsa', 0.157), ('rt', 0.134), ('discounted', 0.116), ('sk', 0.11), ('state', 0.089), ('reward', 0.087), ('rs', 0.086), ('hls', 0.085), ('vk', 0.078), ('temporal', 0.077), ('reinforcement', 0.076), ('eligibility', 0.073), ('es', 0.068), ('environment', 0.065), ('watkins', 0.061), ('rate', 0.06), ('ru', 0.059), ('windy', 0.056), ('equation', 0.055), ('markov', 0.055), ('rewards', 0.051), ('rk', 0.046), ('hlq', 0.042), ('vstk', 0.042), ('gridworld', 0.037), ('visit', 0.034), ('rmse', 0.032), ('transition', 0.032), ('agent', 0.032), ('vt', 0.031), ('difference', 0.031), ('transitions', 0.029), ('chain', 0.029), ('lstd', 0.028), ('shane', 0.028), ('wind', 0.028), ('future', 0.028), ('visited', 0.026), ('back', 0.026), ('trace', 0.025), ('averaged', 0.025), ('old', 0.025), ('superior', 0.025), ('peng', 0.025), ('unable', 0.024), ('traces', 0.024), ('rule', 0.023), ('chains', 0.023), ('switch', 0.023), ('update', 0.023), ('brute', 0.022), ('factoring', 0.022), ('minimise', 0.022), ('substituting', 0.022), ('initialise', 0.021), ('marcus', 0.021), ('beat', 0.021), ('geometrically', 0.021), ('states', 0.02), ('bootstrap', 0.02), ('stationary', 0.02), ('changed', 0.02), ('learning', 0.019), ('end', 0.019), ('counter', 0.019), ('experimentally', 0.019), ('ran', 0.019), ('run', 0.018), ('xy', 0.018), ('updating', 0.018), ('tests', 0.017), ('value', 0.017), ('kronecker', 0.017), ('tuning', 0.016), ('start', 0.016), ('things', 0.016), ('history', 0.016), ('row', 0.016), ('super', 0.016), ('settings', 0.015), ('derivation', 0.015), ('tune', 0.015), ('derive', 0.015), ('discount', 0.014), ('incremental', 0.014), ('manual', 0.014), ('estimate', 0.014), ('vs', 0.014), ('situation', 0.014), ('free', 0.014), ('environments', 0.013), ('principles', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 191 nips-2007-Temporal Difference Updating without a Learning Rate
Author: Marcus Hutter, Shane Legg
Abstract: We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. 1
2 0.20884776 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
3 0.1199826 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1
4 0.097796611 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
Author: Alexander L. Strehl, Michael L. Littman
Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.
5 0.094070494 86 nips-2007-Exponential Family Predictive Representations of State
Author: David Wingate, Satinder S. Baveja
Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1
6 0.080481723 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events
7 0.078311414 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
8 0.064278647 30 nips-2007-Bayes-Adaptive POMDPs
9 0.062524609 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
10 0.061136447 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
11 0.059423365 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
12 0.055969398 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
13 0.054582018 197 nips-2007-The Infinite Markov Model
14 0.052898414 127 nips-2007-Measuring Neural Synchrony by Message Passing
15 0.048557796 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
16 0.047988608 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
17 0.046001721 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
18 0.045034859 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
19 0.044078011 100 nips-2007-Hippocampal Contributions to Control: The Third Way
20 0.043068245 171 nips-2007-Scan Strategies for Meteorological Radars
topicId topicWeight
[(0, -0.117), (1, -0.172), (2, 0.062), (3, -0.064), (4, -0.098), (5, 0.054), (6, 0.019), (7, -0.019), (8, 0.026), (9, 0.003), (10, -0.029), (11, -0.045), (12, -0.017), (13, -0.007), (14, 0.015), (15, 0.021), (16, -0.004), (17, 0.032), (18, -0.13), (19, 0.175), (20, -0.065), (21, 0.105), (22, 0.121), (23, -0.061), (24, 0.072), (25, 0.111), (26, -0.015), (27, 0.008), (28, 0.051), (29, -0.02), (30, 0.106), (31, -0.029), (32, -0.031), (33, 0.004), (34, 0.056), (35, 0.038), (36, -0.073), (37, 0.005), (38, -0.0), (39, -0.002), (40, -0.047), (41, -0.045), (42, 0.049), (43, 0.042), (44, 0.107), (45, 0.095), (46, -0.114), (47, -0.03), (48, 0.086), (49, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.93751425 191 nips-2007-Temporal Difference Updating without a Learning Rate
Author: Marcus Hutter, Shane Legg
Abstract: We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. 1
2 0.7816335 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
3 0.66757393 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1
4 0.57940304 86 nips-2007-Exponential Family Predictive Representations of State
Author: David Wingate, Satinder S. Baveja
Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1
5 0.52631378 127 nips-2007-Measuring Neural Synchrony by Message Passing
Author: Justin Dauwels, François Vialatte, Tomasz Rutkowski, Andrzej S. Cichocki
Abstract: A novel approach to measure the interdependence of two time series is proposed, referred to as “stochastic event synchrony” (SES); it quantifies the alignment of two point processes by means of the following parameters: time delay, variance of the timing jitter, fraction of “spurious” events, and average similarity of events. SES may be applied to generic one-dimensional and multi-dimensional point processes, however, the paper mainly focusses on point processes in time-frequency domain. The average event similarity is in that case described by two parameters: the average frequency offset between events in the time-frequency plane, and the variance of the frequency offset (“frequency jitter”); SES then consists of five parameters in total. Those parameters quantify the synchrony of oscillatory events, and hence, they provide an alternative to existing synchrony measures that quantify amplitude or phase synchrony. The pairwise alignment of point processes is cast as a statistical inference problem, which is solved by applying the maxproduct algorithm on a graphical model. The SES parameters are determined from the resulting pairwise alignment by maximum a posteriori (MAP) estimation. The proposed interdependence measure is applied to the problem of detecting anomalies in EEG synchrony of Mild Cognitive Impairment (MCI) patients; the results indicate that SES significantly improves the sensitivity of EEG in detecting MCI.
6 0.49402097 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
7 0.48777261 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
8 0.47751239 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events
9 0.38936567 171 nips-2007-Scan Strategies for Meteorological Radars
10 0.37139517 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
11 0.37119317 197 nips-2007-The Infinite Markov Model
12 0.3430075 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
13 0.28431579 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
14 0.27625054 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
15 0.27226743 52 nips-2007-Competition Adds Complexity
16 0.25605366 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
17 0.23729283 100 nips-2007-Hippocampal Contributions to Control: The Third Way
18 0.2317791 150 nips-2007-Optimal models of sound localization by barn owls
19 0.23034813 203 nips-2007-The rat as particle filter
20 0.22461343 59 nips-2007-Continuous Time Particle Filtering for fMRI
topicId topicWeight
[(5, 0.028), (13, 0.505), (16, 0.013), (18, 0.01), (21, 0.053), (31, 0.013), (34, 0.014), (35, 0.02), (47, 0.089), (83, 0.067), (85, 0.018), (90, 0.049)]
simIndex simValue paperId paperTitle
1 0.95279378 14 nips-2007-A configurable analog VLSI neural network with spiking neurons and self-regulating plastic synapses
Author: Massimiliano Giulioni, Mario Pannunzi, Davide Badoni, Vittorio Dante, Paolo D. Giudice
Abstract: We summarize the implementation of an analog VLSI chip hosting a network of 32 integrate-and-fire (IF) neurons with spike-frequency adaptation and 2,048 Hebbian plastic bistable spike-driven stochastic synapses endowed with a selfregulating mechanism which stops unnecessary synaptic changes. The synaptic matrix can be flexibly configured and provides both recurrent and AER-based connectivity with external, AER compliant devices. We demonstrate the ability of the network to efficiently classify overlapping patterns, thanks to the self-regulating mechanism.
same-paper 2 0.91080499 191 nips-2007-Temporal Difference Updating without a Learning Rate
Author: Marcus Hutter, Shane Legg
Abstract: We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. 1
3 0.88078475 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
Author: Fred Richardson, William M. Campbell
Abstract: Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive sequences, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.
4 0.83833039 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
5 0.79415834 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
6 0.62810779 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
7 0.57025003 117 nips-2007-Learning to classify complex patterns using a VLSI network of spiking neurons
8 0.56233495 84 nips-2007-Expectation Maximization and Posterior Constraints
9 0.49871451 102 nips-2007-Incremental Natural Actor-Critic Algorithms
10 0.48668149 205 nips-2007-Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity
11 0.48442116 9 nips-2007-A Probabilistic Approach to Language Change
12 0.47941905 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
13 0.46987966 86 nips-2007-Exponential Family Predictive Representations of State
14 0.45446813 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
15 0.44659561 63 nips-2007-Convex Relaxations of Latent Variable Training
17 0.44071272 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
18 0.44034144 24 nips-2007-An Analysis of Inference with the Universum
19 0.43791512 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
20 0.43021122 100 nips-2007-Hippocampal Contributions to Control: The Third Way