nips nips2006 nips2006-202 knowledge-graph by maker-knowledge-mining

202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis


Source: pdf

Author: Alborz Geramifard, Michael Bowling, Martin Zinkevich, Richard S. Sutton

Abstract: We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the data-efficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n²), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10,000) to show substantial computational advantages over LSTD. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. [sent-4, score-0.123]

2 iLSTD is an incremental method for achieving results similar to LSTD, the data-efficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. [sent-5, score-0.054]

3 1 Introduction A key part of many reinforcement learning algorithms is a policy evaluation process, in which the value function of a policy is estimated online from data. [sent-8, score-0.164]

4 In this paper, we consider the problem of policy evaluation where the value function estimate is a linear function of state features and is updated after each time step. [sent-9, score-0.112]

5 The TD algorithm updates its value-function estimate based on the observed TD error on each time step. [sent-11, score-0.033]

6 The TD update takes only O(n) computation per time step, where n is the number of features. [sent-12, score-0.035]

7 Rather than making updates on each step to improve the estimate, these methods maintain compact summaries of all observed state transitions and rewards and solve for the value function which has zero expected TD error over the observed data. [sent-15, score-0.096]

8 However, although LSTD and LSTD(λ) make more efficient use of the data, they require O(n²) computation per time step, which is often impractical for the large feature sets needed in many applications. [sent-16, score-0.058]

9 Hence, practitioners are often faced with the dilemma of having to choose between excessive computational expense and excessive data expense. [sent-17, score-0.03]

10 Recently, Geramifard and colleagues [2006] introduced an incremental least-squares TD algorithm, iLSTD, as a compromise between the computational burden of LSTD and the relative data inefficiency of TD. [sent-18, score-0.029]

11 The algorithm focuses on the common situation of large feature sets where only a small number of features are non-zero on any given time step. [sent-19, score-0.059]

12 First, we include the use of eligibility traces, defining iLSTD(λ) consistent with the family of TD(λ) and LSTD(λ) algorithms. [sent-23, score-0.075]

13 We prove that for a general class of selection mechanisms, iLSTD(λ) converges to the same solution as TD(λ) and LSTD(λ), for all 0 ≤ λ ≤ 1. [sent-26, score-0.06]

14 An MDP is a tuple (S, A, P^a_{ss'}, R^a_{ss'}, γ), where S is a set of states, A is a set of actions, P^a_{ss'} is the probability of reaching state s' after taking action a in state s, R^a_{ss'} is the reward received when that transition occurs, and γ ∈ [0, 1] is a discount rate parameter. [sent-31, score-0.086]

15 , where the agent in s1 takes action a1 and receives reward r2 while transitioning to s2 before taking a2 , etc. [sent-35, score-0.038]

16 Given a policy, one often wants to estimate the policy’s state-value function, or expected sum of discounted future rewards: V^π(s) = E[ Σ_{t=1}^∞ γ^{t−1} r_t | s_0 = s, π ]. [sent-36, score-0.078]

17 Let φ : S → ℝ^n be some features of the state space. [sent-38, score-0.043]

18 In this work we will exclusively consider sparse feature representations: for all states s the number of non-zero features in φ(s) is no more than k ≪ n. [sent-40, score-0.093]

19 Sparse feature representations are quite common as a generic approach to handling non-linearity [e. [sent-41, score-0.04]

20 1 Temporal Difference Learning TD(λ) is the traditional approach to policy evaluation [see Sutton and Barto, 1998]. [sent-46, score-0.069]

21 Note that the λ-return is a weighted sum of k-step returns, each of which looks ahead k steps summing the discounted rewards as well as the estimated value of the resulting state. [sent-48, score-0.025]

22 This “forward view” requires a complete trajectory to compute the λ-return and update the parameters. [sent-50, score-0.034]

23 The “backward view” is a more efficient implementation that depends only on one-step returns and an eligibility trace vector: θ_{t+1} = θ_t + α_t u_t(θ_t), where u_t(θ) = z_t (r_{t+1} + γ V_θ(s_{t+1}) − V_θ(s_t)) and z_t = λγ z_{t−1} + φ(s_t); here z_t is the eligibility trace and u_t(θ) is the TD update. [sent-51, score-0.285]
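
A minimal sketch of this backward-view update, assuming dense numpy vectors and a function name of our own (the paper's features are sparse, which this sketch ignores for clarity):

```python
import numpy as np

def td_lambda_step(theta, z, phi_s, phi_s_next, reward, alpha, gamma, lam):
    """One backward-view TD(lambda) step with linear values V(s) = phi(s) . theta."""
    z = lam * gamma * z + phi_s                                    # z_t = lambda*gamma*z_{t-1} + phi(s_t)
    delta = reward + gamma * (phi_s_next @ theta) - phi_s @ theta  # one-step TD error
    theta = theta + alpha * delta * z                              # u_t(theta) = delta_t * z_t
    return theta, z
```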

24 In the special case where λ = 0 and the feature representation is sparse, this complexity can be reduced to O(k). [sent-53, score-0.055]

25 1 Throughout this paper we will use non-bolded symbols to refer to scalars (e. [sent-55, score-0.025]

26 , θ and bt ), and bold-faced upper-case symbols for matrices (e. [sent-59, score-0.07]

27 LSTD(λ) can be viewed as immediately solving for the value function parameters which would result in the sum of TD updates over the observed trajectory being zero. [sent-64, score-0.05]

28 Let µt (θ) be the sum of the TD updates through time t. [sent-65, score-0.033]

29 If we let φ_t = φ(s_t), then µ_t(θ) = Σ_{i=1}^t u_i(θ) = Σ_{i=1}^t z_i (r_{i+1} + γ V_θ(s_{i+1}) − V_θ(s_i)) = Σ_{i=1}^t z_i (r_{i+1} + γ φ^T_{i+1} θ − φ^T_i θ) = Σ_{i=1}^t z_i r_{i+1} − [ Σ_{i=1}^t z_i (φ_i − γ φ_{i+1})^T ] θ = b_t − A_t θ.  (1) [sent-66, score-0.096]

30 Since we want to choose parameters such that the sum of TD updates is zero, we set Equation 1 to zero and solve for the new parameter vector, θ_{t+1} = A_t^{−1} b_t. [sent-67, score-0.14]
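
A rough sketch of the LSTD(λ) bookkeeping implied by Equation 1 (our own names, dense matrices; a practical implementation maintains A_t^{-1} incrementally, e.g. via a Sherman–Morrison update, instead of re-solving every step):

```python
import numpy as np

def lstd_lambda_step(A, b, z, phi_s, phi_s_next, reward, gamma, lam, ridge=1e-6):
    """Accumulate A_t and b_t from one transition, then solve for theta = A_t^{-1} b_t."""
    z = lam * gamma * z + phi_s
    A += np.outer(z, phi_s - gamma * phi_s_next)              # A_t += z_t (phi_t - gamma*phi_{t+1})^T
    b += reward * z                                           # b_t += r_{t+1} z_t
    theta = np.linalg.solve(A + ridge * np.eye(len(b)), b)    # small ridge guards against a singular A
    return A, b, z, theta
```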

31 The online version of LSTD(λ) incorporates each observed reward and state transition into the b vector and the A matrix and then solves for a new θ. [sent-68, score-0.047]

32 3 iLSTD iLSTD was recently introduced to provide a balance between LSTD’s data efficiency and TD’s time efficiency for λ = 0 when the feature representation is sparse [Geramifard et al. [sent-74, score-0.06]

33 The update to θ requires some care as the sum TD update itself would require O(n²). [sent-77, score-0.034]

34 iLSTD instead updates only single dimensions of θ, each of which requires O(n). [sent-78, score-0.033]

35 The result is that iLSTD can scale to much larger feature spaces than LSTD, while still retaining much of its data efficiency. [sent-80, score-0.04]

36 By also generalizing the mechanism used to select the feature parameters to update, we additionally prove sufficient conditions for convergence. [sent-83, score-0.118]

37 First, it uses eligibility traces (z) to handle λ > 0. [sent-86, score-0.12]

38 Line 5 updates z, and lines 5–9 incrementally compute the same At , bt , and µt as described in Equation 1. [sent-87, score-0.078]

39 Any feature selection mechanism can be employed in line 11 to select a dimension of the sum TD update vector (µ). [sent-89, score-0.19]

40 2 Line 12 will then take a step in that dimension, and line 13 updates the µ vector accordingly. [sent-90, score-0.062]

41 The original iLSTD algorithm can be recovered by simply setting λ to zero and selecting features according to the dimension of µ with maximal magnitude. [sent-91, score-0.033]
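
Pieced together from the description above, one per-step sketch of iLSTD(λ) might look as follows; this is our reconstruction rather than the paper's Algorithm 1, and it uses the greedy selection rule just described (the variant without a convergence guarantee):

```python
import numpy as np

def ilstd_lambda_step(theta, mu, A, b, z, phi_s, phi_s_next, reward,
                      alpha, gamma, lam, m=1):
    """One iLSTD(lambda) step (dense sketch; the real algorithm exploits sparsity).

    Maintains A_t, b_t and the sum-TD-update vector mu_t = b_t - A_t theta,
    then takes m single-dimension steps, choosing dimensions greedily by |mu|.
    """
    z = lam * gamma * z + phi_s
    dA = np.outer(z, phi_s - gamma * phi_s_next)
    A += dA
    b += reward * z
    mu += reward * z - dA @ theta          # keep mu = b - A*theta consistent
    for _ in range(m):
        j = int(np.argmax(np.abs(mu)))     # greedy: dimension of mu with maximal magnitude
        step = alpha * mu[j]
        theta[j] += step                   # O(1) change to theta
        mu -= step * A[:, j]               # mu changes by -step * A e_j, an O(n) update
    return theta, mu, A, b, z
```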

42 2 The choice of this mechanism will determine the convergence properties of the algorithm, as discussed in the next section. [sent-93, score-0.083]

43 If there are n features and, for any given state s, φ(s) has at most k non-zero elements, then the iLSTD(λ) algorithm requires O((m + k)n) computation per time step. [sent-95, score-0.061]

44 Since we assumed that each feature vector has at most k non-zero elements, and the z vector can have up to n non-zero elements, the z(φ(s) − γφ(s'))^T matrix (line 7) has at most 2kn non-zero elements. [sent-97, score-0.04]
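
To see where the 2kn bound comes from, here is a small scipy sketch (the value of k, γ, and the random features are placeholders of our own):

```python
import numpy as np
from scipy import sparse

n, k, gamma = 10_000, 10, 0.99
rng = np.random.default_rng(0)

def sparse_phi():
    """A feature vector with exactly k non-zero entries (demo assumption)."""
    phi = sparse.lil_matrix((n, 1))
    for i in rng.choice(n, size=k, replace=False):
        phi[i, 0] = 1.0
    return phi.tocsr()

z = sparse.csr_matrix(rng.random((n, 1)))   # the trace may be fully dense: up to n non-zeros
d = sparse_phi() - gamma * sparse_phi()     # phi(s) - gamma*phi(s') has at most 2k non-zeros
update = z @ d.T                            # the rank-one term from line 7 of the algorithm
print(update.nnz <= 2 * k * n)              # True: at most 2*k*n non-zero elements
```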

45 However, whereas in their analysis they considered Ct and dt that had expectations that converged quickly, we consider Ct and dt that may converge more slowly, but in value instead of expectation. [sent-104, score-0.106]

46 In order to establish our result, we consider the theoretical model where for all t, y_t ∈ ℝ^n, d_t ∈ ℝ^n, R_t, C_t ∈ ℝ^{n×n}, β_t ∈ ℝ, and: y_{t+1} = y_t + β_t R_t (C_t y_t + d_t). (2) [sent-105, score-0.254]

47 On every round, C_t and d_t are selected first, followed by R_t. [sent-106, score-0.053]

48 In order to prove convergence of yt , we assume that there is a C∗ , d∗ , v, µ > 0, and M such that: A1. [sent-109, score-0.086]

49 C* is negative definite, C_t converges to C* with probability 1, d_t converges to d* with probability 1, E[R_t | F_t] = I, and ‖R_t‖ ≤ M. [sent-113, score-0.093]

50 Theorem 2 Given the above assumptions, y_t converges to −(C*)^{−1} d* with probability 1. [sent-117, score-0.087]

51 y_t = θ_t, β_t = tα_t/n, C_t = −A_t/t, d_t = b_t/t, and R_t is a matrix with an n on the diagonal in position (k_t, k_t), where k_t is uniform random over the set {1, ..., n}. [sent-124, score-0.207]
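
For concreteness, the random scaling matrix R_t used in this reduction can be built as below (a sketch; the dimension and rng are arbitrary):

```python
import numpy as np

def random_scaling_matrix(n, rng):
    """R_t: a single n on the diagonal at a uniformly random position (k_t, k_t).

    Averaging over k_t gives E[R_t] = I, and the norm of R_t is bounded by n,
    matching the assumptions used in the convergence analysis.
    """
    R = np.zeros((n, n))
    k_t = rng.integers(n)
    R[k_t, k_t] = n
    return R

rng = np.random.default_rng(0)
print(np.mean([random_scaling_matrix(4, rng) for _ in range(10_000)], axis=0).round(1))  # close to I
```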

52 The final assumption defines the simplest possible feature selection mechanism sufficient for convergence, viz. [sent-131, score-0.144]

53 Theorem 3 If the Markov decision process is finite, iLSTD(λ) with a uniform random feature selection mechanism converges to the same result as TD(λ). [sent-133, score-0.164]

54 However, the greedy selection of the original iLSTD algorithm does not meet these conditions, and so has no guarantee of convergence. [sent-135, score-0.15]

55 As we will see in the next section, though, greedy selection performs quite well despite this lack of asymptotic guarantee. [sent-136, score-0.155]

56 In summary, finding a good feature selection mechanism remains an open research question. [sent-137, score-0.144]

57 This does not correspond to any feature selection mechanism and in fact requires O(n²) computation. [sent-141, score-0.144]

58 Theorem 4 If C_t is negative definite, then for some β dependent upon C_t, if R_t = I, there exists a ζ ∈ (0, 1) such that for all y_t, if y_{t+1} = y_t + β(C_t y_t + d_t), then ‖y_{t+1} + (C_t)^{−1} d_t‖ < ζ ‖y_t + (C_t)^{−1} d_t‖. [sent-144, score-0.427]
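
A quick numerical illustration of this contraction (an arbitrary small negative definite C and a hand-picked β, purely for intuition):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
C = -(B @ B.T + np.eye(3))               # negative definite by construction
d = rng.standard_normal(3)
target = -np.linalg.solve(C, d)          # the fixed point -(C)^{-1} d
y = rng.standard_normal(3)
beta = 0.05                              # small enough for this particular C

for _ in range(5):
    before = np.linalg.norm(y - target)  # ||y_t + C^{-1} d||
    y = y + beta * (C @ y + d)
    after = np.linalg.norm(y - target)   # ||y_{t+1} + C^{-1} d||
    print(after < before)                # contraction: prints True each iteration
```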

59 5 Empirical Results We now examine the empirical performance of iLSTD(λ). [sent-146, score-0.028]

60 We then explore the larger mountain car problem with a tile coding function approximator. [sent-148, score-0.259]

61 We evaluate both the random feature selection mechanism (“iLSTD-random”), which is guaranteed to converge, as well as the original iLSTD feature selection rule (“iLSTD-greedy”), which is not. [sent-150, score-0.24]
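
The selection rules compared here (plus the Boltzmann variant that appears in the Figure 3 legend) can be sketched as simple functions of the sum-TD-update vector µ; the temperature below is an arbitrary placeholder, not a value from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_greedy(mu):
    """Original iLSTD rule: the dimension of mu with maximal magnitude."""
    return int(np.argmax(np.abs(mu)))

def select_random(mu):
    """Uniform-random rule; this is the mechanism covered by the convergence guarantee."""
    return int(rng.integers(len(mu)))

def select_boltzmann(mu, temperature=1.0):
    """Softmax over |mu|; the temperature value here is a placeholder assumption."""
    prefs = np.abs(mu) / temperature
    p = np.exp(prefs - prefs.max())
    return int(rng.choice(len(mu), p=p / p.sum()))
```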

62 Figure 1: The two experimental domains: (a) Boyan’s chain example and (b) mountain car. [sent-165, score-0.178]

63 Figure 2: Performance of various algorithms in Boyan’s chain problem with 6 different lambda values. [sent-170, score-0.076]

64 Each line represents the averaged error over the last 100 episodes after 100, 200, and 1000 episodes, respectively. [sent-171, score-0.201]

65 1 Boyan Chain Problem The first domain we consider is the Boyan chain problem. [sent-174, score-0.075]

66 Figure 1(a) shows the Markov chain together with the feature vectors corresponding to each state. [sent-175, score-0.097]

67 The chain starts in state 13 and finishes in state 0. [sent-177, score-0.105]

68 The reward is −3 for all transitions except from state 2 to 1 and state 1 to 0, where the rewards are −2 and 0, respectively. [sent-179, score-0.096]

69 The horizontal axis corresponds to different λ values, while the vertical axis illustrates the RMS error in a log scale averaged over all states uniformly. [sent-181, score-0.074]

70 Note that in this domain, the optimum solution is in the space spanned by the feature vectors: θ* = (−24, −16, −8, 0)^T. [sent-182, score-0.04]
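
For intuition, one common construction of these four-dimensional features is linear interpolation between anchor states 12, 8, 4, and 0; the sketch below assumes that construction rather than copying Figure 1(a) exactly:

```python
import numpy as np

def boyan_features(s, anchors=(12, 8, 4, 0)):
    """Piecewise-linear features for chain state s (assumed construction, 0 <= s <= 12)."""
    phi = np.zeros(len(anchors))
    for i in range(len(anchors) - 1):
        hi, lo = anchors[i], anchors[i + 1]
        if lo <= s <= hi:
            w = (s - lo) / (hi - lo)
            phi[i], phi[i + 1] = w, 1.0 - w
            break
    return phi

theta_star = np.array([-24.0, -16.0, -8.0, 0.0])
print(boyan_features(12) @ theta_star)   # -24.0 at the first anchor state
print(boyan_features(10) @ theta_star)   # -20.0, interpolating between anchors
print(boyan_features(0) @ theta_star)    # 0.0 at the terminal state
```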

71 Each line shows the averaged error over the last 100 episodes after 100, 200, and 1000 episodes, over the same set of observed trajectories based on 30 trials. [sent-183, score-0.201]

72 Finally, notice that iLSTD-Greedy(λ), despite its lack of asymptotic guarantee, actually performs slightly better than iLSTD-Random(λ) for all cases of λ. [sent-186, score-0.036]

73 Table 1 shows the total averaged per-step CPU time for each method. [sent-188, score-0.034]

74 For all methods sparse matrix optimizations were utilized and LSTD used the efficient incremental inverse implementation. [sent-189, score-0.079]

75 Although TD(λ) is the fastest method, the overall difference between the timings in this domain is very small, which is due to the small number of features and a small ratio n/k. [sent-190, score-0.037]

76 In the next domain, we illustrate the effect of a larger and more interesting feature space where this ratio is larger. [sent-191, score-0.04]

77 Table 1 header: Algorithm — TD(λ), iLSTD(λ), LSTD(λ); CPU time/step (msec) for Boyan’s chain and mountain car. [sent-192, score-0.174]

78 Table 1: The averaged CPU time per step of the algorithms used in Boyan’s chain and mountain car problems. [sent-203, score-0.034]

79 Figure 3: Performance of various methods (iLSTD-Greedy, iLSTD-Random, Boltzmann, and LSTD) in the mountain car problem with two different lambda values. [sent-216, score-0.257]

80 Illustrated in Figure 1(b), the episodic task for the car is to reach the goal state. [sent-223, score-0.021]

81 Possible actions are accelerate forward, accelerate backward, and coast. [sent-224, score-0.048]

82 Further details about the mountain car problem are available online [RL Library, 2006]. [sent-227, score-0.238]

83 As we are focusing on policy evaluation, the policy was fixed for the car to always accelerate in the direction of its current velocity, although the environment is stochastic and the chosen action is replaced with a random one with 10% probability. [sent-228, score-0.28]

84 We used ten tilings (k = 10) over the combination of the two parameter sets and hashed the tilings into 10,000 features (n = 10,000). [sent-232, score-0.067]
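
A rough, hashed tile-coding sketch in this spirit (the state-variable ranges, grid resolution, and hashing scheme are our own assumptions; the experiments used standard tile-coding software):

```python
import numpy as np

def tile_features(position, velocity, n_tilings=10, n_features=10_000, grid=9):
    """Hashed tile coding over (position, velocity) with up to n_tilings active features."""
    p = (position + 1.2) / 1.7            # assumed mountain-car position range [-1.2, 0.5]
    v = (velocity + 0.07) / 0.14          # assumed velocity range [-0.07, 0.07]
    phi = np.zeros(n_features)
    for t in range(n_tilings):
        offset = t / n_tilings             # each tiling is shifted by a different offset
        cell = (t, int(p * grid + offset), int(v * grid + offset))
        phi[hash(cell) % n_features] = 1.0
    return phi

phi = tile_features(-0.5, 0.0)
print(int(phi.sum()))                      # at most 10 (k = 10) active features
```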

85 The rest of the settings were identical to those in the Boyan chain domain. [sent-233, score-0.057]

86 The horizontal axis shows the number of episodes, while the vertical axis represents our loss function in log scale. [sent-235, score-0.055]

87 The loss we used was ‖b_λ − A_λ θ‖_2, where A_λ and b_λ were computed for each λ from 200,000 episodes of interaction with the environment. [sent-236, score-0.098]
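
Computing that loss is one line once estimates of A_λ and b_λ are in hand; a sketch:

```python
import numpy as np

def evaluation_loss(A_lam, b_lam, theta):
    """||b_lambda - A_lambda * theta||_2, the quantity plotted on the vertical axis of Figure 3."""
    return float(np.linalg.norm(b_lam - A_lam @ theta))
```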

88 This fact may suggest that the greedy feature selection mechanism does not converge, or it may simply be more sensitive to the learning rate. [sent-243, score-0.24]

89 Finally, notice that the plotted loss depends upon λ, and so the two graphs cannot be directly compared. [sent-244, score-0.032]

90 Again, sparse matrix optimizations were utilized and LSTD(λ) used the efficient incremental inverse implementation. [sent-246, score-0.079]

91 The computational demands of LSTD(λ) can easily prohibit its application in domains with a large feature space. [sent-247, score-0.057]

92 When the feature representation is sparse, though, iLSTD(λ) can still achieve results competitive with LSTD(λ) using computation more on par with the time efficient TD(λ). [sent-248, score-0.04]

93 6 Conclusion In this paper, we extended the previous iLSTD algorithm by incorporating eligibility traces without increasing the asymptotic per time-step complexity. [sent-249, score-0.157]

94 This extension resulted in improvements in performance in both the Boyan chain and mountain car domains. [sent-250, score-0.295]

95 We also relaxed the dimension selection mechanism of the algorithm and presented sufficient conditions on the mechanism under which iLSTD(λ) is guaranteed to converge. [sent-251, score-0.198]

96 Our results have focused on two very simple feature selection mechanisms: random and greedy. [sent-254, score-0.08]

97 Although the greedy mechanism does not meet our sufficient conditions for convergence, it actually performed slightly better on the examined domains than the theoretically guaranteed random selection. [sent-255, score-0.207]

98 It would be interesting to perform a thorough exploration of possible mechanisms to find one that combines good empirical performance with satisfaction of our sufficient conditions for convergence. [sent-256, score-0.101]

99 In addition, it would be interesting to apply iLSTD(λ) in even more challenging environments where the large number of features has completely prevented the least-squares approach, such as in simulated soccer keepaway [Stone et al. [sent-257, score-0.038]

100 Generalization in reinforcement learning: Successful examples using sparse coarse coding. [sent-309, score-0.061]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ilstd', 0.654), ('lstd', 0.463), ('td', 0.334), ('boyan', 0.285), ('mountain', 0.121), ('car', 0.117), ('greedy', 0.096), ('barto', 0.083), ('episodes', 0.083), ('sutton', 0.083), ('rt', 0.078), ('eligibility', 0.075), ('ct', 0.07), ('yt', 0.067), ('mechanism', 0.064), ('episode', 0.059), ('geramifard', 0.059), ('tsitsiklis', 0.059), ('chain', 0.057), ('policy', 0.054), ('dt', 0.053), ('medium', 0.053), ('bradtke', 0.048), ('boltzmann', 0.047), ('bt', 0.045), ('traces', 0.045), ('reinforcement', 0.041), ('feature', 0.04), ('selection', 0.04), ('richard', 0.034), ('stone', 0.033), ('updates', 0.033), ('st', 0.032), ('incremental', 0.029), ('temporal', 0.025), ('symbols', 0.025), ('alberta', 0.025), ('rewards', 0.025), ('zt', 0.025), ('state', 0.024), ('mechanisms', 0.024), ('alborz', 0.024), ('measureerror', 0.024), ('pss', 0.024), ('tilings', 0.024), ('bertsekas', 0.024), ('accelerate', 0.024), ('hard', 0.023), ('reward', 0.023), ('kn', 0.023), ('kt', 0.021), ('tile', 0.021), ('episodic', 0.021), ('bowling', 0.021), ('justin', 0.021), ('averaged', 0.02), ('ut', 0.02), ('converges', 0.02), ('axis', 0.02), ('sparse', 0.02), ('asymptotic', 0.019), ('features', 0.019), ('convergence', 0.019), ('lambda', 0.019), ('soccer', 0.019), ('rl', 0.018), ('cpu', 0.018), ('domain', 0.018), ('per', 0.018), ('easy', 0.017), ('trajectory', 0.017), ('domains', 0.017), ('update', 0.017), ('roy', 0.017), ('notice', 0.017), ('zi', 0.017), ('guaranteed', 0.016), ('environment', 0.016), ('ciency', 0.015), ('complexity', 0.015), ('loss', 0.015), ('excessive', 0.015), ('utilized', 0.015), ('optimizations', 0.015), ('action', 0.015), ('examine', 0.015), ('evaluation', 0.015), ('line', 0.015), ('library', 0.014), ('meet', 0.014), ('theorem', 0.014), ('step', 0.014), ('van', 0.014), ('rms', 0.014), ('generalizing', 0.014), ('dimension', 0.014), ('states', 0.014), ('ri', 0.014), ('tuple', 0.013), ('empirical', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

Author: Alborz Geramifard, Michael Bowling, Martin Zinkevich, Richard S. Sutton

Abstract: We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the data-efficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n²), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10,000) to show substantial computational advantages over LSTD. 1

2 0.073430821 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

Author: Jeremy Lewi, Robert Butera, Liam Paninski

Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efficiency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be defined a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in finding a neuron’s “preferred stimulus,” is the stimulus which maximizes the firing rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to define the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identification approach: we define the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in † ‡ http://www.prism.gatech.edu/∼gtg120z http://www.stat.columbia.edu/∼liam which the probability p(rt |{xt , xt−1 , ..., xt−tk }, {rt−1 , . . . , rt−ta }) of the neural response rt given the current and past stimuli {xt , xt−1 , ..., xt−tk }, and the observed recent history of the neuron’s activity, {rt−1 , . . . , rt−ta }, can be described by a model p(rt |{xt }, {rt−1 }, θ), specified by a finite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. 
Two inconvenient facts make it difficult to realize this goal in a computationally efficient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt |{xt }, {rt−1 }, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ|{rt }, {xt }) given new stimulus-response data, and 2) find the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on finding solutions which scale well in high dimensions. We solve problem (1) by using efficient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable onedimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efficient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous firing rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; briefly, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-fire cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as: tk λt = E(rt ) = f ta aj rt−j ki,t−l xi,t−l + i l=1 (1) j=1 In the above summation the filter coefficients ki,t−l capture the dependence of the neuron’s instantaneous firing rate λt on the ith component of the vector stimulus at time t − l, xt−l ; the model therefore allows for spatiotemporal receptive fields. For convenience, we arrange all the stimulus coefficients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive field. The coefficients aj model the dependence on the observed recent activity r at time t − j (these terms may reflect e.g. refractory effects, burstiness, firing-rate adaptation, etc., depending on the value of the vector a [9]). 
For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown filter coefficients, θ, given knowledge of the stimuli, xt , and the resulting responses rt . We chose the nonlinear stage of the GLM, the link function f (), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our first key task is to efficiently update the posterior distribution of θ after t trials, p(θt |xt , rt ), as new stimulus-response pairs are trial 100 trial 500 trial 2500 trial 5000 θ true 1 info. max. trial 0 0 random −1 (a) random info. max. 2000 Time(Seconds) Entropy 1500 1000 500 0 −500 0 1000 2000 3000 Iteration (b) 4000 5000 0.1 total time diagonalization posterior update 1d line Search 0.01 0.001 0 200 400 Dimensionality 600 (c) Figure 1: A) Plots of the estimated receptive field for a simulated visual neuron. The neuron’s receptive field θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt . The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-fit by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std. observed. (We use xt and rt to abbreviate the sequences {xt , . . . , x0 } and {rt , . . . , r0 }.) To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justified by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt |xt , rt ): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt . In general, computing these terms directly requires O(td2 + d3 ) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. 
Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simplifies matters, we use Bayes to write out the posterior: 1 −1 log p(θ|rt , xt ) = − (θ − µt−1 )T Ct−1 (θ − µt−1 ) + − exp {xt ; rt−1 }T θ 2 + rt {xt ; rt−1 }T θ + const d log p(θ|rt , xt ) −1 = −(θ − µt−1 )T Ct−1 + (2) − exp({xt ; rt−1 }T θ) + rt {xt ; rt−1 }T dθ d2 log p(θ|rt , xt ) −1 = −Ct−1 − exp({xt ; rt−1 }T θ){xt ; rt−1 }{xt ; rt−1 }T dθi dθj (3) Now, to update µt we only need to find the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that that the likelihood only varies along a single direction, {xt ; rt−1 }, as a function of θ. At the peak of the posterior, µt , the first term in the gradient must be parallel to {xt ; rt−1 } because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1 }. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1 }; this requires only O(d2 ) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1 , which can be evaluated in O(d2 ) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efficiency; our simulations (data not shown) showed that, despite this improved efficiency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the firing rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 ) (4) with respect to the stimulus xt+1 . The first term does not depend on xt+1 , so maximizing the information requires minimizing the conditional entropy H(θ|xt+1 , rt+1 ) = p(rt+1 |xt+1 ) −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 )dθ = Ert+1 |xt+1 log det[Ct+1 ] + const. rt+1 (5) We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize Ert+1 |xt+1 log det[Ct+1 ] with respect to xt+1 . Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have: −1 Ct+1 = Ct + Jobs (rt+1 , xt+1 ) −1 , (6) Jobs is the observed Fisher information. Jobs (rt+1 , xt+1 ) = −∂ 2 log p(rt+1 |ε = xt θ)/∂ε2 xt+1 xt t+1 t+1 (7) Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt θ. t+1 We can use the Woodbury lemma to evaluate the inverse: Ct+1 = Ct I + D(rt+1 , ε)(1 − D(rt+1 , ε)xt Ct xt+1 )−1 xt+1 xt Ct t+1 t+1 (8) where D(rt+1 , ε) = ∂ 2 log p(rt+1 |ε)/∂ε2 . Using some basic matrix identities, log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε)xt Ct xt+1 ) t+1 = log det[Ct ] + D(rt+1 , ε)xt Ct xt+1 t+1 + o(D(rt+1 , ε)xt Ct xt+1 ) t+1 (9) (10) Ignoring the higher order terms, we need to minimize Ert+1 |xt+1 D(rt+1 , ε)xt Ct xt+1 . In our case, t+1 with f (θt xt+1 ) = exp(θt xt+1 ), we can use the moment-generating function of the multivariate Trial info. max. i.i.d 2 400 −10−4 0 0.05 −10−1 −2 ai 800 2 0 −2 −7 −10 i i.i.d k info. max. 
1 1 50 i 1 50 i 1 10 1 i (a) i 100 0 −0.05 10 1 1 (b) i 10 (c) Figure 2: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefficients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefficients, k. B) The estimated spike history coefficients, a. C) The final estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a. Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we find that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize 1 F (xt+1 ) = exp(xT µt ) exp( xT Ct xt+1 )xT Ct xt+1 . t+1 t+1 2 t+1 (11) Computing the optimal stimulus. For the GLM the most informative stimulus is undefined, since increasing the stimulus power ||xt+1 ||2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a onedimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields: 1 2 2 F (xt+1 ) = exp( u i yi + ci yi ) ci yi (12) 2 i i i = g( 2 ci yi ) ui yi )h( i (13) i where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly defined by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by fixing the value of i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of i ui yi will then find the global maximum of F (.). This approach is summarized by the equation: max F (y) = max g(b) · y:||y||2 =e b max y:||y||2 =e,y t u=b 2 ci yi ) h( i Since h() is increasing, to solve the inner problem we only need to solve: 2 ci yi max y:||y||2 =e,y t u=b (14) i This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of 1 true θ random info. max. info. max. no diffusion 1 0.8 0.6 trial 0.4 0.2 400 0 −0.2 −0.4 800 1 100 θi 1 θi 100 1 θi 100 1 θ i 100 −0.6 random info. max. θ true θ i 1 0 −1 Entropy θ i 1 0 −1 random info. max. 250 200 i 1 θ Trial 400 Trial 200 Trial 0 (a) 0 −1 20 40 (b) i 60 80 100 150 0 200 400 600 Iteration 800 (c) Figure 3: Estimating the receptive field when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. 
In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively. equations for the optimal yi as a function of the Lagrange multiplier λ1 . ui e yi (λ1 ) = ||y||2 2(ci − λ1 ) (15) Thus to find the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (). It is possible to show (details omitted) that the maximum of F () must occur on the interval λ1 ≥ c0 , where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch significantly faster and more stable. To summarize, updating the posterior and finding the optimal stimulus requires three steps: 1) a rankone matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d3 ), while the other steps, as discussed above, are O(d2 ). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modification of Ct−1 , and there exist efficient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modification of the eigendecomposition is still O(d3 ), we found the average running time in our case to be O(d2 ) (Fig. 1(c)), due to deflation which reduces the cost of matrix multiplications associated with finding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d2 ) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we fixed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to find the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefficients, k, and the spike history coefficients,a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion: θt+1 = θt + wt (16) Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q. 
To update the posterior and choose the optimal stimulus, we use the same procedure as described above1 . Results Our first simulation considered the use of our algorithm for learning the receptive field of a visually sensitive neuron. We took the neuron’s receptive field to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information maximizing stimuli compared to using random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µt in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are supported by the conclusion of [7] that the information maximization strategy is asymptotically never worse than using random stimuli and is in general more efficient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual core Intel 2.80GHz XEON processor running Matlab. The solid lines indicate fitted polynomials of degree 1 for the 1d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d2 ), as predicted. When θ was less than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments. In Fig. 2 we apply our algorithm to characterize the receptive field of a neuron whose response depends on its past spiking. Here, the stimulus coefficients k were chosen to follow a sine-wave; 1 The one difference is that the covariance matrix of p(θt+1 |xt+1 , rt+1 ) is in general no longer just a rankone modification of the covariance matrix of p(θt |xt , rt ); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of Ct + Q are those of Ct and the eigenvalues are ci + c where ci is the ith eigenvalue of Ct ; thus in this case, our methods may be applied without modification. the spike history coefficients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefficients and the spike history coefficients. The information maximizing strategy outperformed random sampling for estimating both the spike history and stimulus coefficients. Our final set of results, Fig. 3, considers a neuron whose receptive field drifts non-systematically with time. We take the receptive field to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context. Conclusion We have developed an efficient implementation of an algorithm for online optimization of neurophysiology experiments based on information-theoretic criterion. 
Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations the algorithm consistently provides significant improvements over random sampling; indeed, the differences in efficiency are large enough that the information-optimization strategy may permit robust system identification in cases where it is simply not otherwise feasible to estimate the neuron’s parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique significantly extends the reach and power of classical neurophysiology methods. Acknowledgments JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] I. Nelken, et al., Hearing Research 72, 237 (1994). P. Foldiak, Neurocomputing 38–40, 1217 (2001). K. Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004). R. C. deCharms, et al., Science 280, 1439 (1998). C. Machens, et al., Neuron 47, 447 (2005). A. Watson, et al., Perception and Psychophysics 33, 113 (1983). L. Paninski, Neural Computation 17, 1480 (2005). P. McCullagh, et al., Generalized linear models (Chapman and Hall, London, 1989). L. Paninski, Network: Computation in Neural Systems 15, 243 (2004). E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn. P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001). E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001). F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001). L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004). M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994). N. A. Lesica, et al., IEEE Trans. On Neural Systems And Rehabilitation Engineering 13, 194 (2005).

3 0.072004065 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Author: Peter Auer, Ronald Ortner

Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1

4 0.059372675 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.

5 0.053450961 203 nips-2006-implicit Online Learning with Kernels

Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan

Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1

6 0.046557177 79 nips-2006-Fast Iterative Kernel PCA

7 0.044746239 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

8 0.043307908 44 nips-2006-Bayesian Policy Gradient Algorithms

9 0.042740718 154 nips-2006-Optimal Change-Detection and Spiking Neurons

10 0.042347044 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

11 0.038080156 146 nips-2006-No-regret Algorithms for Online Convex Programs

12 0.03494383 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

13 0.034660369 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

14 0.033523474 33 nips-2006-Analysis of Representations for Domain Adaptation

15 0.032617796 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

16 0.03181744 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

17 0.03058565 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

18 0.029931482 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

19 0.029769376 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments

20 0.028244389 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.089), (1, -0.009), (2, -0.107), (3, -0.036), (4, -0.021), (5, -0.038), (6, 0.019), (7, -0.091), (8, 0.052), (9, -0.031), (10, 0.002), (11, 0.004), (12, -0.011), (13, -0.018), (14, 0.002), (15, -0.006), (16, -0.019), (17, 0.025), (18, 0.019), (19, -0.017), (20, -0.002), (21, 0.017), (22, 0.034), (23, -0.013), (24, -0.007), (25, 0.021), (26, -0.016), (27, -0.008), (28, -0.02), (29, -0.01), (30, 0.005), (31, -0.023), (32, -0.027), (33, 0.032), (34, -0.021), (35, 0.006), (36, -0.003), (37, -0.033), (38, 0.065), (39, 0.001), (40, -0.061), (41, 0.017), (42, 0.021), (43, -0.056), (44, -0.018), (45, -0.012), (46, -0.068), (47, 0.045), (48, 0.033), (49, -0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.881818 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

Author: Alborz Geramifard, Michael Bowling, Martin Zinkevich, Richard S. Sutton

Abstract: We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the dataefficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n2 ), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10, 000) to show substantial computational advantages over LSTD. 1

2 0.6534586 44 nips-2006-Bayesian Policy Gradient Algorithms

Author: Mohammad Ghavamzadeh, Yaakov Engel

Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1

3 0.6092062 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Author: Peter Auer, Ronald Ortner

Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1

4 0.60505432 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.

5 0.53682983 154 nips-2006-Optimal Change-Detection and Spiking Neurons

Author: Angela J. Yu

Abstract: Survival in a non-stationary, potentially adversarial environment requires animals to detect sensory changes rapidly yet accurately, two oft competing desiderata. Neurons subserving such detections are faced with the corresponding challenge to discern “real” changes in inputs as quickly as possible, while ignoring noisy fluctuations. Mathematically, this is an example of a change-detection problem that is actively researched in the controlled stochastic processes community. In this paper, we utilize sophisticated tools developed in that community to formalize an instantiation of the problem faced by the nervous system, and characterize the Bayes-optimal decision policy under certain assumptions. We will derive from this optimal strategy an information accumulation and decision process that remarkably resembles the dynamics of a leaky integrate-and-fire neuron. This correspondence suggests that neurons are optimized for tracking input changes, and sheds new light on the computational import of intracellular properties such as resting membrane potential, voltage-dependent conductance, and post-spike reset voltage. We also explore the influence that factors such as timing, uncertainty, neuromodulation, and reward should and do have on neuronal dynamics and sensitivity, as the optimal decision strategy depends critically on these factors. 1

6 0.52934164 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

7 0.50784159 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

8 0.4847838 203 nips-2006-implicit Online Learning with Kernels

9 0.48323986 98 nips-2006-Inferring Network Structure from Co-Occurrences

10 0.48120549 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

11 0.43860438 79 nips-2006-Fast Iterative Kernel PCA

12 0.41845569 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising

13 0.40678626 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

14 0.40099612 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight

15 0.39111149 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

16 0.35864639 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints

17 0.34514365 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

18 0.34118432 74 nips-2006-Efficient Structure Learning of Markov Networks using $L_1$-Regularization

19 0.33654386 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

20 0.33296013 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.077), (2, 0.029), (3, 0.019), (7, 0.047), (8, 0.369), (9, 0.035), (20, 0.014), (22, 0.038), (44, 0.055), (57, 0.063), (65, 0.056), (69, 0.029), (71, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73241252 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

Author: Alborz Geramifard, Michael Bowling, Martin Zinkevich, Richard S. Sutton

Abstract: We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the dataefficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n2 ), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10, 000) to show substantial computational advantages over LSTD. 1
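
As a rough illustration of the kind of update the abstract describes, the Python sketch below maintains the LSTD statistics incrementally and spends only a few dimension-wise descent steps per time step. It is schematic, not the authors' exact algorithm: the step size, the number of descent steps m, and the dense-numpy representation are assumptions, and the O(n) cost quoted above depends on sparse feature vectors, which this sketch does not exploit.

    import numpy as np

    def ilstd_lambda(transitions, n, m=1, alpha=0.01, gamma=0.99, lam=0.0):
        # Policy evaluation over a list of (phi, reward, phi_next) feature transitions.
        theta = np.zeros(n)            # value-function weights
        A = np.zeros((n, n))           # accumulated sum of z (phi - gamma * phi_next)^T
        b = np.zeros(n)                # accumulated sum of z * reward
        mu = np.zeros(n)               # mu = b - A theta, maintained incrementally
        z = np.zeros(n)                # eligibility trace
        for phi, r, phi_next in transitions:
            z = gamma * lam * z + phi
            dA = np.outer(z, phi - gamma * phi_next)
            db = z * r
            A += dA
            b += db
            mu += db - dA @ theta      # keep mu consistent with the new A and b
            for _ in range(m):         # a few cheap descent steps instead of solving A theta = b
                i = int(np.argmax(np.abs(mu)))
                step = alpha * mu[i]
                theta[i] += step
                mu -= step * A[:, i]   # changing theta_i changes mu by -step * A e_i
        return theta

Each single-coordinate descent step costs O(n), and with sparse features the statistics updates can be made cheap as well, which is the source of the complexity advantage described above.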

2 0.69930053 153 nips-2006-Online Clustering of Moving Hyperplanes

Author: René Vidal

Abstract: We propose a recursive algorithm for clustering trajectories lying in multiple moving hyperplanes. Starting from a given or random initial condition, we use normalized gradient descent to update the coefficients of a time varying polynomial whose degree is the number of hyperplanes and whose derivatives at a trajectory give an estimate of the vector normal to the hyperplane containing that trajectory. As time proceeds, the estimates of the hyperplane normals are shown to track their true values in a stable fashion. The segmentation of the trajectories is then obtained by clustering their associated normal vectors. The final result is a simple recursive algorithm for segmenting a variable number of moving hyperplanes. We test our algorithm on the segmentation of dynamic scenes containing rigid motions and dynamic textures, e.g., a bird floating on water. Our method not only segments the bird motion from the surrounding water motion, but also determines patterns of motion in the scene (e.g., periodic motion) directly from the temporal evolution of the estimated polynomial coefficients. Our experiments also show that our method can deal with appearing and disappearing motions in the scene.
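
A small Python sketch of the flavor of this approach for the simplest case, two lines through the origin in the plane: fit the coefficients of a degree-two polynomial that vanishes on the trajectories by gradient steps projected back to the unit sphere, then read off the normal at a point from the polynomial's spatial gradient. The ground-truth normals, step size, and noise level are assumptions for the demo, and the update here is a simplified stand-in for the paper's normalized gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two ground-truth lines through the origin in R^2 (normals assumed for the demo).
    normals_true = np.array([[1.0, 1.0], [1.0, -2.0]])
    normals_true /= np.linalg.norm(normals_true, axis=1, keepdims=True)

    def sample_point():
        k = rng.integers(2)
        n = normals_true[k]
        d = np.array([-n[1], n[0]])            # direction lying on the line n^T x = 0
        return rng.normal() * d + 0.01 * rng.normal(size=2), k

    def veronese(x):                           # degree-2 Veronese map for two hyperplanes
        return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

    def grad_poly(c, x):                       # gradient of p_c(x) = c^T nu(x) w.r.t. x
        return np.array([2 * c[0] * x[0] + c[1] * x[1], c[1] * x[0] + 2 * c[2] * x[1]])

    c = rng.normal(size=3); c /= np.linalg.norm(c)
    mu = 0.05                                  # step size (assumed)
    for t in range(5000):
        x, _ = sample_point()
        v = veronese(x)
        err = c @ v
        c -= mu * 2.0 * err * v / (1.0 + v @ v)   # normalized gradient step (illustrative)
        c /= np.linalg.norm(c)                    # keep coefficients on the unit sphere

    # The normal of the hyperplane containing a new point is the polynomial's gradient there.
    x, k = sample_point()
    n_hat = grad_poly(c, x); n_hat /= np.linalg.norm(n_hat)
    print(k, n_hat, [abs(n_hat @ n) for n in normals_true])

Clustering the estimated normals (up to sign) then yields the segmentation of the trajectories.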

3 0.62735569 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis

Abstract: We propose a non-linear generative model for human motion data that uses an undirected model with binary latent variables and real-valued “visible” variables that represent joint angles. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. Such an architecture makes on-line inference efficient and allows us to use a simple approximate learning procedure. After training, the model finds a single set of parameters that simultaneously capture several different kinds of motion. We demonstrate the power of our approach by synthesizing various motion sequences and by performing on-line filling in of data lost during motion capture. Website: http://www.cs.toronto.edu/~gwtaylor/publications/nips2006mhmublv/

4 0.41096479 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo

Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature of Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes.

5 0.40648735 45 nips-2006-Blind Motion Deblurring Using Image Statistics

Author: Anat Levin

Abstract: We address the problem of blind motion deblurring from a single image, caused by a few moving objects. In such situations only part of the image may be blurred, and the scene consists of layers blurred in different degrees. Most existing blind deconvolution research concentrates on recovering a single blurring kernel for the entire image. However, in the case of different motions, the blur cannot be modeled with a single kernel, and trying to deconvolve the entire image with the same kernel will cause serious artifacts. Thus, the task of deblurring needs to involve segmentation of the image into regions with different blurs. Our approach relies on the observation that the statistics of derivative filters in images are significantly changed by blur. Assuming the blur results from a constant velocity motion, we can limit the search to one dimensional box filter blurs. This enables us to model the expected derivatives distributions as a function of the width of the blur kernel. Those distributions are surprisingly powerful in discriminating regions with different blurs. The approach produces convincing deconvolution results on real world images with rich texture.
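
A toy Python sketch of the underlying cue: derivative statistics shift with blur width, so a region's blur can be estimated by matching its derivative histogram against histograms predicted for candidate box-filter widths. The reference image, candidate widths, bin range (intensities assumed in [0, 1]), and the cross-entropy score are illustrative choices, not the paper's likelihood model or segmentation procedure.

    import numpy as np

    def box_blur_1d(img, w):
        # Horizontal box blur of width w (constant-velocity motion model).
        kernel = np.ones(w) / w
        return np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="same"), 1, img)

    def deriv_hist(img, bins):
        d = np.abs(np.diff(img, axis=1)).ravel()        # horizontal derivative magnitudes
        h, _ = np.histogram(d, bins=bins, density=True)
        return h + 1e-8                                 # avoid log(0)

    def estimate_blur_width(region, reference, widths=(1, 3, 5, 7, 9), nbins=50):
        # Pick the box-filter width whose predicted derivative distribution
        # (a blurred copy of a sharp reference) best explains the region.
        bins = np.linspace(0.0, 1.0, nbins + 1)
        obs = deriv_hist(region, bins)
        scores = [np.sum(obs * np.log(deriv_hist(box_blur_1d(reference, w), bins)))
                  for w in widths]
        return widths[int(np.argmax(scores))]

    rng = np.random.default_rng(0)
    sharp = rng.random((64, 128))                       # stand-in for a sharp image
    blurred = box_blur_1d(sharp, 5)                     # region blurred by an unknown width
    print(estimate_blur_width(blurred, sharp))          # expected to recover a width near 5

Applying such a width estimate per region is what turns the statistics into a segmentation cue.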

6 0.4024514 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations

7 0.39406851 174 nips-2006-Similarity by Composition

8 0.38022891 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

9 0.36481246 65 nips-2006-Denoising and Dimension Reduction in Feature Space

10 0.36228812 31 nips-2006-Analysis of Contour Motions

11 0.36193103 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

12 0.36048773 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

13 0.36021391 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

14 0.35914317 175 nips-2006-Simplifying Mixture Models through Function Approximation

15 0.35857734 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

16 0.35834864 79 nips-2006-Fast Iterative Kernel PCA

17 0.35818753 167 nips-2006-Recursive ICA

18 0.35694772 154 nips-2006-Optimal Change-Detection and Spiking Neurons

19 0.35694489 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments

20 0.35659733 117 nips-2006-Learning on Graph with Laplacian Regularization