nips nips2002 nips2002-3 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. [sent-6, score-2.484]
2 We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. [sent-7, score-2.274]
3 To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions. [sent-8, score-1.002]
4 1 Introduction In recent years, methods for reinforcement learning control based on approximating value functions have come under fire for their poor, or poorly-understood, convergence properties. [sent-9, score-0.246]
5 With tabular storage of state or state-action values, algorithms such as Real-Time Dynamic Programming, Q-Learning, and Sarsa [2, 13] are known to converge to optimal values. [sent-10, score-0.075]
6 Far fewer results exist for the case in which value functions are approximated using generalizing function approximators, such as state-aggregators, linear approximators, or neural networks. [sent-11, score-0.082]
7 (e.g., [15]), and there are a few positive convergence results, particularly for the case of linear approximators [16, 7, 8]. [sent-14, score-0.137]
8 However, simple examples demonstrate that many standard reinforcement learning algorithms, such as Q-Learning, Sarsa, and approximate policy iteration, can diverge or cycle without converging when combined with generalizing function approximators. [sent-15, score-0.964]
9 We focus on a more recent observation, which faults the discontinuity of the action selection strategies usually employed by reinforcement learning agents [5, 10]. [sent-19, score-0.262]
10 If an agent uses almost any kind of generalizing function approximator to estimate state-values or state-action values, the values that are learned depend on the visitation frequencies of different states or state-action pairs. [sent-20, score-0.358]
11 If the agent’s behavior is discontinuous in its value estimates, as is the case with greedy and ε-greedy behavior [14], then slight changes in value estimates may result in radical changes in the agent’s behavior. [sent-21, score-0.232]
12 One way to avoid this problem is to ensure that small changes in action values result in small changes in the agent’s behavior—that is, to make the agent’s policy a continuous function of its values. [sent-23, score-0.96]
13 De Farias and Van Roy [5] showed that a form of approximate value iteration which relies on linear value function approximations and softmax policy improvement is guaranteed to possess fixed points. [sent-24, score-1.159]
14 For partially-observable Markov decision processes, Perkins and Pendrith [10] showed that observation-action values that are fixed points under Q-Learning or Sarsa update rules are guaranteed to exist if the agent uses any continuous action selection strategy. [sent-25, score-0.65]
15 Both of these papers demonstrate that continuity of the agent’s action selection strategy leads to the existence of fixed points to which the algorithms can converge. [sent-26, score-0.239]
16 We show that if the policy improvement operator, analogous to the action selection strategy of an on-line agent, is ε-soft and Lipschitz continuous in the action values, with a constant that is not too large, then the sequence of policies generated is guaranteed to converge. [sent-30, score-1.37]
17 2 Markov Decision Processes and Value Functions We consider infinite-horizon discounted Markov decision problems [3]. [sent-32, score-0.08]
18 We assume that the Markov decision process has a finite state set, S, and a finite action set, A, with sizes |S| and |A|. [sent-33, score-0.241]
19 When the process is in state s and the agent chooses action a, the agent receives an immediate reward with expectation r(s, a), and the process transitions to a next state s′ with probability P(s′ | s, a). [sent-34, score-0.874]
20 The probability that the agent chooses action a when the process is in state s is denoted π(s, a); the policy is deterministic in state s if it places all of its probability on a single action. [sent-39, score-0.577]
21 For each time step t, let s_t, a_t, and r_t denote, respectively, the state of the process at time t, the action chosen by the agent at time t, and the reward received by the agent at time t. [sent-42, score-0.816]
22 For a policy π, the state-value function, V^π, and state-action value function (or just action-value function), Q^π, are defined as expected discounted sums of rewards, where the expectation is with respect to the stochasticity of the process and the fact that the agent chooses actions according to π, and γ ∈ [0, 1) is the discount factor. [sent-43, score-1.052]
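Because the extraction dropped the paper's own formulas, here is a sketch of the standard definitions the preceding sentence refers to; the notation (V, Q, π, γ, s_t, r_t) is an editorial assumption consistent with the surrounding text, not a quotation of the paper.

```latex
% Standard discounted value-function definitions; the symbols used here
% (V, Q, \pi, \gamma, s_t, a_t, r_t) are an editorial assumption.
V^{\pi}(s) \;=\; E_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s\right],
\qquad
Q^{\pi}(s,a) \;=\; E_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s,\; a_{0}=a\right].
```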
23 It is well-known [11] that there exists at least one deterministic, optimal policy whose value is at least as high as that of every other policy in every state. [sent-44, score-0.719]
24 Note that a policy π can be viewed as an element of ℝ^{|S||A|}, and the set of policies can be viewed as a compact subset of that space. [sent-47, score-0.089]
25 We make the following assumption: Assumption 1 Under any policy π, the Markov decision process behaves as an irreducible, aperiodic Markov chain over the state set S. [sent-48, score-0.959]
26 ———————————————————————————————————— Inputs: an initial policy and a policy improvement operator. [sent-50, score-1.59]
27 do Policy evaluation: Sarsa updates under the current policy; initialize the weights arbitrarily. [sent-57, score-0.707]
28 With the environment in state s: choose action a according to the current policy. [sent-58, score-0.077]
29 end for ———————————————————————————————————— Figure 1: The version of approximate policy iteration that we study. [sent-62, score-0.927]
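To make the structure of Figure 1 concrete, here is a minimal sketch of this kind of loop: Sarsa policy evaluation with a linear approximator, followed by a policy improvement operator. The environment interface (env.reset, env.step), the policy object with a sample method, the step size, and the number of evaluation steps are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def approximate_policy_iteration(env, phi, improve, n_iters=50,
                                 n_eval_steps=10_000, alpha=0.01, gamma=0.95):
    """Sketch of Figure 1: alternate Sarsa evaluation and policy improvement."""
    n_features = phi(0, 0).shape[0]
    policy = improve(np.zeros(n_features), phi)      # policy from all-zero action values
    for _ in range(n_iters):
        # Policy evaluation: on-line Sarsa updates under the current, fixed policy.
        w = np.zeros(n_features)                     # initialize the weights arbitrarily
        s = env.reset()
        a = policy.sample(s)
        for _ in range(n_eval_steps):
            s_next, r = env.step(s, a)
            a_next = policy.sample(s_next)
            td_error = r + gamma * phi(s_next, a_next) @ w - phi(s, a) @ w
            w += alpha * td_error * phi(s, a)
            s, a = s_next, a_next
        # Policy improvement: map the learned action values to a new stochastic policy.
        policy = improve(w, phi)
    return policy, w
```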
30 The approximate policy iteration algorithm we propose learns linear approximations to the action value functions of policies. [sent-63, score-1.085]
31 For a weight vector w, the approximate action value of a state-action pair is the inner product of that pair’s feature vector with w, i.e., φ(s, a)ᵀw, where ᵀ denotes the transpose. [sent-66, score-0.139]
32 Letting Φ be the matrix whose rows are the feature vectors of the state-action pairs (one row per pair), the entire approximate action-value function given by weights w is represented by the vector Φw. [sent-67, score-0.164]
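A tiny numerical illustration of this representation follows; the sizes and the random features are assumptions chosen only to make the row indexing concrete.

```python
import numpy as np

n_states, n_actions, n_features = 4, 2, 3
rng = np.random.default_rng(0)
Phi = rng.normal(size=(n_states * n_actions, n_features))  # one feature row per (s, a) pair
w = rng.normal(size=n_features)                            # weight vector

def q_hat(s, a):
    """Approximate action value of (s, a): that pair's feature row dotted with w."""
    return Phi[s * n_actions + a] @ w

q_all = Phi @ w   # the entire approximate action-value function as a vector
```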
33 3 Approximate Policy Iteration The standard, exact policy iteration algorithm [3] starts with an arbitrary policy and alternates between two steps: policy evaluation, in which the value function of the current policy is computed, and policy improvement, in which a new policy is computed. [sent-69, score-2.182]
34 The new policy is taken to be a greedy, deterministic policy with respect to the value function of the previous policy. [sent-71, score-0.699]
35 Iteration terminates when the policy no longer changes; it is well-known that the sequence of policies generated is monotonically improving, and that the algorithm terminates after a finite number of iterations [3]. [sent-74, score-0.185]
36 Bertsekas and Tsitsiklis [4] describe several versions of approximate policy iteration in which the policy evaluation step is not exact. [sent-77, score-1.692]
37 Instead, the value function is approximated by a weighted linear combination of state features, with weights determined by Monte Carlo or TD(λ) learning rules. [sent-78, score-0.077]
38 However, they assume that the policy improvement step is the same as in the standard policy iteration algorithm—the next policy is greedy with respect to the (approximate) action values of the previous policy. [sent-79, score-2.497]
39 Bertsekas and Tsitsiklis show that if the approximation error in the evaluation step is low, then such algorithms generate solutions that are near optimal [4]. [sent-80, score-0.137]
40 However, they also demonstrate by example that the sequence of policies generated does not converge for some problems, and that poor performance can result when the approximation error is high. [sent-81, score-0.194]
41 We study the version of approximate policy iteration shown in Figure 1. [sent-82, score-0.927]
42 Like the versions studied by Bertsekas and Tsitsiklis, we assume that policy evaluation is not performed exactly. [sent-83, score-0.74]
43 In particular, we assume that Sarsa updating is used to learn the weights of a linear approximation to the action-value function. [sent-84, score-0.093]
44 We use action-value functions instead of state-value functions so that the algorithm can be performed based on interactive experience with the environment, without knowledge of the state transition probabilities. [sent-85, score-0.081]
45 The weights learned in the policy evaluation step converge under conditions specified by Tsitsiklis and Van Roy [17], one of which is Assumption 2. [sent-86, score-0.839]
46 The key difference from previous work is that we assume a generic policy improvement operator, which maps every action-value function to a stochastic policy. [sent-87, score-0.824]
47 This operator may produce, for example, greedy policies, ε-greedy policies, or policies with action selection probabilities based on the softmax function [14]. [sent-88, score-0.46]
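A sketch of one policy improvement operator of the kind listed above: a softmax (Boltzmann) map from action values to action probabilities. The temperature parameter tau is an assumption; raising it smooths the map, which is the informal sense in which the operator's Lipschitz constant can be kept small.

```python
import numpy as np

def softmax_policy(q_row, tau=1.0):
    """Map one state's vector of action values to action probabilities."""
    z = (q_row - np.max(q_row)) / tau   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def improvement_operator(q_table, tau=1.0):
    """Apply the softmax map state by state to a |S|-by-|A| table of action values."""
    return np.vstack([softmax_policy(row, tau) for row in q_table])
```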
48 The operator is Lipschitz continuous with constant L if, for any two action-value functions, the Euclidean distance between the policies it produces is at most L times the Euclidean distance between the action-value functions. [sent-89, score-0.106]
49 The fact that we allow for a policy improvement step that is not strictly greedy enables us to establish the following theorem. [sent-91, score-0.885]
50 In other words, if the behavior of the agent does not change too greatly in response to changes in its action value estimates, then convergence is guaranteed. [sent-93, score-0.578]
51 The strength of the theorem is that it states a simple condition under which a form of model-free reinforcement learning control based on approximating value functions converges for a general class of problems. [sent-96, score-0.265]
52 The values of the Lipschitz constant (and hence, the range of policy improvement operators) which ensure convergence depend on properties of the decision process, such as its transition probabilities and rewards, which we assume to be unknown. [sent-98, score-0.984]
53 The theorem also offers no guarantee on the quality of the policy to which the algorithm converges. [sent-99, score-0.725]
54 Intuitively, if the policy improvement operator is Lipschitz continuous with a small constant, then the agent is limited in the extent to which it can optimize its behavior. [sent-100, score-1.313]
55 For example, even if an agent correctly learns that the value of one action is much higher than the value of another, a small constant limits the frequency with which the agent can choose the better action over the worse one, and this may limit performance. [sent-101, score-0.906]
56 1 Probabilities Related to State-Action Pairs Because the approximate policy iteration algorithm in Figure 1 approximates action-values, our analysis relies extensively on certain probabilities that are associated with state-action pairs. [sent-104, score-0.927]
57 First, we define P_π to be the matrix, with one row and one column per state-action pair, whose entries are the probabilities that one state-action pair follows another when the agent behaves according to π. [sent-105, score-0.319]
58 P_π can be viewed as the stochastic transition matrix of a Markov chain over state-action pairs. [sent-107, score-0.163]
59 Under Assumption 1, fixing a policy π induces an irreducible, aperiodic Markov chain over the state set. [sent-124, score-0.147]
60 If the policy assigns positive probability to every action in every state, then all elements of the associated distribution over state-action pairs are positive, and it is easily verified that it is the unique stationary distribution of the irreducible, aperiodic Markov chain over state-action pairs with transition matrix P_π. [sent-128, score-0.322]
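A sketch of the stationary distribution the analysis refers to, computed here from an explicit transition matrix over state-action pairs. In the paper's setting the matrix is unknown; this is only meant to make the object concrete, and it assumes an irreducible, aperiodic chain as in Assumption 1.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to one."""
    evals, evecs = np.linalg.eig(P.T)          # right eigenvectors of P.T = left eigenvectors of P
    idx = np.argmin(np.abs(evals - 1.0))       # the unique eigenvalue equal to one
    d = np.real(evecs[:, idx])
    return d / d.sum()
```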
61 Proof: For any policy, consider the largest eigenvalue of the corresponding transition matrix whose modulus is strictly less than 1. [sent-130, score-0.113]
62 This quantity is well-defined, since the transition matrix of any irreducible, aperiodic Markov chain has precisely one eigenvalue equal to one [11]. [sent-131, score-0.237]
63 Since the eigenvalues of a matrix are continuous in the elements of the matrix [9], and since the set of policies is compact, the modulus of this eigenvalue is bounded away from one uniformly over policies. [sent-132, score-0.175]
64 Seneta [12] showed that for any two irreducible, aperiodic Markov chains on a state set with finitely many elements, with transition matrices and stationary distributions as above, the distance between the stationary distributions is bounded by a constant, involving the largest eigenvalue of modulus strictly less than one, times the distance between the transition matrices. [sent-133, score-0.418]
65 2 The Weights Learned in the Policy Evaluation Step Consider the approximate policy evaluation step of the algorithm in Figure 1. [sent-139, score-0.866]
66 Suppose that the agent follows policy π and uses Sarsa updates to learn weights w, and suppose that π(s, a) > 0 for all s and a. [sent-140, score-1.04]
67 Then P_π is the stochastic transition matrix of an irreducible, aperiodic Markov chain over state-action pairs, and the associated diagonal matrix has the unique stationary distribution of that chain on its diagonal. [sent-141, score-0.372]
68 In essence, this equation says that the “expected update” to the weights under the stationary distribution is zero. [sent-143, score-0.078]
69 Tsitsiklis and Van Roy [17] show that the relevant matrix is invertible; hence we can solve for the unique weights which satisfy Equation 1. [sent-145, score-0.084]
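A sketch of the unique fixed-point weights satisfying the "expected update equals zero" condition described above, written in the standard form used for linear TD-style evaluation (cf. Tsitsiklis and Van Roy). All inputs (feature matrix Phi, state-action transition matrix P, expected rewards r, stationary distribution d, discount gamma) are assumed known here purely for illustration; the algorithm itself learns the weights incrementally from experience.

```python
import numpy as np

def sarsa_fixed_point(Phi, P, r, d, gamma):
    """Closed-form weights at which the expected Sarsa update is zero."""
    D = np.diag(d)                              # stationary distribution on the diagonal
    A = Phi.T @ D @ (Phi - gamma * P @ Phi)     # invertible under the paper's conditions
    b = Phi.T @ D @ r
    return np.linalg.solve(A, b)
```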
70 Proof: By Lemmas 1 and 2, and by the continuity of matrix inverses [11], the solution weights are a continuous function of the policy. [sent-150, score-0.148]
71 Because the set of policies is a compact subset of Euclidean space, and because continuous functions map compact sets to compact sets, the existence of the bound follows. [sent-152, score-0.234]
72 Proof: Lemma 7, in the Appendix, shows that the relevant mapping is continuous. [sent-156, score-0.084]
73 3 Contraction Argument Proof of Theorem 1: Consider a given infinite-horizon discounted Markov decision problem, and consider any two policies. [sent-166, score-0.113]
74 Suppose that the policy improvement operator is Lipschitz continuous with a constant that is yet to be determined. [sent-167, score-0.106]
75 The policies that result from the two fixed policies after one iteration of the approximate policy iteration algorithm of Figure 1 are their images under this mapping. [sent-169, score-1.18]
76 Each iteration of the approximate policy iteration algorithm is then a contraction. [sent-172, score-1.075]
77 By the Contraction Mapping Theorem [3], there is a unique fixed point of this mapping, and the sequence of policies generated according to the mapping from any initial policy converges to that fixed point. [sent-173, score-0.904]
78 5 Conclusions and Future Work We described a model-free, approximate version of policy iteration for infinite-horizon discounted Markov decision problems. [sent-175, score-1.007]
79 In this algorithm, the policy evaluation step of classical policy iteration is replaced by learning a linear approximation to the action-value function using on-line Sarsa updating. [sent-176, score-1.616]
80 The policy improvement step is given by an arbitrary policy improvement operator, which maps any possible action-value function to a new policy. [sent-177, score-1.621]
81 We are hopeful that similar ideas can be used to establish the convergence of other reinforcement learning algorithms, such as on-line Sarsa or Sarsa(λ) control with linear function approximation. [sent-179, score-0.194]
82 The magnitude of the constant that ensures convergence depends on the model of the environment and on properties of the feature representation. [sent-180, score-0.135]
83 If the model is not known, then choosing a policy improvement operator that guarantees convergence is not immediate. [sent-181, score-0.987]
84 To be safe, an operator for which the constant is small should be chosen. [sent-182, score-0.114]
85 However, one generally prefers the constant to be large, so that the agent can exploit its knowledge by choosing actions with higher estimated action-values as frequently as possible. [sent-183, score-0.295]
86 One approach to determining a proper value of the constant would be to make an initial guess and begin the approximate policy iteration procedure. [sent-184, score-0.954]
87 If the contraction property fails on any iteration, one should choose a new policy improvement operator that is Lipschitz continuous with a smaller constant. [sent-185, score-1.054]
88 A potential advantage of this approach is that one can begin with a high choice of the constant, which allows exploitation of action value differences, and switch to lower values only as necessary, as sketched below. [sent-186, score-0.158]
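A sketch of the adaptive scheme suggested above: begin with a relatively "sharp" softmax improvement operator (small temperature, large effective Lipschitz constant) and smooth it whenever successive policies move apart instead of together. The contraction test and the temperature schedule are illustrative assumptions, not part of the paper's algorithm.

```python
import numpy as np

def maybe_smooth_operator(policy_prev, policy_curr, policy_next, tau, growth=2.0):
    """Return a larger softmax temperature if the policy sequence fails to contract."""
    gap_before = np.linalg.norm(policy_curr - policy_prev)
    gap_after = np.linalg.norm(policy_next - policy_curr)
    if gap_after > gap_before:      # the contraction property appears to fail
        return tau * growth         # smoother softmax, smaller effective Lipschitz constant
    return tau
```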
89 It is possible that convergence could be obtained with much higher values of the constant than are suggested by the bound in the proof of Theorem 1. [sent-187, score-0.126]
90 Discontinuous improvement operators/action selection strategies can lead to nonconvergent behavior for many reinforcement learning algorithms, including Q-Learning, Sarsa, and forms of approximate policy iteration and approximate value iteration. [sent-188, score-1.333]
91 For some of these algorithms, (non-unique) fixed points have been shown to exist when the action selection strategy/improvement operator is continuous [5, 10]. [sent-189, score-0.392]
92 Whether or not convergence also follows remains to be seen. [sent-190, score-0.075]
93 For the algorithm studied in this paper, we have constructed an example demonstrating non-convergence with improvement operators that are Lipschitz continuous but with too large a constant. [sent-191, score-0.226]
94 One direction for future work is determining minimal restrictions on action selection (if any) that ensure the convergence of other reinforcement learning algorithms. [sent-193, score-0.358]
95 Ensuring convergence answers one standing objection to reinforcement learning control methods based on approximating value functions. [sent-194, score-0.246]
96 However, an important open issue for our approach, and for other approaches advocating continuous action selection [5, 10], is to characterize the solutions that they produce. [sent-195, score-0.282]
97 We know of no theoretical guarantees on the quality of solutions found, and there is little experimental work comparing algorithms that use continuous action selection with those that do not. [sent-196, score-0.282]
98 On the existence of fixed points for approximate value iteration and temporal-difference learning. [sent-230, score-0.303]
99 Sensitivity analysis, ergodicity coefficients, and rank-one updates for finite Markov chains. [sent-270, score-0.112]
100 Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. [sent-299, score-0.108]
wordName wordTfidf (topN-words)
[('policy', 0.678), ('sarsa', 0.31), ('agent', 0.295), ('lipschitz', 0.154), ('iteration', 0.148), ('tsitsiklis', 0.135), ('action', 0.131), ('improvement', 0.12), ('irreducible', 0.116), ('pr', 0.116), ('operator', 0.114), ('policies', 0.105), ('aperiodic', 0.101), ('approximate', 0.101), ('perkins', 0.097), ('reinforcement', 0.089), ('continuous', 0.084), ('markov', 0.083), ('convergence', 0.075), ('lemma', 0.068), ('ah', 0.064), ('evaluation', 0.062), ('approximators', 0.062), ('bertsekas', 0.062), ('contraction', 0.058), ('roy', 0.057), ('proof', 0.051), ('gx', 0.051), ('decision', 0.048), ('lemmas', 0.047), ('theorem', 0.047), ('converges', 0.047), ('van', 0.047), ('chain', 0.046), ('unique', 0.046), ('transition', 0.042), ('selection', 0.042), ('compact', 0.041), ('exists', 0.041), ('stationary', 0.04), ('state', 0.039), ('greedy', 0.039), ('discontinuous', 0.039), ('doina', 0.039), ('pur', 0.039), ('puu', 0.039), ('theodore', 0.039), ('continuity', 0.039), ('environment', 0.038), ('weights', 0.038), ('converge', 0.036), ('generalizing', 0.034), ('td', 0.034), ('amherst', 0.034), ('bap', 0.034), ('modulus', 0.034), ('precup', 0.034), ('let', 0.033), ('discounted', 0.032), ('farias', 0.031), ('control', 0.03), ('updating', 0.03), ('updates', 0.029), ('chooses', 0.029), ('approximator', 0.029), ('athena', 0.029), ('softmax', 0.029), ('guaranteed', 0.029), ('sequence', 0.028), ('behavior', 0.027), ('value', 0.027), ('pu', 0.027), ('existence', 0.027), ('stochastic', 0.026), ('terminates', 0.026), ('step', 0.025), ('approximation', 0.025), ('solutions', 0.025), ('approximating', 0.025), ('matrix', 0.025), ('xed', 0.024), ('viewed', 0.024), ('assumption', 0.024), ('behaves', 0.024), ('processes', 0.023), ('eigenvalue', 0.023), ('strictly', 0.023), ('process', 0.023), ('changes', 0.023), ('dynamic', 0.022), ('pairs', 0.022), ('operators', 0.022), ('constant', 0.022), ('ay', 0.021), ('claim', 0.021), ('rewards', 0.021), ('exist', 0.021), ('deterministic', 0.021), ('programming', 0.021), ('ensure', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
2 0.3081634 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
3 0.28450841 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
Author: Benjamin V. Roy, Daniela D. Farias
Abstract: This paper extends our earlier analysis on approximate linear programming as an approach to approximating the cost-to-go function in a discounted-cost dynamic program [6]. In this paper, we consider the average-cost criterion and a version of approximate linear programming that generates approximations to the optimal average cost and differential cost function. We demonstrate that a naive version of approximate linear programming prioritizes approximation of the optimal average cost and that this may not be well-aligned with the objective of deriving a policy with low average cost. For that, the algorithm should aim at producing a good approximation of the differential cost function. We propose a twophase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights. Performance bounds suggest that the new algorithm is compatible with the objective of optimizing performance and provide guidance on appropriate choices for state-relevance weights.
4 0.27720809 20 nips-2002-Adaptive Caching by Refetching
Author: Robert B. Gramacy, Manfred K. Warmuth, Scott A. Brandt, Ismail Ari
Abstract: We are constructing caching policies that have 13-20% lower miss rates than the best of twelve baseline policies over a large variety of request streams. This represents an improvement of 49–63% over Least Recently Used, the most commonly implemented policy. We achieve this not by designing a specific new policy but by using on-line Machine Learning algorithms to dynamically shift between the standard policies based on their observed miss rates. A thorough experimental evaluation of our techniques is given, as well as a discussion of what makes caching an interesting on-line learning problem.
5 0.25603822 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
Author: Ralf Schoknecht
Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1
6 0.24083254 134 nips-2002-Learning to Take Concurrent Actions
7 0.23582631 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
8 0.22404993 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
9 0.2215573 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
10 0.20852098 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
11 0.1997003 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
12 0.15085787 78 nips-2002-Efficient Learning Equilibrium
13 0.13431652 205 nips-2002-Value-Directed Compression of POMDPs
14 0.12534407 165 nips-2002-Ranking with Large Margin Principle: Two Approaches
15 0.1126721 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
16 0.10528453 185 nips-2002-Speeding up the Parti-Game Algorithm
17 0.080748908 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
18 0.061398655 199 nips-2002-Timing and Partial Observability in the Dopamine System
19 0.060521401 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
20 0.053661525 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
topicId topicWeight
[(0, -0.186), (1, -0.05), (2, -0.552), (3, -0.187), (4, 0.021), (5, -0.206), (6, 0.128), (7, -0.114), (8, 0.027), (9, 0.011), (10, 0.004), (11, -0.055), (12, -0.055), (13, 0.086), (14, 0.035), (15, -0.015), (16, 0.003), (17, -0.087), (18, 0.102), (19, 0.03), (20, -0.079), (21, -0.031), (22, -0.11), (23, -0.05), (24, -0.099), (25, 0.004), (26, 0.015), (27, 0.082), (28, -0.049), (29, -0.0), (30, 0.006), (31, -0.066), (32, 0.042), (33, -0.043), (34, 0.019), (35, 0.021), (36, -0.041), (37, -0.027), (38, -0.032), (39, 0.003), (40, -0.029), (41, 0.014), (42, -0.04), (43, -0.039), (44, -0.003), (45, 0.021), (46, -0.01), (47, -0.005), (48, 0.023), (49, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.97612941 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
2 0.87770683 20 nips-2002-Adaptive Caching by Refetching
Author: Robert B. Gramacy, Manfred K. Warmuth, Scott A. Brandt, Ismail Ari
Abstract: We are constructing caching policies that have 13-20% lower miss rates than the best of twelve baseline policies over a large variety of request streams. This represents an improvement of 49–63% over Least Recently Used, the most commonly implemented policy. We achieve this not by designing a specific new policy but by using on-line Machine Learning algorithms to dynamically shift between the standard policies based on their observed miss rates. A thorough experimental evaluation of our techniques is given, as well as a discussion of what makes caching an interesting on-line learning problem.
3 0.85157239 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
Author: Benjamin V. Roy, Daniela D. Farias
Abstract: This paper extends our earlier analysis on approximate linear programming as an approach to approximating the cost-to-go function in a discounted-cost dynamic program [6]. In this paper, we consider the average-cost criterion and a version of approximate linear programming that generates approximations to the optimal average cost and differential cost function. We demonstrate that a naive version of approximate linear programming prioritizes approximation of the optimal average cost and that this may not be well-aligned with the objective of deriving a policy with low average cost. For that, the algorithm should aim at producing a good approximation of the differential cost function. We propose a twophase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights. Performance bounds suggest that the new algorithm is compatible with the objective of optimizing performance and provide guidance on appropriate choices for state-relevance weights.
4 0.79976124 134 nips-2002-Learning to Take Concurrent Actions
Author: Khashayar Rohanimanesh, Sridhar Mahadevan
Abstract: We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes – all, any and continue – and theoretically and experimentally compare them. 1
5 0.76362789 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
Author: Eric Allender, Sanjeev Arora, Michael Kearns, Cristopher Moore, Alexander Russell
Abstract: We establish a new hardness result that shows that the difficulty of planning in factored Markov decision processes is representational rather than just computational. More precisely, we give a fixed family of factored MDPs with linear rewards whose optimal policies and value functions simply cannot be represented succinctly in any standard parametric form. Previous hardness results indicated that computing good policies from the MDP parameters was difficult, but left open the possibility of succinct function approximation for any fixed factored MDP. Our result applies even to policies which yield a polynomially poor approximation to the optimal value, and highlights interesting connections with the complexity class of Arthur-Merlin games.
6 0.75699234 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
7 0.68281442 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
8 0.53472281 205 nips-2002-Value-Directed Compression of POMDPs
9 0.52217376 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
10 0.5186336 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
11 0.47776625 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
12 0.46128625 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
13 0.42407599 185 nips-2002-Speeding up the Parti-Game Algorithm
14 0.38021198 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
15 0.35923192 78 nips-2002-Efficient Learning Equilibrium
16 0.31557375 165 nips-2002-Ranking with Large Margin Principle: Two Approaches
17 0.28315252 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
18 0.27364469 199 nips-2002-Timing and Partial Observability in the Dopamine System
19 0.19645317 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
20 0.19386987 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
topicId topicWeight
[(11, 0.024), (23, 0.037), (24, 0.241), (42, 0.093), (54, 0.139), (55, 0.03), (58, 0.01), (67, 0.028), (68, 0.022), (74, 0.112), (92, 0.048), (98, 0.112)]
simIndex simValue paperId paperTitle
1 0.89947402 4 nips-2002-A Differential Semantics for Jointree Algorithms
Author: James D. Park, Adnan Darwiche
Abstract: A new approach to inference in belief networks has been recently proposed, which is based on an algebraic representation of belief networks using multi-linear functions. According to this approach, the key computational question is that of representing multi-linear functions compactly, since inference reduces to a simple process of evaluating and differentiating such functions. We show here that mainstream inference algorithms based on jointrees are a special case of this approach in a very precise sense. We use this result to prove new properties of jointree algorithms, and then discuss some of its practical and theoretical implications. 1
same-paper 2 0.810911 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
3 0.78486133 184 nips-2002-Spectro-Temporal Receptive Fields of Subthreshold Responses in Auditory Cortex
Author: Christian K. Machens, Michael Wehr, Anthony M. Zador
Abstract: How do cortical neurons represent the acoustic environment? This question is often addressed by probing with simple stimuli such as clicks or tone pips. Such stimuli have the advantage of yielding easily interpreted answers, but have the disadvantage that they may fail to uncover complex or higher-order neuronal response properties. Here we adopt an alternative approach, probing neuronal responses with complex acoustic stimuli, including animal vocalizations and music. We have used in vivo whole cell methods in the rat auditory cortex to record subthreshold membrane potential fluctuations elicited by these stimuli. Whole cell recording reveals the total synaptic input to a neuron from all the other neurons in the circuit, instead of just its output—a sparse binary spike train—as in conventional single unit physiological recordings. Whole cell recording thus provides a much richer source of information about the neuron’s response. Many neurons responded robustly and reliably to the complex stimuli in our ensemble. Here we analyze the linear component—the spectrotemporal receptive field (STRF)—of the transformation from the sound (as represented by its time-varying spectrogram) to the neuron’s membrane potential. We find that the STRF has a rich dynamical structure, including excitatory regions positioned in general accord with the prediction of the simple tuning curve. We also find that in many cases, much of the neuron’s response, although deterministically related to the stimulus, cannot be predicted by the linear component, indicating the presence of as-yet-uncharacterized nonlinear response properties.
4 0.6966306 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard Schölkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
5 0.69428837 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
Author: Max Welling, Simon Osindero, Geoffrey E. Hinton
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.
6 0.68708766 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
7 0.68571419 124 nips-2002-Learning Graphical Models with Mercer Kernels
8 0.68339318 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
9 0.68278193 53 nips-2002-Clustering with the Fisher Score
10 0.68126476 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
11 0.68099397 169 nips-2002-Real-Time Particle Filters
12 0.68009055 74 nips-2002-Dynamic Structure Super-Resolution
13 0.67913198 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
14 0.67801762 27 nips-2002-An Impossibility Theorem for Clustering
15 0.67740542 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
16 0.67718488 10 nips-2002-A Model for Learning Variance Components of Natural Images
17 0.67658114 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
18 0.67618883 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
19 0.67584902 46 nips-2002-Boosting Density Estimation
20 0.6751616 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition