
12 nips-2009-A Generalized Natural Actor-Critic Algorithm

Source: pdf

Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya

Abstract: Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because they take a metric suited to the task into account. There are two candidate metrics, Kakade’s Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura’s FIM for the state-action joint distribution, but all RL algorithms with NG have followed Kakade’s approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for the gNG learning based on a theory of the estimating function, the generalized Natural Actor-Critic (gNAC) algorithm. The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm can estimate gNG efficiently and outperformed the NAC algorithm.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. [sent-4, score-0.134]

2 There are two candidate metrics, Kakade’s Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura’s FIM for the state-action joint distribution, but all RL algorithms with NG have followed Kakade’s approach. [sent-6, score-0.286]

3 In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for the gNG learning based on a theory of the estimating function, the generalized Natural Actor-Critic (gNAC) algorithm. [sent-7, score-0.168]

4 The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. [sent-8, score-0.149]

5 Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. [sent-9, score-0.108]

6 Numerical experiments showed that the proposed gNAC algorithm can estimate gNG efficiently and outperformed the NAC algorithm. [sent-10, score-0.07]

7 1 Introduction. Policy Gradient Reinforcement Learning (PGRL) attempts to find a policy that maximizes the average (or time-discounted) reward, based on gradient ascent in the policy parameter space [2, 3, 4]. [sent-11, score-0.662]

8 Since it is possible to handle the parameters controlling the randomness of the policy, the PGRL, rather than the value-based RL, can find the appropriate stochastic policy and has succeeded in several practical applications [5, 6, 7]. [sent-12, score-0.289]

9 In this paper, we propose a new PGRL algorithm, a generalized Natural Actor-Critic (gNAC) algorithm, based on the natural gradient [9]. [sent-14, score-0.149]

10 Because “natural gradient” learning is the steepest gradient method in a Riemannian space and the direction of the natural gradient is defined on that metric, the design of the Riemannian metric is an important issue. [sent-15, score-0.215]

11 In the framework of PGRL, the stochastic policies are represented as parametric probability distributions. [sent-16, score-0.045]

12 Thus the Fisher Information Matrices (FIMs) with respect to the policy parameter induce appropriate Riemannian metrics. [sent-17, score-0.284]

13 Kakade [8] used an average FIM for the policy over the states and proposed a natural policy gradient (NPG) learning. [sent-18, score-0.683]

14 These are based on the actor-critic framework, called the natural actor-critic (NAC) [1]. [sent-20, score-0.059]

15 This natural gradient is on the FIM of the state-action joint distribution as the Riemannian metric for RL, which is directly associated with the average rewards as the objective function. [sent-22, score-0.208]

16 [12] showed that the metric of the NSG corresponds with the changes in the stationary state-action joint distribution. [sent-24, score-0.107]

17 In contrast, the metric of the NPG takes into account only changes in the action distribution and ignores changes in the state distribution, which also depends on the policy in general. [sent-25, score-0.373]

18 However, no algorithm for estimating the NSG has been proposed, probably because the estimation of the derivative of the log stationary state distribution was difficult [13]. [sent-27, score-0.219]

19 Accordingly, we created a linear interpolation of both of the FIMs as a generalized Natural Gradient (gNG) and derived an efficient approach to estimate the gNG by applying the theory of the estimating function for stochastic models [14] in Section 3. [sent-30, score-0.219]

20 To validate the performance of the proposed algorithm, numerical experiments are shown in Section 5, where the proposed algorithm can estimate the gNG efficiently and outperformed the NAC algorithm [1]. [sent-32, score-0.106]

21 2 Background of Policy Gradient and Natural Gradient for RL. We briefly review the policy gradient and natural gradient learning as gradient ascent methods for RL and also present the motivation of the gNAC approach. [sent-33, score-0.549]

22 Also, p : S × A × S → [0, 1] is a state transition probability function of a state s, an action a, and the following state $s_{+1}$, i. [sent-37, score-0.157]

23 R : S × A × S → R is a bounded reward function of s, a, and $s_{+1}$, which defines an immediate reward $r = R(s, a, s_{+1})$ observed by a learning agent at each time step. [sent-40, score-0.232]

24 The action probability function $\pi : A \times S \times R^d \to [0, 1]$ uses a, s, and a policy parameter $\theta \in R^d$ to define the decision-making rule of the learning agent, which is also called a policy, i. [sent-41, score-0.324]

25 The policy is normally parameterized by users and is controlled by tuning θ. [sent-44, score-0.269]

26 Assumption 1 The policy is always differentiable with respect to θ and is non-redundant for the task, i. [sent-46, score-0.269]

27 Under Assumption 2, there exists a unique stationary state distribution $d_\theta(s) \triangleq \Pr(s \mid M(\theta))$, which is equal to the limiting distribution and independent of the initial state, $d_\theta(s') = \lim_{t\to\infty} \Pr(S_{+t} = s' \mid S = s, M(\theta))$, $\forall s \in S$. [sent-51, score-0.102]

28 The goal of PGRL is to find the policy parameter $\theta^*$ that maximizes the average of the immediate rewards, the average reward, $\eta(\theta) \triangleq E_\theta[r] = \sum_{s \in S} \sum_{a \in A} \sum_{s_{+1} \in S} d_\theta(s)\,\pi(a|s;\theta)\,p(s_{+1}|s,a)\,R(s,a,s_{+1})$, (1) where $E_\theta[a]$ denotes the expectation of a on the Markov chain M(θ). [sent-53, score-0.331]

29 The derivative of the average reward for (1) with respect to the policy parameter, ∇_θ η(θ) ≜ [∂η(θ)/∂θ_1, . [sent-54, score-0.384]

30 Therefore, the average reward η(θ) will be increased by updating the policy parameter as $\theta := \theta + \alpha \nabla_\theta \eta(\theta)$, where := denotes the right-to-left substitution and α is a sufficiently small learning rate. [sent-60, score-0.399]
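Equation (1) can be evaluated directly when the model is known. The following is a minimal sketch, assuming a tabular MDP stored as NumPy arrays; the function names, the array layout, and the toy 2-state, 2-action MDP are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Solve d^T P_pi = d^T with sum(d) = 1 for an ergodic chain."""
    n = P_pi.shape[0]
    # Balance equations (P^T - I) d = 0 stacked with the normalization 1^T d = 1.
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def average_reward(pi, p, R):
    """eta = sum_{s,a,s'} d(s) pi(a|s) p(s'|s,a) R(s,a,s').

    pi: (S, A) array, pi[s, a] = pi(a|s)
    p:  (S, A, S) array, p[s, a, t] = p(t|s, a)
    R:  (S, A, S) array of immediate rewards
    """
    P_pi = np.einsum("sa,sat->st", pi, p)  # state-to-state matrix under pi
    d = stationary_distribution(P_pi)
    eta = float(np.einsum("s,sa,sat,sat->", d, pi, p, R))
    return eta, d

# Hypothetical toy MDP: reward 1 whenever the next state is state 1.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
eta, d = average_reward(pi, p, R)
```

In this toy case the average reward equals the stationary probability of state 1, which gives a quick sanity check on the implementation.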

31 Accordingly, the updating direction of the policy parameter by the ordinary gradient method will be different from the steepest directions on these manifolds. [sent-63, score-0.399]

32 Therefore, the optimization process sometimes falls into a stagnant state, commonly called a plateau [8, 12]. [sent-64, score-0.081]

33 2 Natural Gradients for PGRL To avoid the plateau problem, the concept of the natural gradient was proposed by Amari [9], which is a gradient method on a Riemannian space. [sent-66, score-0.225]

34 Under the constraint $\|\Delta\theta\|_G^2 = \varepsilon^2$ for a sufficiently small constant ε, the steepest ascent direction of the function η(θ) on the manifold G(θ) is given by $\widetilde{\nabla}_{G(\theta)}\eta(\theta) = G(\theta)^{-1}\nabla_\theta\eta(\theta)$, which is called the natural gradient (NG). [sent-68, score-0.198]
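The natural gradient direction above is a linear solve against the metric. A minimal sketch, assuming the metric G(θ) and the ordinary gradient are already available as arrays; the numbers and names below are hypothetical.

```python
import numpy as np

def natural_gradient(G, grad):
    """Return G^{-1} grad by solving G x = grad (no explicit inverse)."""
    return np.linalg.solve(G, grad)

# Hypothetical metric that stretches the first coordinate and shrinks the second.
G = np.array([[4.0, 0.0],
              [0.0, 0.25]])
grad = np.array([1.0, 1.0])

ng = natural_gradient(G, grad)

# One ascent step theta := theta + alpha * natural gradient.
theta = np.zeros(2)
alpha = 0.1
theta_new = theta + alpha * ng
```

Here the ordinary gradient treats both coordinates equally, while the natural gradient rescales each one by the inverse of the metric, which is the mechanism the text credits for escaping plateaus.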

35 Thus, an appropriate choice of the Riemannian metric for the task is required. [sent-71, score-0.042]

36 (II) Considering that the average reward η(θ) in (1) is affected not only by the policy distributions π(a|s;θ) but also by the stationary state distribution $d_\theta(s)$, Morimura et al. [sent-74, score-0.472]

37 [12] proposed the use of the FIM of the state-action joint distribution for RL, $F_{s,a}(\theta) \triangleq E_\theta[\nabla_\theta \ln\{d_\theta(s)\pi(a|s;\theta)\}\,\nabla_\theta \ln\{d_\theta(s)\pi(a|s;\theta)\}^\top] = F_s(\theta) + F_a(\theta)$, (3) where $F_s(\theta) \triangleq \sum_{s\in S} d_\theta(s)\,\nabla_\theta \ln d_\theta(s)\,\nabla_\theta \ln d_\theta(s)^\top$ is the FIM of $d_\theta(s)$. [sent-75, score-1.045]

38 (Footnote 2) The reason for using F(θ) as G(θ) is that the FIM $F_x(\theta)$ is a unique metric matrix of the second-order Taylor expansion of the Kullback-Leibler divergence of Pr(x|θ+Δθ) from Pr(x|θ) [17]. [sent-83, score-0.042]

39 (Footnote 3) Although there were numerical experiments involving the NSG in [12], they computed the NSG analytically with the state transition probabilities and the reward function, which are typically unknown in RL. [sent-84, score-0.174]
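The decomposition in (3) follows because the cross terms between the state and action score functions vanish. The short derivation below is a sketch, assuming that $F_a(\theta)$ denotes the state-averaged FIM of the policy described earlier for Kakade's metric; equation (2) itself is not reproduced in this summary.

```latex
\begin{align*}
E_\theta\!\left[\nabla_\theta \ln d_\theta(s)\,\nabla_\theta \ln \pi(a|s;\theta)^\top\right]
 &= \sum_{s} d_\theta(s)\,\nabla_\theta \ln d_\theta(s)
    \Big(\sum_{a} \pi(a|s;\theta)\,\nabla_\theta \ln \pi(a|s;\theta)\Big)^{\!\top} \\
 &= \sum_{s} d_\theta(s)\,\nabla_\theta \ln d_\theta(s)
    \Big(\nabla_\theta \sum_{a} \pi(a|s;\theta)\Big)^{\!\top} = 0, \\[4pt]
F_{s,a}(\theta)
 &= E_\theta\!\left[(\nabla_\theta \ln d_\theta + \nabla_\theta \ln \pi)
                    (\nabla_\theta \ln d_\theta + \nabla_\theta \ln \pi)^\top\right]
  = F_s(\theta) + F_a(\theta).
\end{align*}
```

The second line uses $\sum_a \pi(a|s;\theta) = 1$ for every state, so its gradient is zero.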

40 Accordingly, the mixing time of M(θ) might be drastically changed with NSG learning compared to NPG learning, since the mixing time depends on the multiple (not necessarily infinite) time-steps rather than the single time-step, i. [sent-86, score-0.038]

41 , while various policies can lead to the same stationary state distribution, Markov chains associated with these policies have different mixing times. [sent-88, score-0.157]

42 A larger mixing time makes it difficult for the learning agent to explore the environment and to estimate the gradient with finite samples. [sent-89, score-0.143]

43 Thus, we consider a mixture of NPG and NSG as a generalized NG (gNG) and propose the approach of ’generalized Natural Actor-Critic’ (gNAC), in which the policy parameter of an actor is updated by an estimate of the gNG of a critic. [sent-91, score-0.373]

44 Then we introduce the estimating functions to build up a foundation for an efficient estimation of the gNG. [sent-93, score-0.115]

45 1 Definition of gNG for RL. In order to define an interpolation between NPG and NSG with a parameter ι ∈ [0, 1], we consider a linear interpolation from the FIM of (2) for the NPG to the FIM of (3) for the NSG, written as $\tilde F_{s,a}(\theta,\iota) \triangleq \iota F_s(\theta) + F_a(\theta)$. (4) [sent-95, score-0.083]

46 Then the natural gradient of the interpolated FIM is $\widetilde{\nabla}_{\tilde F_{s,a}(\theta,\iota)}\eta(\theta) = \tilde F_{s,a}(\theta,\iota)^{-1}\nabla_\theta\eta(\theta)$, (5) which we call the “generalized natural gradient” for RL with the interpolating parameter ι, gNG(ι). [sent-96, score-0.185]

47 When ι is equal to 1/t, this FIM $\tilde F_{s,a}(\theta,\iota)$ is equivalent to the FIM of the t time-steps joint distribution from the stationary state distribution $d_\theta(s)$ on M(θ) [12]. [sent-98, score-0.119]
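The interpolation in (4) and the resulting direction in (5) can be sketched in a few lines. This assumes the two FIMs and the ordinary gradient are given; the matrices below are made-up values chosen only to show that ι = 0 recovers the NPG direction and ι = 1 the NSG direction.

```python
import numpy as np

def generalized_fim(F_s, F_a, iota):
    """Interpolated metric iota * F_s + F_a, with iota restricted to [0, 1]."""
    if not 0.0 <= iota <= 1.0:
        raise ValueError("iota must lie in [0, 1]")
    return iota * F_s + F_a

def generalized_natural_gradient(F_s, F_a, grad, iota):
    """gNG(iota): solve (iota * F_s + F_a) x = grad."""
    return np.linalg.solve(generalized_fim(F_s, F_a, iota), grad)

# Hypothetical FIMs and gradient.
F_s = np.array([[2.0, 0.0], [0.0, 1.0]])
F_a = np.array([[1.0, 0.0], [0.0, 1.0]])
grad = np.array([3.0, 2.0])

gng_npg = generalized_natural_gradient(F_s, F_a, grad, 0.0)  # metric F_a only
gng_nsg = generalized_natural_gradient(F_s, F_a, grad, 1.0)  # metric F_s + F_a
```

Intermediate values of ι give directions between these two endpoints, which is the degree of freedom the gNAC algorithm exposes.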

48 Thus, this interpolation controlled by ι can be interpreted as a continuous interpolation with respect to the time-steps of the joint distribution, so that ι : 1 → 0 is inversely proportional to t : 1 → ∞. [sent-99, score-0.085]

49 2 Estimating Function of gNG(ι) We provide a general view of the estimation of the gNG(ι) using the theory of the estimating function, which provides well-established results for parameter estimation [14]. [sent-102, score-0.147]

50 Proposition 1 The d-dimensional (random) function $g_{\iota,\theta}(s,a;\omega) \triangleq \nabla_\theta \ln\{d_\theta(s)\pi(a|s;\theta)\}\,\big(r - \nabla_\theta \ln\{d_\theta(s)^\iota \pi(a|s;\theta)\}^\top \omega\big)$ (9) is an estimating function for gNG(ι), such that the unique solution of $E_\theta[g_{\iota,\theta}(s,a;\omega)] = 0$ with respect to ω is equal to the gNG(ι). [sent-104, score-0.112]

51 The remaining conditions from (7) and (8), which the estimating function must satisfy, also obviously hold (under Assumption 1). [sent-107, score-0.098]

52 In order to estimate gNG(ι) by using the estimating function (9) with finite T samples on M(θ), the simultaneous equation $\frac{1}{T}\sum_{t=0}^{T-1} g_{\iota,\theta}(s_t,a_t;\omega) = 0$ is solved with respect to ω. [sent-108, score-0.13]

53 The solution $\hat\omega$, which is also called the M-estimator [20], is an unbiased estimate of gNG(ι), so that $\hat\omega = \omega^*$ holds in the limit as T → ∞. [sent-109, score-0.066]
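Because the estimating function (9) is linear in ω, the sample equation reduces to a d × d linear system. The sketch below assumes that per-sample score vectors for the state and action distributions are already available as arrays; the function name is hypothetical, and the data are synthetic and noise-free, chosen only to check the algebra.

```python
import numpy as np

def gng_m_estimate(psi_d, psi_pi, r, iota):
    """Solve (1/T) sum_t g(s_t, a_t; w) = 0 for w.

    psi_d:  (T, d) samples of grad_theta ln d_theta(s_t)   (assumed given)
    psi_pi: (T, d) samples of grad_theta ln pi(a_t|s_t; theta)
    r:      (T,)   immediate rewards
    """
    left = psi_d + psi_pi            # grad ln{d pi}
    right = iota * psi_d + psi_pi    # grad ln{d^iota pi}
    T = len(r)
    A = left.T @ right / T           # sample version of E[left right^T]
    b = left.T @ r / T               # sample version of E[left r]
    return np.linalg.solve(A, b)

# Synthetic check: rewards generated exactly from a known parameter.
rng = np.random.default_rng(0)
psi_d = rng.normal(size=(200, 3))
psi_pi = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
iota = 0.5
r = (iota * psi_d + psi_pi) @ w_true
w_hat = gng_m_estimate(psi_d, psi_pi, r, iota)
```

With real trajectories the rewards are noisy and the state score is itself estimated, which is the situation the auxiliary function and the instrumental variable introduced later are meant to handle.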

54 For that reason, we extend the estimating function using (9) by embedding an auxiliary function to create space for improvement in (9). [sent-113, score-0.19]

55 Lemma 1 The d-dimensional (random) function is an estimating function for gNG(ι), $g_{\iota,\theta}(s,a;\omega) \triangleq \nabla_\theta \ln\{d_\theta(s)\pi(a|s;\theta)\}\,\big(r - \nabla_\theta \ln\{d_\theta(s)^\iota \pi(a|s;\theta)\}^\top \omega - \rho(s,s_{+1})\big)$, (10) where $\rho(s,s_{+1})$ is called the auxiliary function for (9): $\rho(s,s_{+1}) \triangleq c + b(s) - b(s_{+1})$. (11) [sent-114, score-0.208]

56 Let Gθ denote the class of such functions gθ with various auxiliary functions ρ. [sent-118, score-0.092]

57 An optimal auxiliary function, which leads to minimizing the variance of the gNG estimate $\hat\omega$, is defined by the optimality criterion of the estimating functions [22]. [sent-119, score-0.262]

58 An estimating function $g^*_{\iota,\theta}$ is optimal in $G_{\iota,\theta}$ if $\det|\Sigma_{g^*_{\iota,\theta}}| \le \det|\Sigma_{g_{\iota,\theta}}|$, where $\Sigma_{g_\theta} \triangleq E_\theta[g_{\iota,\theta}(s,a;\omega^*)\,g_{\iota,\theta}(s,a;\omega^*)^\top]$. [sent-120, score-0.196]

59 (14) Proof Sketch: The covariance matrix for the criterion of the auxiliary function ρ is approximated as Σgθ ≈ Eθ θ ln{dθ(s)π(a|s;θ)} ˆ Σg θ . [sent-123, score-0.109]

60 Therefore, if the estimating function has an auxiliary function $\rho^*$ satisfying (14), the criterion of the optimality for ρ is minimized, with $\det|\hat\Sigma_{g^*_\theta}| = 0$ due to (15). [sent-131, score-0.279]

61 From Lemma 2, the near optimal auxiliary function $\rho^*$ can be regarded as minimizing the mean squared residuals to zero between R(s, a) and the estimator $\hat R_\rho(s,a;\omega^*) \triangleq \nabla_\theta \ln\{d_\theta(s)^\iota \pi(a|s;\theta)\}^\top \omega^* + \rho(s,s_{+1})$. [sent-132, score-0.184]

62 Corollary 1 Let $b^*_{\iota=0}(s)$ and $c^*_{\iota=0}$ be the functions in the near optimal auxiliary function $\rho^*(s,s_{+1})$ at ι = 0, then $b^*_{\iota=0}(s)$ and $c^*_{\iota=0}$ are equal to the (un-discounted) state value function [23] and the average reward, respectively. [sent-136, score-0.203]

63 Therefore, the following equation, which is the same as the definition of the value function $b^*_{\iota=0}(s)$ with the average reward $c^*_{\iota=0}$ as the solution of the Poisson equation [23], can be derived from (14): [sent-138, score-0.115]

64 $b^*_{\iota=0}(s) + c^*_{\iota=0} = E_\theta[r + b^*_{\iota=0}(s_{+1}) \mid s]$, ∀ s. 4 A Generalized NAC Algorithm. We now propose a useful instrumental variable for the gNG(ι) estimation and then derive a gNAC algorithm along with an algorithm for $\nabla_\theta \ln d_\theta(s)$ estimation. [sent-139, score-0.47]
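For a known tabular model, the Poisson equation above is a small linear system in the state values b and the average reward c. The sketch below assumes an ergodic chain and pins b at the first state to zero to remove the additive ambiguity in b; the function name and the numbers are illustrative assumptions.

```python
import numpy as np

def solve_poisson(P_pi, r_pi):
    """Solve b(s) + c = r_pi(s) + sum_t P_pi[s, t] b(t), with b[0] = 0.

    P_pi: (S, S) state transition matrix under the policy
    r_pi: (S,)   expected immediate reward in each state
    Returns (b, c): differential state values and the average reward.
    """
    n = P_pi.shape[0]
    # Unknowns x = [b(0), ..., b(n-1), c].
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = np.eye(n) - P_pi   # (I - P) b
    A[:n, n] = 1.0                 # + c * 1
    A[n, 0] = 1.0                  # normalization b[0] = 0
    rhs = np.append(r_pi, 0.0)
    x = np.linalg.solve(A, rhs)
    return x[:n], x[n]

# Hypothetical 2-state chain; the reward is the probability of moving to state 1.
P_pi = np.array([[0.55, 0.45],
                 [0.40, 0.60]])
r_pi = np.array([0.45, 0.60])
b, c = solve_poisson(P_pi, r_pi)
```

The recovered c matches the average reward of the chain, consistent with Corollary 1's reading of c at ι = 0.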

65 1 Bias from Estimation of $\nabla_\theta \ln d_\theta(s)$. For computation of the M-estimator of $g_{\iota,\theta}(s,a;\omega)$ as the gNG(ι) estimate on M(θ), the computations of both of the derivatives, $\nabla_\theta \ln\pi(a|s;\theta)$ and $\nabla_\theta \ln d_\theta(s)$, are required. [sent-141, score-0.546]

66 While we can easily compute $\nabla_\theta \ln\pi(a|s;\theta)$ since we have parameterized the policy, we cannot compute the Logarithm stationary State distribution Derivative (LSD) $\nabla_\theta \ln d_\theta(s)$ analytically unless the state transition probabilities and the reward function are known. [sent-142, score-0.459]

67 These LSD estimates $\widehat{\nabla_\theta \ln d_\theta}(s)$ usually have some estimation errors with finite samples, while the estimates are unbiased, so that $\widehat{\nabla_\theta \ln d_\theta}(s) = \nabla_\theta \ln d_\theta(s) + \epsilon(s)$, where $\epsilon(s)$ is a d-dimensional random variable satisfying $E\{\epsilon(s) \mid s\} = 0$. [sent-144, score-0.831]

68 In such cases, the estimate of gNG(ι) from the estimating function (9) or (10) would be biased, because the first condition of (6) for $g_{\iota,\theta}$ is violated unless $E_\theta[\epsilon(s)\epsilon(s)^\top] = 0$. [sent-145, score-0.13]

69 2 Instrumental variables of near optimal estimating function for gNG(ι). We use a linear function to introduce the auxiliary function (defined in (11)), $\rho(s,s_{+1};\nu) \triangleq (\tilde\phi(s) - [\phi(s_{+1})^\top\ 0]^\top)^\top \nu$, (footnote 5: These correspond to the conditions for the estimating function, (6), (7), and (8).) [sent-149, score-0.329]

70 Algorithm: A Generalized Natural Actor-Critic Algorithm with LSLSD(λ). Given: A policy π(a|s;θ) with an adjustable θ and a feature vector function of the state, φ(s). [sent-150, score-0.269]

71 where $\nu \in R^{|S|+1}$ and $\phi(s) \in R^{|S|}$ are the model parameter and the regressor (feature vector function) of the state s, respectively, and $\tilde\phi(s) \triangleq [\phi(s)^\top, 1]^\top$. [sent-155, score-0.08]

72 Accordingly, the whole model parameter of the estimating function is now $[\omega^\top, \nu^\top]^\top$. [sent-157, score-0.113]

73 We propose the following instrumental variable $I(s,a) \triangleq [\nabla_\theta \ln\pi(a|s;\theta)^\top, \tilde\phi(s)^\top]^\top$. [sent-158, score-0.164]
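The linear auxiliary function can be written out directly from the feature definitions. The sketch below assumes one-hot state features, so that the first |S| entries of ν play the role of b(s) and the last entry plays the role of c in (11); the function name and values are hypothetical.

```python
import numpy as np

def rho(phi_s, phi_next, nu):
    """Linear auxiliary function (phi_tilde(s) - [phi(s_+1); 0])^T nu."""
    phi_tilde = np.append(phi_s, 1.0)    # [phi(s); 1]
    shifted = np.append(phi_next, 0.0)   # [phi(s_+1); 0]
    return float((phi_tilde - shifted) @ nu)

# One-hot features for three states (an illustrative choice).
features = np.eye(3)
b_values = np.array([0.0, 1.5, -0.5])    # hypothetical b(s)
c_value = 0.3                            # hypothetical c
nu = np.append(b_values, c_value)

value = rho(features[1], features[2], nu)  # transition from state 1 to state 2
```

With one-hot features the result is exactly c + b(s) − b(s₊₁), so the linear form reproduces (11) without loss; other feature choices would approximate it.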

74 (19) Because this instrumental variable I has the desirable property as shown in Theorem 1, the estimating function gι,θ(s, a; ) with I is a useful function, even if the LSD is estimated. [sent-159, score-0.262]

75 Proof Sketch: (i) The condition (18) for the instrumental variable is satisfied due to Assumption 1. [sent-161, score-0.164]

76 (ii) Considering $E_\theta[\nabla_\theta \ln d_\theta(s)\,\nabla_\theta \ln\pi(a|s;\theta)^\top] = 0$ and Assumption 1, the condition (17) det | Eθ [ gι,θ ] | = 0, is satisfied. [sent-162, score-0.306]

77 The optimal instrumental variable I ∗ (s, a) with respect to the variance minimization is derived straightforwardly with the results of [21, 24]. [sent-167, score-0.178]

78 3 A Generalized Natural Actor-Critic Algorithm with LSLSD. We can straightforwardly derive a generalized Natural Actor-Critic algorithm, gNAC(ι), by solving the estimating function gι,θ(s, a; ) in (20), using the LSD estimate $\widehat{\nabla_\theta \ln d_\theta}(s) \triangleq \Omega\,\phi(s)$. [sent-171, score-0.436]

79 In addition, note that gNAC(ι = 0) is equivalent to a non-episodic NAC algorithm modiﬁed to optimize the average reward, instead of the discounted cumulative reward [1]. [sent-174, score-0.131]

80 This algorithm uses the baseline function in which the state value estimates are estimated by LSTD(0) [25], while the original version did not use any baseline function. [sent-193, score-0.071]

81 The state transition probability function was set by using the Dirichlet distribution Dir(α ∈ R^2) and the uniform distribution U(a; b) generating an integer from 1 to a other than b: we first initialized it such that p(s'|s, a) := 0, ∀ (s', s, a) and then with q(s, a) ∼ Dir(α=[. [sent-196, score-0.055]

82 The reward function $R(s,a,s_{+1})$ was set temporarily with the Gaussian distribution N(µ = 0, σ² = 1) and then normalized so that $\max_\theta \eta(\theta) = 1$ and $\min_\theta \eta(\theta) = -1$: $R(s,a,s_{+1}) := 2\,(R(s,a,s_{+1}) - \min_\theta \eta(\theta))/(\max_\theta \eta(\theta) - \min_\theta \eta(\theta)) - 1$. [sent-199, score-0.099]

83 The policy is represented by the sigmoidal function: $\pi(l|s;\theta) = 1/(1 + \exp(-\theta^\top \phi(s)))$. [sent-200, score-0.269]
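The normalization works because the average reward is an expectation of R, so an affine rescaling of R rescales η(θ) by the same affine map. A minimal sketch, assuming the extreme values of η over θ have already been computed by some other means; the weights and rewards below are made up.

```python
import numpy as np

def normalize_rewards(R, eta_min, eta_max):
    """Affine rescaling that maps eta_min to -1 and eta_max to +1."""
    return 2.0 * (R - eta_min) / (eta_max - eta_min) - 1.0

# Hypothetical rewards and a probability weighting standing in for d * pi * p.
R = np.array([0.0, 2.0, 4.0])
weights = np.array([0.2, 0.3, 0.5])
eta = float(weights @ R)

R_norm = normalize_rewards(R, eta_min=0.0, eta_max=4.0)
eta_norm = float(weights @ R_norm)
```

Since the weights sum to one, the normalized average reward equals the same affine map applied to the original η.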
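The sigmoidal policy has a closed-form score function, which is the quantity the estimating functions need. The sketch below assumes a two-action policy in which the second action takes the remaining probability 1 − π(l|s;θ); that complement, the function names, and the numbers are assumptions for illustration.

```python
import numpy as np

def prob_l(theta, phi_s):
    """pi(l|s; theta) = 1 / (1 + exp(-theta^T phi(s)))."""
    return 1.0 / (1.0 + np.exp(-(theta @ phi_s)))

def score(theta, phi_s, took_l):
    """grad_theta ln pi(a|s; theta) for the two-action sigmoid policy."""
    p = prob_l(theta, phi_s)
    # d/dtheta ln sigma(z) = (1 - sigma) phi;  d/dtheta ln(1 - sigma(z)) = -sigma phi
    return (1.0 - p) * phi_s if took_l else -p * phi_s

theta = np.array([0.5, -1.0])
phi_s = np.array([1.0, 2.0])
p_l = prob_l(theta, phi_s)

# The score has zero mean under the policy.
mean_score = p_l * score(theta, phi_s, True) + (1.0 - p_l) * score(theta, phi_s, False)
```

The zero-mean property of the score is the same fact that makes the cross terms in the joint FIM vanish.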

84 Each ith element of the initial policy parameter θ0 ∈ R|S| and the feature vector of the jth state, φ(sj ) ∈ R|S| , were drawn from N(0, 1) and N(δij , 0. [sent-201, score-0.284]

85 Figure 1 (A) shows the angles between the true gNG(ι) and the gNG(ι) estimates with and without the auxiliary function $\rho(s, s_{+1}; \nu)$ at α := 0 (fixed policy), β := 1, λ := 0. [sent-203, score-0.121]

86 The estimation without the auxiliary function was implemented by solving the estimating function of (9). [sent-204, score-0.207]

87 We can confirm that the estimate using gι,θ(s, a; ) in (20), which implements the near-optimal estimating function, was a much more efficient estimator than the one without the auxiliary function. [sent-205, score-0.237]

88 We thus confirmed that our gNAC(ι > 0) algorithm outperformed the current state-of-the-art NAC algorithm (gNAC(ι = 0)). [sent-208, score-0.054]

89 6 Summary In this paper, we proposed a generalized NG (gNG) learning algorithm that combines two Fisher information matrices for RL. [sent-209, score-0.051]

90 The theory of the estimating function provided insight to prove some important theoretical results from which our proposed gNAC algorithm was derived. [sent-210, score-0.114]

91 Numerical experiments showed that the gNAC algorithm can estimate gNGs efficiently and that it can outperform a current state-of-the-art NAC algorithm. [sent-211, score-0.048]

92 However, it may be possible to use a different criterion, such as the optimality on the Fisher information matrix metric instead of the Euclidean metric. [sent-213, score-0.065]

93 A stochastic reinforcement learning algorithm for learning real-valued functions. [sent-223, score-0.086]

94 Stochastic policy gradient reinforcement learning on a simple 3D biped. [sent-243, score-0.392]

95 Utilizing natural gradient in temporal difference reinforcement learning with eligibility traces. [sent-270, score-0.164]

96 A new natural policy gradient by stationary distribution metric. [sent-285, score-0.431]

97 Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. [sent-293, score-0.44]

98 Information geometry of estimating functions in semi-parametric statistical models. [sent-331, score-0.098]

99 Optimal instrumental variable estimation for linear models with stochastic regressors using estimating functions. [sent-338, score-0.317]

100 Unbiased statistical estimating functions in presence of nuisance parameters. [sent-344, score-0.098]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gng', 0.63), ('gnac', 0.328), ('policy', 0.269), ('ln', 0.257), ('nsg', 0.239), ('fim', 0.214), ('npg', 0.176), ('instrumental', 0.151), ('nac', 0.151), ('pgrl', 0.151), ('st', 0.106), ('reward', 0.099), ('estimating', 0.098), ('auxiliary', 0.092), ('rl', 0.088), ('lsd', 0.088), ('riemannian', 0.079), ('morimura', 0.076), ('gradient', 0.073), ('lslsd', 0.063), ('reinforcement', 0.05), ('amari', 0.05), ('fims', 0.05), ('konda', 0.05), ('uchibe', 0.05), ('det', 0.049), ('stationary', 0.048), ('ng', 0.042), ('metric', 0.042), ('near', 0.041), ('natural', 0.041), ('mdp', 0.041), ('state', 0.04), ('plateau', 0.038), ('regressand', 0.038), ('yoshimoto', 0.038), ('generalized', 0.035), ('pr', 0.035), ('kakade', 0.035), ('accordingly', 0.034), ('interpolation', 0.034), ('estimate', 0.032), ('steepest', 0.028), ('peters', 0.027), ('fisher', 0.026), ('fs', 0.026), ('okinawa', 0.025), ('stagnant', 0.025), ('tetsuro', 0.025), ('policies', 0.025), ('regressor', 0.025), ('optimality', 0.023), ('actor', 0.022), ('outperformed', 0.022), ('action', 0.022), ('rd', 0.021), ('regarded', 0.021), ('numerical', 0.02), ('robots', 0.02), ('ascent', 0.02), ('stochastic', 0.02), ('mixing', 0.019), ('agent', 0.019), ('rewards', 0.019), ('approximator', 0.019), ('manifold', 0.018), ('regressors', 0.018), ('called', 0.018), ('lemma', 0.017), ('degrees', 0.017), ('estimation', 0.017), ('criterion', 0.017), ('satis', 0.017), ('joint', 0.017), ('pg', 0.016), ('unbiased', 0.016), ('algorithm', 0.016), ('average', 0.016), ('ai', 0.016), ('accounts', 0.015), ('states', 0.015), ('transition', 0.015), ('residuals', 0.015), ('dir', 0.015), ('interpolating', 0.015), ('estimates', 0.015), ('immediate', 0.015), ('performances', 0.015), ('actions', 0.015), ('estimator', 0.015), ('parameter', 0.015), ('japan', 0.015), ('fa', 0.015), ('equal', 0.014), ('angles', 0.014), ('straightforwardly', 0.014), ('ordinary', 0.014), ('variable', 0.013), ('euclidean', 0.013), ('markov', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya

2 0.14329486 113 nips-2009-Improving Existing Fault Recovery Policies

Author: Guy Shani, Christopher Meek

Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy.

3 0.13673498 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima: the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed; we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain.

4 0.12175519 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

Author: Keith Bush, Joelle Pineau

Abstract: Interesting real-world datasets often exhibit nonlinear, noisy, continuous-valued states that are unexplorable, are poorly described by first principles, and are only partially observable. If partial observability can be overcome, these constraints suggest the use of model-based reinforcement learning. We experiment with manifold embeddings to reconstruct the observable state-space in the context of offline, model-based reinforcement learning. We demonstrate that the embedding of a system can change as a result of learning, and we argue that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system. We apply this approach to learn a neurostimulation policy that suppresses epileptic seizures on animal brain slices.

5 0.095354602 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming

Author: Marek Petrik, Shlomo Zilberstein

Abstract: Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, this approach provably ﬁnds an approximate value function that minimizes the Bellman residual. Solving a bilinear program optimally is NP-hard, but this is unavoidable because the Bellman-residual minimization itself is NP-hard. We therefore employ and analyze a common approximate algorithm for bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on a simple benchmark problem. 1 Motivation Solving large Markov Decision Problems (MDPs) is a very useful, but computationally challenging problem addressed widely in the AI literature, particularly in the area of reinforcement learning. It is widely accepted that large MDPs can only be solved approximately. The commonly used approximation methods can be divided into three broad categories: 1) policy search, which explores a restricted space of all policies, 2) approximate dynamic programming, which searches a restricted space of value functions, and 3) approximate linear programming, which approximates the solution using a linear program. While all of these methods have achieved impressive results in many domains, they have signiﬁcant limitations. Policy search methods rely on local search in a restricted policy space. The policy may be represented, for example, as a ﬁnite-state controller [22] or as a greedy policy with respect to an approximate value function [24]. Policy search methods have achieved impressive results in such domains as Tetris [24] and helicopter control [1]. However, they are notoriously hard to analyze. 
We are not aware of any theoretical guarantees regarding the quality of the solution. Approximate dynamic programming (ADP) methods iteratively approximate the value function [4, 20, 23]. They have been extensively analyzed and are the most commonly used methods. However, ADP methods typically do not converge and they only provide weak guarantees of approximation quality. The approximation error bounds are usually expressed in terms of the worst-case approximation of the value function over all policies [4]. In addition, most available bounds are with respect to the L∞ norm, while the algorithms often minimize the L2 norm. While there exist some L2 -based bounds [14], they require values that are difﬁcult to obtain. Approximate linear programming (ALP) uses a linear program to compute the approximate value function in a particular vector space [7]. ALP has been previously used in a wide variety of settings [2, 9, 10]. Although ALP often does not perform as well as ADP, there have been some recent 1 efforts to close the gap [18]. ALP has better theoretical properties than ADP and policy search. It is guaranteed to converge and return the closest L1 -norm approximation v of the optimal value func˜ tion v ∗ up to a multiplicative factor. However, the L1 norm must be properly weighted to guarantee a small policy loss, and there is no reliable method for selecting appropriate weights [7]. To summarize, the existing reinforcement learning techniques often provide good solutions, but typically require signiﬁcant domain knowledge [20]. The domain knowledge is needed partly because useful a priori error bounds are not available, as mentioned above. Our goal is to develop a more robust method that is guaranteed to minimize an actual bound on the policy loss. We present a new formulation of value function approximation that provably minimizes a bound on the policy loss. Unlike in some other algorithms, the bound in this case does not rely on values that are hard to obtain. 
The new method uniﬁes policy search and value-function search methods to minimize the L∞ norm of the Bellman residual, which bounds the policy loss. We start with a description of the framework and notation in Section 2. Then, in Section 3, we describe the proposed Approximate Bilinear Programming (ABP) formulation. A drawback of this formulation is its computational complexity, which may be exponential. We show in Section 4 that this is unavoidable, because minimizing the approximation error bound is in fact NP-hard. Although our focus is on the formulation and its properties, we also discuss some simple algorithms for solving bilinear programs. Section 5 shows that ABP can be seen as an improvement of ALP and Approximate Policy Iteration (API). Section 6 demonstrates the applicability of ABP using a common reinforcement learning benchmark problem. A complete discussion of sampling strategies–an essential component for achieving robustness–is beyond the scope of this paper, but the issue is brieﬂy discussed in Section 6. Complete proofs of the theorems can be found in [19]. 2 Solving MDPs using ALP In this section, we formally deﬁne MDPs, their ALP formulation, and the approximation errors involved. These notions serve as a basis for developing the ABP formulation. A Markov Decision Process is a tuple (S, A, P, r, α), where S is the ﬁnite set of states, A is the ﬁnite set of actions. P : S × S × A → [0, 1] is the transition function, where P (s , s, a) represents the probability of transiting to state s from state s, given action a. The function r : S × A → R is the reward function, and α : S → [0, 1] is the initial state distribution. The objective is to maximize the inﬁnite-horizon discounted cumulative reward. To shorten the notation, we assume an arbitrary ordering of the states: s1 , s2 , . . . , sn . Then, Pa and ra are used to denote the probabilistic transition matrix and reward for action a. 
The solution of an MDP is a policy π : S × A → [0, 1] from a set of possible policies Π, such that $\sum_{a \in A} \pi(s, a) = 1$ for all s ∈ S. We assume that the policies may be stochastic, but stationary [21]. A policy is deterministic when π(s, a) ∈ {0, 1} for all s ∈ S and a ∈ A. The transition and reward functions for a given policy are denoted by $P_\pi$ and $r_\pi$. The value function update for a policy π is denoted by $L_\pi$, and the Bellman operator is denoted by L. That is, with γ denoting the discount factor:

\[ L_\pi v = \gamma P_\pi v + r_\pi, \qquad L v = \max_{\pi \in \Pi} L_\pi v. \]

The optimal value function, denoted $v^*$, satisfies $v^* = L v^*$.

We focus on linear value function approximation for discounted infinite-horizon problems. In linear value function approximation, the value function is represented as a linear combination of nonlinear basis functions (vectors). For each state s, we define a row-vector φ(s) of features. The rows of the basis matrix M correspond to φ(s), and the approximation space is generated by the columns of the matrix. That is, the basis matrix M and the value function v are represented as:

\[ M = \begin{pmatrix} - & \phi(s_1) & - \\ - & \phi(s_2) & - \\ & \vdots & \end{pmatrix}, \qquad v = M x. \]

Definition 1. A value function v is representable if $v \in \mathcal{M} \subseteq \mathbb{R}^{|S|}$, where $\mathcal{M} = \operatorname{colspan}(M)$, and is transitive-feasible when v ≥ Lv. We denote the set of transitive-feasible value functions as:

\[ \mathcal{K} = \{ v \in \mathbb{R}^{|S|} : v \ge L v \}. \]

Notice that the optimal value function $v^*$ is transitive-feasible, and $\mathcal{M}$ is a linear space. Also, all the inequalities are element-wise.

Because the new formulation is related to ALP, we introduce it first. It is well known that an infinite-horizon discounted MDP problem may be formulated in terms of solving the following linear program:

\[ \begin{aligned} \min_{v} \quad & \sum_{s \in S} c(s)\, v(s) \\ \text{s.t.} \quad & v(s) - \gamma \sum_{s' \in S} P(s', s, a)\, v(s') \ge r(s, a) \quad \forall (s, a) \in (S, A) \end{aligned} \tag{1} \]

We use A as a shorthand notation for the constraint matrix and b for the right-hand side. The value c represents a distribution over the states, usually a uniform one. That is, $\sum_{s \in S} c(s) = 1$. The linear program in Eq.
(1) is often too large to be solved precisely, so it is approximated to get an approximate linear program by assuming that $v \in \mathcal{M}$ [8], as follows:

\[ \begin{aligned} \min_{x} \quad & c^T v \\ \text{s.t.} \quad & A v \ge b \\ & v \in \mathcal{M} \end{aligned} \tag{2} \]

The constraint $v \in \mathcal{M}$ denotes the approximation. To actually solve this linear program, the value function is represented as v = Mx. In the remainder of the paper, we assume that $1 \in \mathcal{M}$ to guarantee the feasibility of the ALP, where 1 is a vector of all ones. The optimal solution of the ALP, $\tilde v$, satisfies $\tilde v \ge v^*$. Then, the objective of Eq. (2) represents the minimization of $\|\tilde v - v^*\|_{1,c}$, where $\|\cdot\|_{1,c}$ is a c-weighted L1 norm [7].

The ultimate goal of the optimization is not to obtain a good value function $\tilde v$, but a good policy. The quality of the policy, typically chosen to be greedy with respect to $\tilde v$, depends non-trivially on the approximate value function. The ABP formulation will minimize policy loss by minimizing $\|L\tilde v - \tilde v\|_\infty$, which bounds the policy loss as follows.

Theorem 2 (e.g. [25]). Let $\tilde v$ be an arbitrary value function, and let $\hat v$ be the value of the greedy policy with respect to $\tilde v$. Then:

\[ \| v^* - \hat v \|_\infty \le \frac{2}{1 - \gamma} \| L\tilde v - \tilde v \|_\infty . \]

In addition, if $\tilde v \ge L\tilde v$, the policy loss is smallest for the greedy policy.

Policies, like value functions, can be represented as vectors. Assume an arbitrary ordering of the state-action pairs, such that $o : S \times A \to \mathbb{N}$ maps a state and an action to its position. The policies are represented as $\theta \in \mathbb{R}^{|S| \times |A|}$, and we use the shorthand notation θ(s, a) = θ(o(s, a)).

Remark 3. The corresponding π and θ are denoted as $\pi^\theta$ and $\theta^\pi$ and satisfy: $\pi^\theta(s, a) = \theta^\pi(s, a)$.

We will also consider approximations of the policies in the policy-space, generated by columns of a matrix N. A policy is representable when $\pi \in \mathcal{N}$, where $\mathcal{N} = \operatorname{colspan}(N)$.

3 Approximate Bilinear Programs

This section shows how to formulate $\min_{v \in \mathcal{M}} \|Lv - v\|_\infty$ as a separable bilinear program.
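The quantity to be minimized, the L∞ norm of the Bellman residual for a representable v = Mx, can be evaluated directly on a small example, and the bound of Theorem 2 can be checked numerically. The sketch below is only an illustration under stated assumptions: the two-state MDP, the discount factor, the single constant feature, and all helper names are hypothetical, and the exact values of the greedy policy and of v* are approximated by plain fixed-point iteration.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: P[a][s][t] is the probability of s -> t under action a,
# R[a][s] is the reward r(s, a).
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.2, 0.8], [0.7, 0.3]]]
R = [[0.0, 1.0], [-0.1, 0.5]]
STATES, ACTIONS = range(2), range(2)


def q_values(v):
    """Q[s][a] = r(s, a) + gamma * sum_t P(t | s, a) v(t)."""
    return [[R[a][s] + GAMMA * sum(P[a][s][t] * v[t] for t in STATES)
             for a in ACTIONS] for s in STATES]


def bellman(v):
    """Bellman operator: (Lv)(s) = max_a Q[s][a]."""
    return [max(row) for row in q_values(v)]


def greedy(v):
    """Deterministic greedy policy with respect to v (first maximizer)."""
    return [max(ACTIONS, key=lambda a: row[a]) for row in q_values(v)]


def sup_dist(u, v):
    """L-infinity distance between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def policy_value(policy, sweeps=2000):
    """Approximate the value of a deterministic policy by iterating L_pi."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        q = q_values(v)
        v = [q[s][policy[s]] for s in STATES]
    return v


def optimal_value(sweeps=2000):
    """Approximate v* by value iteration."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = bellman(v)
    return v


M = [[1.0], [1.0]]  # basis matrix with one constant feature (assumption)


def represent(x):
    """Representable value function v = M x."""
    return [sum(M[s][j] * x[j] for j in range(len(x))) for s in STATES]


v_tilde = represent([3.0])
residual = sup_dist(bellman(v_tilde), v_tilde)      # ||L v~ - v~||_inf
policy_loss = sup_dist(optimal_value(), policy_value(greedy(v_tilde)))
bound = 2.0 * residual / (1.0 - GAMMA)              # right side of Theorem 2
```

On this toy instance `policy_loss` stays below `bound`, as the theorem requires for any value function; the sketch does not search over x, it only evaluates the objective at one point.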
Bilinear programs are a generalization of linear programs with an additional bilinear term in the objective function. Separable bilinear programs consist of two linear programs with independent constraints and are fairly easy to solve and analyze.

Definition 4 (Separable Bilinear Program). A separable bilinear program in the normal form is defined as follows:

\[ \begin{aligned} \min_{w, x \,\mid\, y, z} \quad & f(w, x, y, z) = s_1^T w + r_1^T x + x^T C y + r_2^T y + s_2^T z \\ \text{s.t.} \quad & A_1 x + B_1 w = b_1 \qquad A_2 y + B_2 z = b_2 \\ & w, x \ge 0 \qquad\qquad\quad\; y, z \ge 0 \end{aligned} \tag{3} \]

We separate the variables using a vertical line and the constraints using different columns to emphasize the separable nature of the bilinear program. In this paper, we only use separable bilinear programs and refer to them simply as bilinear programs.

An approximate bilinear program can now be formulated as follows.

\[ \begin{aligned} \min_{\theta \,\mid\, \lambda, \lambda', v} \quad & \theta^T \lambda + \lambda' \\ \text{s.t.} \quad & B\theta = 1 \qquad z = Av - b \\ & \theta \ge 0 \qquad\;\; z \ge 0 \\ & \qquad\qquad\quad\; \lambda + \lambda' 1 \ge z \\ & \qquad\qquad\quad\; \lambda \ge 0 \\ & \theta \in \mathcal{N} \qquad\; v \in \mathcal{M} \end{aligned} \tag{4} \]

All variables are vectors except λ′, which is a scalar. The symbol z is only used to simplify the notation and does not need to represent an optimization variable. The variable v is defined for each state and represents the value function. Matrix A represents constraints that are identical to the constraints in Eq. (2). The variables λ correspond to all state-action pairs. These variables represent the Bellman residuals that are being minimized. The variables θ are defined for all state-action pairs and represent policies in Remark 3. The matrix B represents the following constraints:

\[ \sum_{a \in A} \theta(s, a) = 1 \quad \forall s \in S. \]

As with approximate linear programs, we initially assume that all the constraints on z are used. In realistic settings, however, the constraints would be sampled or somehow reduced. We defer the discussion of this issue until Section 6. Note that the constraints in our formulation correspond to elements of z and θ. Thus, when constraints are omitted, the corresponding elements of z and θ are omitted as well.
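To see why the objective of Eq. (4) measures the Bellman residual, it helps to fix a transitive-feasible v and enumerate the deterministic policies θ. The sketch below rests on our own reading of the formulation, not on a derivation given in the text: for a deterministic θ, setting λ to zero on the selected state-action pairs and λ′ to the largest selected slack z(s, a) is optimal, so the inner minimum equals the largest selected slack. The toy MDP, the discount factor, and all function names are hypothetical.

```python
from itertools import product

GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: P[a][s][t] is the probability of s -> t under action a,
# R[a][s] is the reward r(s, a).
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.2, 0.8], [0.7, 0.3]]]
R = [[0.0, 1.0], [-0.1, 0.5]]
STATES, ACTIONS = range(2), range(2)


def slack(v):
    """z(s, a) = v(s) - gamma * sum_t P(t | s, a) v(t) - r(s, a), i.e. z = Av - b."""
    return [[v[s] - GAMMA * sum(P[a][s][t] * v[t] for t in STATES) - R[a][s]
             for a in ACTIONS] for s in STATES]


def inner_objective(policy, z):
    """Minimum of theta^T lambda + lambda' for a fixed deterministic policy.

    Assumption of this sketch: with z >= 0, the minimum is attained at
    lambda = 0 on the selected pairs and lambda' = max selected slack.
    """
    return max(z[s][policy[s]] for s in STATES)


def abp_objective(v):
    """Best objective of Eq. (4) over deterministic policies for a fixed v."""
    z = slack(v)
    if any(x < -1e-12 for row in z for x in row):
        raise ValueError("v is not transitive-feasible (z >= 0 is violated)")
    return min(inner_objective(pi, z)
               for pi in product(ACTIONS, repeat=len(STATES)))


def bellman_residual_norm(v):
    """||v - Lv||_inf, using (v - Lv)(s) = min_a z(s, a)."""
    return max(abs(min(row)) for row in slack(v))


v = [12.0, 12.0]  # transitive-feasible for this toy MDP
abp_value = abp_objective(v)
residual_norm = bellman_residual_norm(v)
```

For this v the two quantities coincide, which matches the role the text assigns to λ and λ′ as a representation of the Bellman residual. Enumerating policies is exponential and is used here only because the example has four of them.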
To simplify the notation, the value function approximation in this problem is denoted only implicitly by $v \in \mathcal{M}$, and the policy approximation is denoted by $\theta \in \mathcal{N}$. In an actual implementation, the optimization variables would be x, y using the relationships v = Mx and θ = Ny. We do not assume any approximation of the policy space, unless mentioned otherwise. We also use v or θ to refer to partial solutions of Eq. (4) with the other variables chosen appropriately to achieve feasibility.

The ABP formulation is closely related to approximate linear programs, and we discuss the connection in Section 5. We first analyze the properties of the optimal solutions of the bilinear program and then show and discuss the solution methods in Section 4. The following theorem states the main property of the bilinear formulation.

Theorem 5. Let $(\tilde\theta, \tilde v, \tilde\lambda, \tilde\lambda')$ be an optimal solution of Eq. (4) and assume that $1 \in \mathcal{M}$. Then:

\[ \tilde\theta^T \tilde\lambda + \tilde\lambda' = \| L\tilde v - \tilde v \|_\infty \le \min_{v \in \mathcal{K} \cap \mathcal{M}} \| Lv - v \|_\infty \le 2 \min_{v \in \mathcal{M}} \| Lv - v \|_\infty \le 2 (1 + \gamma) \min_{v \in \mathcal{M}} \| v - v^* \|_\infty . \]

In addition, $\pi^{\tilde\theta}$ minimizes the Bellman residual with regard to $\tilde v$, and its value function $\hat v$ satisfies:

\[ \| \hat v - v^* \|_\infty \le \frac{2}{1 - \gamma} \min_{v \in \mathcal{M}} \| Lv - v \|_\infty . \]

The proof of the theorem can be found in [19]. It is important to note that, as Theorem 5 states, the ABP approach is equivalent to a minimization over all representable value functions, not only the transitive-feasible ones. Notice also the missing coefficient 2 (2 instead of 4) in the last equation of Theorem 5. This follows by subtracting a constant multiple of the vector 1 from $\tilde v$ to balance the lower bounds on the Bellman residual error with the upper ones. This modified approximate value function will have 1/2 of the original Bellman residual but an identical greedy policy. Finally, note that whenever $v^* \in \mathcal{M}$, both ABP and ALP will return the optimal value function.
The ABP solution minimizes the L∞ norm of the Bellman residual due to: 1) the correspondence between θ and the policies, and 2) the dual representation with respect to variables λ and λ′. The theorem then follows using techniques similar to those used for approximate linear programs [7].

Algorithm 1: Iterative algorithm for solving Eq. (3)
    (x_0, w_0) ← random ;
    (y_0, z_0) ← arg min_{y,z} f(w_0, x_0, y, z) ;
    i ← 1 ;
    while y_{i−1} ≠ y_i or x_{i−1} ≠ x_i do
        (y_i, z_i) ← arg min_{ {y,z : A_2 y + B_2 z = b_2, y,z ≥ 0} } f(w_{i−1}, x_{i−1}, y, z) ;
        (x_i, w_i) ← arg min_{ {x,w : A_1 x + B_1 w = b_1, x,w ≥ 0} } f(w, x, y_i, z_i) ;
        i ← i + 1
    return f(w_i, x_i, y_i, z_i)

4 Solving Bilinear Programs

In this section we describe simple methods for solving ABPs. We first describe optimal methods, which have exponential complexity, and then discuss some approximation strategies.

Solving a bilinear program is an NP-complete problem [3]. The membership in NP follows from the finite number of basic feasible solutions of the individual linear programs, each of which can be checked in polynomial time. The NP-hardness is shown by a reduction from the SAT problem [3]. The NP-completeness of ABP compares unfavorably with the polynomial complexity of ALP. However, most other ADP algorithms are not guaranteed to converge to a solution in finite time.

The following theorem shows that the computational complexity of the ABP formulation is asymptotically the same as the complexity of the problem it solves.

Theorem 6. Determining $\min_{v \in \mathcal{K} \cap \mathcal{M}} \| Lv - v \|_\infty < \epsilon$ is NP-complete for the full constraint representation, 0 < γ < 1, and a given ε > 0. In addition, the problem remains NP-complete when $1 \in \mathcal{M}$, and therefore $\min_{v \in \mathcal{M}} \| Lv - v \|_\infty < \epsilon$ is also NP-complete.

As the theorem states, the value function approximation does not become computationally simpler even when $1 \in \mathcal{M}$, a universal assumption in the paper. Notice that ALP can determine whether $\min_{v \in \mathcal{K} \cap \mathcal{M}} \| Lv - v \|_\infty = 0$ in polynomial time.
The proof of Theorem 6 is based on a reduction from SAT and can be found in [19]. The policy in the reduction determines the true literal in each clause, and the approximate value function corresponds to the truth value of the literals. The approximation basis forces literals that share the same variable to have consistent values. Bilinear programs are non-convex and are typically solved using global optimization techniques. The common solution methods are based on concave cuts [11] or branch-and-bound [6]. In ABP settings with a small number of features, the successive approximation algorithm [17] may be applied efﬁciently. We are, however, not aware of commercial solvers available for solving bilinear programs. Bilinear programs can be formulated as concave quadratic minimization problems [11], or mixed integer linear programs [11, 16], for which there are numerous commercial solvers available. Because we are interested in solving very large bilinear programs, we describe simple approximate algorithms next. Optimal scalable methods are beyond the scope of this paper. The most common approximate method for solving bilinear programs is shown in Algorithm 1. It is designed for the general formulation shown in Eq. (3), where f (w, x, y, z) represents the objective function. The minimizations in the algorithm are linear programs which can be easily solved. Interestingly, as we will show in Section 5, Algorithm 1 applied to ABP generalizes a version of API. While Algorithm 1 is not guaranteed to ﬁnd an optimal solution, its empirical performance is often remarkably good [13]. Its basic properties are summarized by the following proposition. Proposition 7 (e.g. [3]). Algorithm 1 is guaranteed to converge, assuming that the linear program solutions are in a vertex of the optimality simplex. In addition, the global optimum is a ﬁxed point of the algorithm, and the objective value monotonically improves during execution. 
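The alternating scheme of Algorithm 1 can be sketched in a few lines. To avoid depending on a linear programming solver, the sketch below restricts Eq. (3) to a special case chosen for illustration only: the variables w and z are absent, and x and y each range over a probability simplex, so each linear subproblem is solved exactly by picking the cheapest vertex. The matrix C, the starting point, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(x, y, C, r1, r2):
    """f(x, y) = r1^T x + x^T C y + r2^T y."""
    bilinear = sum(x[i] * C[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    return dot(r1, x) + bilinear + dot(r2, y)


def best_vertex(cost):
    """Solve min cost^T p over the probability simplex: a unit vector at the
    smallest cost entry is optimal (ties go to the lowest index)."""
    k = min(range(len(cost)), key=lambda i: cost[i])
    return [1.0 if i == k else 0.0 for i in range(len(cost))]


def best_y(x, C, r2):
    """Linear program in y for fixed x."""
    return best_vertex([r2[j] + sum(x[i] * C[i][j] for i in range(len(x)))
                        for j in range(len(r2))])


def best_x(y, C, r1):
    """Linear program in x for fixed y."""
    return best_vertex([r1[i] + dot(C[i], y) for i in range(len(r1))])


def alternate(C, r1, r2, x0, max_iters=100):
    """Alternate the two linear programs until neither block changes."""
    x, y = list(x0), best_y(x0, C, r2)
    history = [objective(x, y, C, r1, r2)]
    for _ in range(max_iters):
        y_new = best_y(x, C, r2)
        x_new = best_x(y_new, C, r1)
        history.append(objective(x_new, y_new, C, r1, r2))
        if x_new == x and y_new == y:
            break  # fixed point reached
        x, y = x_new, y_new
    return x, y, history


C = [[1.0, -2.0], [3.0, 0.0]]
x, y, history = alternate(C, [0.0, 0.0], [0.0, 0.0], [0.5, 0.5])
```

The recorded `history` never increases, in line with Proposition 7, and both blocks end at vertices. A fixed point found this way need not be a global optimum in general; the `max_iters` guard is an extra safety measure added here, not part of Algorithm 1.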
The proof is based on the finite count of the basic feasible solutions of the individual linear programs. Because the objective function does not increase in any iteration, the algorithm will eventually converge.

In the context of MDPs, Algorithm 1 can be further refined. For example, the constraint $v \in \mathcal{M}$ in Eq. (4) serves mostly to simplify the bilinear program, and a value function that violates it may still be acceptable. The following proposition motivates the construction of a new value function from two transitive-feasible value functions.

Proposition 8. Let $\tilde v_1$ and $\tilde v_2$ be feasible value functions in Eq. (4). Then the value function $\tilde v(s) = \min\{\tilde v_1(s), \tilde v_2(s)\}$ is also feasible in Eq. (4). Therefore $\tilde v \ge v^*$ and

\[ \| v^* - \tilde v \|_\infty \le \min \{ \| v^* - \tilde v_1 \|_\infty , \| v^* - \tilde v_2 \|_\infty \}. \]

The proof of the proposition is based on Jensen's inequality and can be found in [19].

Proposition 8 can be used to extend Algorithm 1 when solving ABPs. One option is to take the state-wise minimum of values from multiple random executions of Algorithm 1, which preserves the transitive feasibility of the value function. However, the increasing number of value functions used to obtain $\tilde v$ also increases the potential sampling error.

5 Relationship to ALP and API

In this section, we describe the important connections between ABP and the two closely related ADP methods: ALP, and API with L∞ minimization. Both of these methods are commonly used, for example to solve factored MDPs [10]. Our analysis sheds light on some of their observed properties and leads to a new convergent form of API.

ABP addresses some important issues with ALP: 1) ALP provides value function bounds with respect to the L1 norm, which does not guarantee small policy loss, 2) ALP's solution quality depends significantly on the heuristically chosen objective function c in Eq. (2) [7], and 3) incomplete constraint samples in ALP easily lead to unbounded linear programs.
The drawback of using ABP, however, is the higher computational complexity.

Both the first and the second issues in ALP can be addressed by choosing the right objective function [7]. Because this objective function depends on the optimal ALP solution, it cannot be practically computed. Instead, various heuristics are usually used. The heuristic objective functions may lead to significant improvements in specific domains, but they do not provide any guarantees. ABP, on the other hand, has no such parameters that require adjustments.

The third issue arises when the constraints of an ALP need to be sampled in some large domains. The ALP may become unbounded with incomplete samples because its objective value is defined using the L1 norm on the states, and the constraints are defined using the L∞ norm of the Bellman residual. In ABP, the Bellman residual is used in both the constraints and objective function. The objective function of ABP is then bounded below by 0 for an arbitrarily small number of samples.

ABP can also improve on API with L∞ minimization (L∞-API for short), which is a leading method for solving factored MDPs [10]. Minimizing the L∞ approximation error is theoretically preferable, since it is compatible with the existing bounds on policy loss [10]. In contrast, few practical bounds exist for API with the L2 norm minimization [14], such as LSPI [12]. L∞-API is shown in Algorithm 2, where f(π) is calculated using the following program:

\[ \begin{aligned} \min_{\phi, v} \quad & \phi \\ \text{s.t.} \quad & (I - \gamma P_\pi) v + 1\phi \ge r_\pi \\ & -(I - \gamma P_\pi) v + 1\phi \ge -r_\pi \\ & v \in \mathcal{M} \end{aligned} \tag{5} \]

Here I denotes the identity matrix. We are not aware of a convergence or a divergence proof of L∞-API, and this analysis is beyond the scope of this paper.

Algorithm 2: Approximate policy iteration, where f(π) denotes a custom value function approximation for the policy π.
    π_0, k ← rand, 1 ;
    while π_k ≠ π_{k−1} do
        ṽ_k ← f(π_{k−1}) ;
        π_k(s) ← arg max_{a∈A} r(s, a) + γ Σ_{s′∈S} P(s′, s, a) ṽ_k(s′)   ∀s ∈ S ;
        k ← k + 1

We propose Optimistic Approximate Policy Iteration (OAPI), a modification of API. OAPI is shown in Algorithm 2, where f(π) is calculated using the following program:

\[ \begin{aligned} \min_{\phi, v} \quad & \phi \\ \text{s.t.} \quad & A v \ge b \quad (\equiv (I - \gamma P_\pi) v \ge r_\pi \;\; \forall \pi \in \Pi) \\ & -(I - \gamma P_\pi) v + 1\phi \ge -r_\pi \\ & v \in \mathcal{M} \end{aligned} \tag{6} \]

In fact, OAPI corresponds to Algorithm 1 applied to ABP because Eq. (6) corresponds to Eq. (4) with fixed θ. Then, using Proposition 7, we get the following corollary.

Corollary 9. Optimistic approximate policy iteration converges in finite time. In addition, the Bellman residual of the generated value functions monotonically decreases.

OAPI differs from L∞-API in two ways: 1) OAPI constrains the Bellman residuals by 0 from below and by φ from above, and then it minimizes φ. L∞-API constrains the Bellman residuals by φ from both above and below. 2) OAPI, like API, uses only the current policy for the upper bound on the Bellman residual, but uses all the policies for the lower bound on the Bellman residual.

L∞-API cannot return an approximate value function that has a lower Bellman residual than ABP, given the optimality of ABP described in Theorem 5. However, even OAPI, an approximate ABP algorithm, performs comparably to L∞-API, as the following theorem states.

Theorem 10. Assume that L∞-API converges to a policy π and a value function v that both satisfy: $\phi = \| v - L_\pi v \|_\infty = \| v - Lv \|_\infty$. Then $\tilde v = v + \frac{\phi}{1 - \gamma} 1$ is feasible in Eq. (4), and it is a fixed point of OAPI. In addition, the greedy policies with respect to $\tilde v$ and v are identical.

The proof is based on two facts. First, $\tilde v$ is feasible with respect to the constraints in Eq. (4). The Bellman residual changes for all the policies identically, since a constant vector is added. Second, because $L_\pi$ is greedy with respect to $\tilde v$, we have that $\tilde v \ge L_\pi \tilde v \ge L\tilde v$. The value function $\tilde v$ is therefore transitive-feasible.
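The constant shift used in Theorem 10 can be checked numerically: adding c·1 to v raises v − Lv by (1 − γ)c in every state and leaves the greedy policy unchanged. The sketch below illustrates only this shift identity on a hypothetical two-state MDP; it does not run L∞-API, so the starting v is an arbitrary vector rather than an L∞-API fixed point, and all names and numbers are assumptions.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: P[a][s][t] is the probability of s -> t under action a,
# R[a][s] is the reward r(s, a).
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.2, 0.8], [0.7, 0.3]]]
R = [[0.0, 1.0], [-0.1, 0.5]]
STATES, ACTIONS = range(2), range(2)


def q_values(v):
    """Q[s][a] = r(s, a) + gamma * sum_t P(t | s, a) v(t)."""
    return [[R[a][s] + GAMMA * sum(P[a][s][t] * v[t] for t in STATES)
             for a in ACTIONS] for s in STATES]


def greedy(v):
    """Deterministic greedy policy with respect to v (first maximizer)."""
    return [max(ACTIONS, key=lambda a: row[a]) for row in q_values(v)]


def residual(v):
    """State-wise v - Lv; non-negative exactly when v is transitive-feasible."""
    return [v[s] - max(row) for s, row in enumerate(q_values(v))]


v = [0.0, 5.0]                              # arbitrary starting value function
phi = max(abs(d) for d in residual(v))      # ||v - Lv||_inf
v_shifted = [x + phi / (1.0 - GAMMA) for x in v]

shift_in_residual = [a - b for a, b in zip(residual(v_shifted), residual(v))]
```

Every entry of `shift_in_residual` equals φ up to rounding, so the shifted function satisfies v ≥ Lv, which is the feasibility property the proof sketch relies on.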
The full proof can be found in [19].

To summarize, OAPI guarantees convergence, while matching the performance of L∞-API. The convergence of OAPI is achieved because, given a non-negative Bellman residual, the greedy policy also minimizes the Bellman residual. Because OAPI ensures that the Bellman residual is always non-negative, it can progressively reduce it. In comparison, the greedy policy in L∞-API does not minimize the Bellman residual, and therefore L∞-API does not always reduce it.

Theorem 10 also explains why API provides better solutions than ALP, as observed in [10]. From the discussion above, ALP can be seen as an L1-norm approximation of a single iteration of OAPI. L∞-API, on the other hand, performs many such ALP-like iterations.

6 Empirical Evaluation

As we showed in Theorem 10, even OAPI, the very simple approximate algorithm for ABP, can perform as well as existing state-of-the-art methods on factored MDPs. However, a deeper understanding of the formulation and potential solution methods will be necessary in order to determine the full practical impact of the proposed methods. In this section, we validate the approach by applying it to the mountain car problem, a simple reinforcement learning benchmark problem.

(a) L∞ error of the Bellman residual

Features    100            144
OAPI        0.21 (0.23)    0.13 (0.1)
ALP         13. (13.)      3.6 (4.3)
LSPI        9. (14.)       3.9 (7.7)
API         0.46 (0.08)    0.86 (1.18)

(b) L2 error of the Bellman residual

Features    100            144
OAPI        0.2 (0.3)      0.1 (1.9)
ALP         9.5 (18.)      0.3 (0.4)
LSPI        1.2 (1.5)      0.9 (0.1)
API         0.04 (0.01)    0.08 (0.08)

Table 1: Bellman residual of the final value function. The values are averages over 5 executions, with the standard deviations shown in parentheses.

We have so far considered that all the constraints involving z are present in the ABP in Eq. (4). Because the constraints correspond to all state-action pairs, it is often impractical to even enumerate them. This issue can be addressed in at least two ways.
First, a small randomly selected subset of the constraints can be used in the ABP, a common approach in ALP [9, 5]. The ALP sampling bounds can be easily extended to ABP. Second, the structure of the MDP can be used to reduce the number of constraints. Such a reduction is possible, for example, in factored MDPs with L∞-API and ALP [10], and can be easily extended to OAPI and ABP.

In the mountain-car benchmark, an underpowered car needs to climb a hill [23]. To do so, it first needs to back up to an opposite hill to gain sufficient momentum. The car receives a reward of 1 when it climbs the hill. In the experiments we used a discount factor γ = 0.99.

The experiments are designed to determine whether OAPI reliably minimizes the Bellman residual in comparison with API and ALP. We use a uniformly spaced linear spline to approximate the value function. The constraints were based on 200 uniformly sampled states with all 3 actions per state. We evaluated the methods with 100 and 144 approximation features, which corresponds to the number of linear segments.

The results of ABP (in particular OAPI), ALP, API with L2 minimization, and LSPI are depicted in Table 1. The results are shown for both the L∞ norm and the uniformly weighted L2 norm. The runtimes of all these methods are comparable, with ALP being the fastest. Since API (LSPI) is not guaranteed to converge, we ran it for at most 20 iterations, which was an upper bound on the number of iterations of OAPI. The results demonstrate that ABP minimizes the L∞ Bellman residual much more consistently than the other methods. Note, however, that all the considered algorithms would perform significantly better given a finer approximation.

7 Conclusion and Future Work

We proposed and analyzed approximate bilinear programming, a new value-function approximation method, which provably minimizes the L∞ Bellman residual.
ABP returns the optimal approximate value function with respect to the Bellman residual bounds, despite the formulation with regard to transitive-feasible value functions. We also showed that there is no asymptotically simpler formulation, since finding the closest value function and solving a bilinear program are both NP-complete problems. Finally, the formulation leads to the development of OAPI, a new convergent form of API which monotonically improves the objective value function.

While we only discussed approximate solutions of the ABP, a deeper study of bilinear solvers may render optimal solution methods feasible. ABPs have a small number of essential variables (that determine the value function) and a large number of constraints, which can be leveraged by the solvers [15]. The L∞ error bound provides good theoretical guarantees, but it may be too conservative in practice. A similar formulation based on L2 norm minimization may be more practical.

We believe that the proposed formulation will help to deepen the understanding of value function approximation and the characteristics of existing solution methods, and potentially lead to the development of more robust and widely applicable reinforcement learning algorithms.

Acknowledgements

This work was supported by the Air Force Office of Scientific Research under Grant No. FA955008-1-0171. We also thank the anonymous reviewers for their useful comments.

References

[1] Pieter Abbeel, Varun Ganapathi, and Andrew Y. Ng. Learning vehicular dynamics, with application to modeling helicopters. In Advances in Neural Information Processing Systems, pages 1–8, 2006.
[2] Daniel Adelman. A price-directed approach to stochastic inventory/routing. Operations Research, 52:499–514, 2004.
[3] Kristin P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Technical report, Computer Science Department, University of Wisconsin, 1992.
[4] Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, LIDS, 1997.
[5] Giuseppe Calafiore and M. C. Campi. Uncertain convex programs: Randomized solutions and confidence levels. Mathematical Programming, Series A, 102:25–46, 2005.
[6] Alberto Caprara and Michele Monaci. Bidimensional packing by bilinear programming. Mathematical Programming, Series A, 118:75–108, 2009.
[7] Daniela P. de Farias. The Linear Programming Approach to Approximate Dynamic Programming: Theory and Application. PhD thesis, Stanford University, 2002.
[8] Daniela P. de Farias and Ben Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51:850–856, 2003.
[9] Daniela Pucci de Farias and Benjamin Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004.
[10] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.
[11] Reiner Horst and Hoang Tuy. Global optimization: Deterministic approaches. Springer, 1996.
[12] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[13] O. L. Mangasarian. The linear complementarity problem as a separable bilinear program. Journal of Global Optimization, 12:1–7, 1995.
[14] Remi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, pages 560–567, 2003.
[15] Marek Petrik and Shlomo Zilberstein. Anytime coordination using separable bilinear programs. In Conference on Artificial Intelligence, pages 750–755, 2007.
[16] Marek Petrik and Shlomo Zilberstein. Average reward decentralized Markov decision processes. In International Joint Conference on Artificial Intelligence, pages 1997–2002, 2007.
[17] Marek Petrik and Shlomo Zilberstein. A bilinear programming approach for multiagent planning. Journal of Artificial Intelligence Research, 35:235–274, 2009.
[18] Marek Petrik and Shlomo Zilberstein. Constraint relaxation in approximate linear programs. In International Conference on Machine Learning, pages 809–816, 2009.
[19] Marek Petrik and Shlomo Zilberstein. Robust value function approximation using bilinear programming. Technical Report UM-CS-2009-052, Department of Computer Science, University of Massachusetts Amherst, 2009.
[20] Warren B. Powell. Approximate Dynamic Programming. Wiley-Interscience, 2007.
[21] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 2005.
[22] Kenneth O. Stanley and Risto Miikkulainen. Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research, 21:63–100, 2004.
[23] Richard S. Sutton and Andrew Barto. Reinforcement learning. MIT Press, 1998.
[24] Istvan Szita and Andras Lorincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006.
[25] Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Yale Workshop on Adaptive and Learning Systems, 1994.

6 0.079565763 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

7 0.078234687 242 nips-2009-The Infinite Partially Observable Markov Decision Process

8 0.074935295 134 nips-2009-Learning to Explore and Exploit in POMDPs

9 0.069734327 52 nips-2009-Code-specific policy gradient rules for spiking neurons

10 0.066418931 14 nips-2009-A Parameter-free Hedging Algorithm

11 0.064563312 53 nips-2009-Complexity of Decentralized Control: Special Cases

12 0.060164444 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

13 0.059638586 45 nips-2009-Beyond Convexity: Online Submodular Minimization

14 0.05829775 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

15 0.055849712 221 nips-2009-Solving Stochastic Games

16 0.053877749 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

17 0.053430449 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

18 0.049219377 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

19 0.045623712 94 nips-2009-Fast Learning from Non-i.i.d. Observations

20 0.045285095 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.111), (1, 0.075), (2, 0.144), (3, -0.115), (4, -0.155), (5, 0.02), (6, 0.007), (7, -0.033), (8, 0.007), (9, 0.026), (10, 0.084), (11, 0.085), (12, 0.035), (13, 0.03), (14, -0.001), (15, 0.001), (16, -0.018), (17, -0.026), (18, 0.094), (19, 0.014), (20, -0.031), (21, -0.038), (22, 0.098), (23, -0.047), (24, -0.142), (25, 0.001), (26, -0.095), (27, 0.06), (28, 0.015), (29, 0.02), (30, 0.034), (31, 0.025), (32, 0.021), (33, 0.001), (34, -0.143), (35, -0.014), (36, 0.022), (37, 0.038), (38, -0.069), (39, 0.025), (40, 0.051), (41, 0.03), (42, -0.02), (43, 0.016), (44, -0.05), (45, 0.031), (46, -0.092), (47, -0.066), (48, 0.038), (49, 0.066)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93818218 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya

2 0.75718558 113 nips-2009-Improving Existing Fault Recovery Policies

Author: Guy Shani, Christopher Meek

3 0.70634836 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

Author: Keith Bush, Joelle Pineau

4 0.70630723 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

Author: Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári

Abstract: In this paper we introduce a multi-step linear Dyna-style planning algorithm. The key element of the multi-step linear Dyna is a multi-step linear model that enables multi-step projection of a sampled feature and multi-step planning based on the simulated multi-step transition experience. We propose two multi-step linear models. The ﬁrst iterates the one-step linear model, but is generally computationally complex. The second interpolates between the one-step model and the inﬁnite-step model (which turns out to be the LSTD solution), and can be learned efﬁciently online. Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna, and generally learns faster as the number of projection steps increases. Results on Mountain-car show that multi-step linear Dyna leads to much better online performance than single-step linear Dyna and model-free algorithms; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. Our results also suggest that previous attempts on extending LSTD for online control were unsuccessful because LSTD looks inﬁnite steps into the future, and suffers from the model errors in non-stationary (control) environments.

5 0.66119057 134 nips-2009-Learning to Explore and Exploit in POMDPs

Author: Chenghui Cai, Xuejun Liao, Lawrence Carin

Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration and exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.

6 0.63635141 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

7 0.61051196 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming

8 0.61015379 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

9 0.4777005 16 nips-2009-A Smoothed Approximate Linear Program

10 0.43560931 14 nips-2009-A Parameter-free Hedging Algorithm

11 0.43354809 242 nips-2009-The Infinite Partially Observable Markov Decision Process

12 0.41635966 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

13 0.37297729 94 nips-2009-Fast Learning from Non-i.i.d. Observations

14 0.33517942 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

15 0.32769299 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

16 0.32451287 53 nips-2009-Complexity of Decentralized Control: Special Cases

17 0.32399297 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

18 0.31838989 54 nips-2009-Compositionality of optimal control laws

19 0.30871311 221 nips-2009-Solving Stochastic Games

20 0.30460918 141 nips-2009-Local Rules for Global MAP: When Do They Work?

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.013), (24, 0.081), (25, 0.043), (35, 0.068), (36, 0.076), (37, 0.329), (39, 0.041), (58, 0.061), (61, 0.072), (71, 0.042), (86, 0.044), (91, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82477862 235 nips-2009-Structural inference affects depth perception in the context of potential occlusion

Author: Ian Stevenson, Konrad Koerding

Abstract: In many domains, humans appear to combine perceptual cues in a near-optimal, probabilistic fashion: two noisy pieces of information tend to be combined linearly with weights proportional to the precision of each cue. Here we present a case where structural information plays an important role. The presence of a background cue gives rise to the possibility of occlusion, and places a soft constraint on the location of a target - in effect propelling it forward. We present an ideal observer model of depth estimation for this situation where structural or ordinal information is important and then fit the model to human data from a stereo-matching task. To test whether subjects are truly using ordinal cues in a probabilistic manner we then vary the uncertainty of the task. We find that the model accurately predicts shifts in subjects’ behavior. Our results indicate that the nervous system estimates depth ordering in a probabilistic fashion and estimates the structure of the visual scene during depth perception.

same-paper 2 0.70906717 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya

3 0.69358855 234 nips-2009-Streaming k-means approximation

Author: Nir Ailon, Ragesh Jaiswal, Claire Monteleoni

Abstract: We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very lightweight in terms of memory and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.

4 0.63184381 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

Author: Ruben Coen-cagli, Peter Dayan, Odelia Schwartz

Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor-like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non-linear correlations between the responses of spatially distributed Gabor-like receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion.

5 0.5854966 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1-regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems.

6 0.45356789 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

7 0.45122081 33 nips-2009-Analysis of SVM with Indefinite Kernels

8 0.45102218 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

9 0.44278917 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

10 0.44194821 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

11 0.43950251 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

12 0.43460143 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

13 0.43389788 113 nips-2009-Improving Existing Fault Recovery Policies

14 0.43342882 242 nips-2009-The Infinite Partially Observable Markov Decision Process

15 0.43149939 239 nips-2009-Submodularity Cuts and Applications

16 0.42877641 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

17 0.42822734 3 nips-2009-AUC optimization and the two-sample problem

18 0.42800924 16 nips-2009-A Smoothed Approximate Linear Program

19 0.42745996 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

20 0.42682943 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models