jmlr jmlr2011 jmlr2011-47 knowledge-graph by maker-knowledge-mining

47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments


Source: pdf

Author: Jaedeug Choi, Kee-Eung Kim

Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert’s behavior, namely the case in which the expert’s policy is explicitly given, and the case in which the expert’s trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In these research areas, the reward function is generally assumed to be fixed and known, but it is often non-trivial to come up with an appropriate reward function for each problem. [sent-19, score-0.984]

2 Although it is not easy to transfer the control policy of the agent to other problems that have a similar structure to the original problem, the reward function could be applied since it compactly represents the agent’s objectives and preferences. [sent-27, score-1.022]

3 In practice, the reward function is repeatedly hand-tuned by domain experts until a satisfactory policy is acquired. [sent-43, score-0.931]

4 However, there is no straightforward way to compute the reward function, which should represent the balance among the reward of a successful dialogue, the penalty of an unsuccessful dialogue, and the cost of information gathering. [sent-47, score-0.968]

5 We present algorithms that iteratively find the reward function, comparing the expert’s policy and other policies found by the algorithm. [sent-63, score-1.043]

6 • R : S ×A → R is the reward function, where R(s, a) denotes the immediate reward of executing action a in state s, whose absolute value is bounded by Rmax . [sent-74, score-1.082]

7 A policy in MDP is defined as a mapping π : S → A, where π(s) = a denotes that action a is always executed in state s following the policy π. [sent-76, score-1.003]

8 Hence, the belief serves as a sufficient statistic for fully summarizing histories, and the policy can be equivalently defined as a mapping π : ∆ → A, where π(b) = a specifies action a to be selected at the current belief b by the policy π. [sent-92, score-1.052]
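
As a concrete illustration of the belief update behind this sufficient statistic, the following minimal Python sketch applies Bayes' rule after taking action a and observing z. The array conventions T[s, a, s'] and O[s', a, z] are assumptions chosen for illustration, not notation taken from the paper.

import numpy as np

def belief_update(b, a, z, T, O):
    # b: current belief over states; T[s, a, s2]: transition model; O[s2, a, z]: observation model.
    # Unnormalized posterior over next states s2, then renormalize.
    b_next = O[:, a, z] * (T[:, a, :].T @ b)
    return b_next / b_next.sum()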

9 We can also define a Q-function for an FSC policy π: Qπ(⟨n, s⟩, ⟨a, os⟩) = R(s, a) + γ ∑_{n′, s′} T a,os(⟨n, s⟩, ⟨n′, s′⟩) V π(n′, s′), which is the expected discounted return of choosing action a at node n and moving to node os(z) upon observation z, and then following policy π. [sent-102, score-1.186]
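
Because the value of an FSC policy satisfies a linear fixed-point equation over node-state pairs, it can be computed exactly with one linear solve. The sketch below is an illustrative implementation under assumed conventions (T[s, a, s'], O[s', a, z], psi[n] the action at node n, eta[n, z] the successor node), not the authors' code; note that the resulting V π is linear in the reward R, which is what the linear constraints discussed later rely on.

import numpy as np

def fsc_value(T, O, R, gamma, psi, eta):
    # V(n, s) = R(s, psi[n]) + gamma * sum_{s2, z} T(s, psi[n], s2) O(s2, psi[n], z) V(eta[n, z], s2)
    S = T.shape[0]
    N, Z = eta.shape
    idx = lambda n, s: n * S + s
    A_mat = np.eye(N * S)
    b_vec = np.zeros(N * S)
    for n in range(N):
        a = psi[n]
        for s in range(S):
            b_vec[idx(n, s)] = R[s, a]
            for s2 in range(S):
                for z in range(Z):
                    A_mat[idx(n, s), idx(eta[n, z], s2)] -= gamma * T[s, a, s2] * O[s2, a, z]
    return np.linalg.solve(A_mat, b_vec).reshape(N, S)   # V[n, s]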

10 The IRL problem in completely observable Markovian environments is denoted by IRL for MDP\R, which is formally stated as follows: Given an MDP\R ⟨S, A, T, γ⟩ and an expert’s policy πE , find the reward function R that makes πE an optimal policy for the given MDP. [sent-109, score-1.509]

11 The problem can be categorized into two cases: The first case is when an expert’s policy is explicitly given and the second case is when an expert’s policy is implicitly given by its trajectories. [sent-110, score-0.878]

12 Ng and Russell (2000) present a necessary and sufficient condition for the reward function R of an MDP to guarantee the optimality of πE : QπE (s, πE (s)) ≥ QπE (s, a), ∀s ∈ S, ∀a ∈ A, (8) which states that deviating from the expert’s policy should not yield a higher value. [sent-113, score-1.01]
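
The condition in Equation (8) can be checked mechanically for a tabular MDP once Q πE has been computed under a candidate reward. The sketch below is a hedged illustration (array conventions T[s, a, s'] and R[s, a], and an integer-array policy pi_E, are assumptions), not the authors' implementation.

import numpy as np

def satisfies_eq8(T, R, gamma, pi_E, tol=1e-8):
    # pi_E: deterministic expert policy as an integer array, pi_E[s] = action taken in state s.
    S = T.shape[0]
    T_pi = T[np.arange(S), pi_E]                            # |S| x |S| transitions under pi_E
    R_pi = R[np.arange(S), pi_E]                            # |S| rewards under pi_E
    V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)     # V^{pi_E}
    Q = R + gamma * T @ V                                   # Q^{pi_E}(s, a), shape |S| x |A|
    # Deviating from the expert's action must not yield a higher value.
    return np.all(Q[np.arange(S), pi_E][:, None] >= Q - tol)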

13 Then the policy π is optimal if and only if the reward function R satisfies Rπ − Ra + γ(T π − T a )(I − γT π )−1 Rπ ⪰ 0, ∀a ∈ A, (9) where the matrix notations and the matrix operator ⪰ (elementwise ≥) are defined as follows: • T π is a |S| × |S| matrix with (s, s′ ) element being T (s, π(s), s′ ). [sent-115, score-0.943]

14 Equation (9) bounds the feasible space of the reward functions that guarantee the optimality of the expert’s policy, and there exist infinitely many reward functions that satisfy Equation (9). [sent-126, score-1.015]

15 Thus, given the expert’s policy πE , which is assumed to be optimal, the reward function is found by solving the optimization problem in Table 1, where λ is an adjustable weight for the penalty of having too many non-zero entries in the reward function. [sent-128, score-1.421]

16 The objective is to maximize the sum of the margins between the expert’s policy and all other policies that deviate a single step from the expert’s policy, in the hope that the expert’s policy is optimal while favoring sparseness in the reward function. [sent-129, score-1.511]
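
To make the structure of this optimization concrete, here is a hedged sketch in the spirit of the Table 1 problem for the MDP case, written with the generic convex-optimization library cvxpy and assuming a state-action reward r(s, a); it is an illustrative reconstruction, not the authors' code.

import numpy as np
import cvxpy as cp

def irl_lp(T, gamma, pi_E, lam=1.0, r_max=1.0):
    # Max-margin IRL with an L1 sparsity penalty; pi_E is an integer-array expert policy.
    S, A, _ = T.shape
    n = S * A
    E_pi = np.zeros((S, n))                        # selects the expert's (s, pi_E[s]) entries of r
    for s in range(S):
        E_pi[s, s * A + pi_E[s]] = 1.0
    T_pi = T[np.arange(S), pi_E]                   # |S| x |S|
    M = np.linalg.inv(np.eye(S) - gamma * T_pi)    # (I - gamma T^{pi_E})^{-1}
    T_flat = T.reshape(n, S)                       # row s*A+a holds T(s, a, .)
    Q_mat = np.eye(n) + gamma * T_flat @ M @ E_pi  # Q^{pi_E} = Q_mat @ r (linear in the reward)
    r = cp.Variable(n)
    constraints = [cp.abs(r) <= r_max]
    margins = []
    for s in range(S):
        q_exp = Q_mat[s * A + pi_E[s]] @ r
        others = [Q_mat[s * A + a] @ r for a in range(A) if a != pi_E[s]]
        constraints += [q_exp >= q for q in others]                 # Equation (8)
        margins.append(cp.min(cp.hstack([q_exp - q for q in others])))
    objective = cp.Maximize(cp.sum(cp.hstack(margins)) - lam * cp.norm(r, 1))
    cp.Problem(objective, constraints).solve()
    return r.value.reshape(S, A)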

17 2 IRL for MDP\R from Sampled Trajectories In some cases, we have to assume that the expert’s policy is not explicitly given but instead the trajectories of the expert’s policy in the state and action spaces are available. [sent-131, score-1.084]

18 Ideally, the true reward function R should yield V πE (s0 ) ≥ V π (s0 ) for ∀π ∈ Π since the expert’s policy πE is assumed to be an optimal policy with respect to R. [sent-145, score-1.382]

19 The algorithm then computes a new policy πk+1 that maximizes the value function under the new reward function, and adds πk+1 to Π. [sent-151, score-0.931]

20 The value of a policy π can be written using the feature expectation µ(π) for the reward 3. [sent-158, score-0.917]

21 Since the expert’s policy is not explicitly given, the feature expectation of the expert’s policy cannot be exactly computed. [sent-171, score-0.878]
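
When only trajectories are available, the expert's feature expectation is typically replaced by an empirical, discounted average over the demonstrated state-action pairs, in the style of Abbeel and Ng (2004). The sketch below assumes a linear reward R(s, a) = α · φ(s, a) and a simple trajectory format; both are illustrative assumptions.

import numpy as np

def empirical_feature_expectation(trajectories, phi, gamma):
    # trajectories: list of trajectories, each a list of (state, action) pairs.
    # phi(s, a): feature vector of the state-action pair.
    mu = 0.0
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            mu = mu + (gamma ** t) * np.asarray(phi(s, a), dtype=float)
    return mu / len(trajectories)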

22 IRL in Partially Observable Environments We denote the problem of IRL in partially observable environments as IRL for POMDP\R and the objective is to determine the reward function that the expert is optimizing. [sent-182, score-0.962]

23 Formally, IRL for POMDP\R is defined as follows: Given a POMDP\R ⟨S, A, Z, T, O, b0 , γ⟩ and an expert’s policy πE , find the reward function R that makes πE an optimal policy for the given POMDP. [sent-183, score-1.382]

24 Hence, the reward function found by IRL for POMDP\R should guarantee the optimality of the expert’s policy for the given POMDP. [sent-184, score-0.964]

25 Also, given an optimal policy for a reward function, we can find some other reward function that yields the same optimal policy without any modification to the environment by the technique of reward shaping (Ng et al., 1999). [sent-192, score-2.37]

26 As suggested by Ng and Russell (2000), we can guarantee the optimality of the expert’s policy by comparing the value of the expert’s policy with that of all possible policies. [sent-194, score-0.911]

27 However, there are an infinite number of policies in a finite POMDP, since a policy in a POMDP is defined as a mapping from a continuous belief space to a finite action space or represented by an FSC policy that might have an infinite number of nodes. [sent-195, score-1.084]

28 For example, when building dialogue management systems, we may already have a dialogue policy engineered by human experts, but we still do not know the reward function that produces the expert’s policy. [sent-205, score-0.991]

29 We assume that the expert’s policy is represented in the form of an FSC, since the FSC is one of the most natural ways to specify a policy in POMDPs. [sent-214, score-0.878]

30 We propose three conditions for the reward function to guarantee the optimality of the expert’s policy based on comparing Q-functions and using the generalized Howard’s policy improvement theorem (Howard, 1960) and the witness theorem (Kaelbling et al., 1998). [sent-215, score-1.471]

31 1 Q-function Based Approach We could derive a simple and naive condition for the optimality of the expert’s policy by comparing the value of the expert’s policy with those of all other policies. [sent-219, score-0.911]

32 Since V πE and V π are linear in terms of the reward function R by Equations (5) and (7), the above inequality yields the set of linear constraints that defines the feasible region of the reward functions that guarantees the expert’s policy to be optimal. [sent-221, score-1.422]

33 The above condition states that any policy that deviates one step from the expert’s action and observation strategies should not have a higher value than the expert’s policy does. [sent-234, score-0.985]

34 This idea comes from the generalized Howard’s policy improvement theorem (Howard, 1960): Theorem 2 [Hansen, 1998] If an FSC policy is not optimal, the DP update transforms it into an FSC policy with a value function that is as good or better for every belief state and better for some belief state. [sent-251, score-1.479]

35 For some node n in π, if V new (nnew , s) ≥ V π (n, s), for ∀s ∈ S, the value of the original policy π will not be greater than that of the policy transformed by discarding node n and redirecting all the incoming edges of node n to node nnew . [sent-255, score-1.403]

36 Proof We build a new policy πk that follows the original policy π, but executes the action and observation strategies of nnew for the first k times that node n is visited. [sent-256, score-1.271]

37 For the inductive step, we abuse notation to denote Rπk (st , at ) as the reward at the t-th time step by following the policy πk and starting from belief b and node n. [sent-260, score-1.047]

38 Thus, if the policy is not optimal, the DP update must generate some non-duplicate nodes that change the policy and improve the values for some belief state. [sent-275, score-0.989]

39 The value function of policy πE should satisfy V πE (n, b) ≥ V new (nnew , b), ∀b ∈ Bn , ∀nnew ∈ Nnew (16) for every node n ∈ N if the expert’s policy πE is optimal. [sent-284, score-0.972]
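
Given value tables for the expert's nodes and for the nodes generated by the DP update, the constraint in Equation (16) reduces to simple dot products against the reachable beliefs. A minimal illustrative check follows; the value-table and belief-set formats are assumed inputs, not the paper's data structures.

import numpy as np

def satisfies_eq16(V_expert, V_new, reachable_beliefs, tol=1e-8):
    # V_expert[n]: |S|-vector of values of expert node n; V_new: list of |S|-vectors for
    # the newly generated nodes n_new; reachable_beliefs[n]: list of beliefs b in B_n.
    for n, beliefs in reachable_beliefs.items():
        for b in beliefs:
            v_e = np.dot(b, V_expert[n])
            if any(np.dot(b, v) > v_e + tol for v in V_new):
                return False
    return True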

40 We will exploit the witness theorem to find a set of useful nodes that yield the feasible region for the true reward function as the witness algorithm incrementally generates new policy trees that improve the current policy trees. [sent-296, score-1.549]

41 4 Optimization Problem In the previous sections, we suggested three constraints for the reward function that stem from the optimality of the expert’s policy, but infinitely many reward functions can satisfy the constraints in Equations (14) and (16). [sent-340, score-1.016]

42 As in Ng and Russell (2000), we prefer a reward function that maximizes the sum of the margins between the expert’s policy and other policies. [sent-342, score-0.931]

43 At the same time, we want the reward function to be as sparse as possible, which can be accomplished by adjusting the penalty weight on the L1-norm of the reward function. [sent-343, score-0.982]

44 When using the DP update or the witness theorem based optimality constraint, that is, Equation (16), the policies other than the expert’s policy are captured in newly generated nodes nnew , hence the optimization problem now becomes the one given in Table 3. [sent-345, score-0.948]

45 Since all the inequalities and the objective functions in the optimization problems are linear in terms of the reward function, the desired reward function can be found efficiently by solving the linear programming problems. [sent-346, score-0.983]

46 The algorithms share the same framework that iteratively repeats estimating the parameter of the reward function using an IRL algorithm and computing an optimal policy for the estimated reward function using a POMDP solver. [sent-370, score-1.435]
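
The shared framework can be summarized as a simple alternation. The sketch below is only schematic: the POMDP solver and the reward-fitting step are passed in as callables and stand in for the components described in the text, not for any specific implementation.

def iterative_irl(fit_reward, solve_pomdp, alpha0, n_iters):
    # fit_reward(policies) -> new reward parameters alpha (the LP/QCP step of the IRL algorithm)
    # solve_pomdp(alpha)   -> an (approximately) optimal FSC policy for the current reward
    alpha, policies = alpha0, []
    for _ in range(n_iters):
        policies.append(solve_pomdp(alpha))
        alpha = fit_reward(policies)
    return alpha, policies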

47 The first algorithm finds the reward function that maximizes the margin between the values of the expert’s policy and other policies for the sampled beliefs using LP. [sent-371, score-1.134]

48 The second algorithm computes the reward function that maximizes the margin between the feature expectations of the expert’s policy and other policies using QCP. [sent-373, score-1.074]

49 1 Max-Margin between Values (MMV) Method We first evaluate the values of the expert’s policy and other policies for the weight vector α of a reward function in order to compare their values. [sent-383, score-1.057]

50 It iteratively tries to find a reward function parameterized by α that maximizes the sum of the margins between the value V̂ πE of the expert’s policy and the value V π of each FSC policy π ∈ Π found so far by the algorithm at all the unique beliefs b ∈ BπE in the trajectories. [sent-390, score-1.427]

51 In addition, we prefer a sparse reward function, and the sparsity of the learned reward function can be achieved by tuning the penalty weight λ. [sent-398, score-1.024]

52 In order to compute the feature expectation µ(π) exactly, we define the occupancy distribution occπ (s, n) of the policy π that represents the relative frequency of visiting state s at node n when following the policy π = ⟨ψ, η⟩ and starting from belief b0 and node n0 . [sent-404, score-1.118]
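
The occupancy distribution also satisfies a linear fixed-point equation (it is the adjoint of the value recursion sketched earlier), so it too can be obtained from a single linear solve, and the feature expectation then follows as a weighted sum of features. The array conventions below (T[s, a, s'], O[s', a, z], psi, eta, initial node n0 = 0) are assumptions for illustration.

import numpy as np

def occupancy(T, O, gamma, psi, eta, b0, n0=0):
    # occ(n2, s2) = b0(s2)*[n2 = n0] + gamma * sum_{n, s, z: eta[n, z] = n2} occ(n, s) T(s, psi[n], s2) O(s2, psi[n], z)
    S = T.shape[0]
    N, Z = eta.shape
    idx = lambda n, s: n * S + s
    A_mat = np.eye(N * S)
    c = np.zeros(N * S)
    for s in range(S):
        c[idx(n0, s)] = b0[s]
    for n in range(N):
        a = psi[n]
        for s in range(S):
            for s2 in range(S):
                for z in range(Z):
                    A_mat[idx(eta[n, z], s2), idx(n, s)] -= gamma * T[s, a, s2] * O[s2, a, z]
    return np.linalg.solve(A_mat, c).reshape(N, S)     # occ[n, s]

def feature_expectation(occ, phi, psi):
    # mu(pi) = sum_{n, s} occ(n, s) * phi(s, psi[n])
    N, S = occ.shape
    return sum(occ[n, s] * np.asarray(phi(s, psi[n]), dtype=float)
               for n in range(N) for s in range(S))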

53 The above inequalities state that the difference between the expert’s policy πE and any policy π is bounded by the difference between their feature expectations, which is the same result as in Abbeel and Ng (2004). [sent-427, score-0.908]

54 Abbeel and Ng (2004) construct a policy by mixing the policies found by the algorithm in order to find the policy that is as good as the given expert’s policy. [sent-432, score-1.004]

55 Thus, we return the reward function that yields the closest feature expectation to that of the expert’s policy among the intermediate reward functions found by the algorithm. [sent-435, score-1.45]
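
The final selection step can be as simple as a nearest-feature-expectation search over the iterates. A minimal sketch, with the candidate format assumed for illustration:

import numpy as np

def select_reward(candidates, mu_expert):
    # candidates: list of (alpha_i, mu_i) pairs collected over the iterations, where mu_i is
    # the feature expectation of the optimal policy for reward weights alpha_i.
    alpha_best, _ = min(candidates, key=lambda c: np.linalg.norm(mu_expert - c[1]))
    return alpha_best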

56 Similar to Algorithm 5, the algorithm returns the reward function that yields the closest feature expectation to that of the expert’s policy among the intermediate reward functions found by the algorithm. [sent-444, score-1.436]

57 To evaluate the performance of the IRL algorithms, we could naively compare the true reward functions in the original problems to the reward functions found by the algorithms. [sent-496, score-0.982]

58 However, it is not only difficult but also meaningless to simply compare the numerical values of the reward functions, since the reward function represents the relative importance of executing an action in a state. [sent-497, score-1.066]

59 Completely different behaviors may be derived from two reward functions that have a small difference, and an identical optimal policy may be induced by two reward functions that have a large difference. [sent-498, score-1.446]

60 For example, three reward functions in the Tiger problem are presented in Table 5, where R∗ is the true reward function and R1 and R2 are two reward functions chosen for explaining the phenomenon. [sent-499, score-1.474]

61 When the distances are measured by L2 norm, Dist(R, R∗ ) = ‖R∗ − R‖2 = √(∑_{s∈S, a∈A} (R∗ (s, a) − R(s, a))2 ), the reward function R2 is more similar to R∗ than the reward function R1 . [sent-500, score-0.984]

62 V π (b0 ; R∗ ): The value of the optimal policy for each reward function measured on the true reward function. [sent-514, score-1.421]

63 Therefore, we compare the value functions of the optimal policies induced from the true and learned reward functions instead of directly measuring the distance between the reward functions. [sent-523, score-1.148]

64 The performance of the algorithms is evaluated by the differences in the values of the expert’s policy and the optimal policy for the learned reward function. [sent-524, score-1.396]

65 Figure 6: Comparison of the true and learned reward functions and the expert’s policy in the 1d Maze problem. [sent-571, score-0.958]

66 In Table 6, D(R∗ ) = |V πE (b0 ; R∗ ) − V πL (b0 ; R∗ )| is the difference between the values of the expert’s policy πE and the optimal policy πL for the learned reward, which are measured on the true reward function R∗ . [sent-574, score-1.41]
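
Given value tables computed under the true reward (for example with an FSC evaluation like the one sketched earlier), the metric D(R∗) is a single scalar. An illustrative computation, with the value-table format assumed:

import numpy as np

def value_gap(V_expert, V_learned, b0, n0=0):
    # D(R*) = |V^{pi_E}(n0, b0; R*) - V^{pi_L}(n0, b0; R*)|, where V[n] is an |S|-vector of node values.
    return abs(float(b0 @ V_expert[n0]) - float(b0 @ V_learned[n0]))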

67 The differences measured on the true reward function in the Tiger, 1d Maze, 5 × 5 Grid World, and Heaven/Hell are zero, meaning that the learned reward function generated a policy whose performance is the same as that of the expert’s policy. [sent-576, score-1.451]

68 However, our algorithms failed to find the reward that generates a policy that is optimal on the true reward in the Rock Sample[4,3]. [sent-577, score-1.407]

69 Nevertheless, we can say that the learned reward function RL satisfies the optimality of the expert’s policy πE since the policy πL is an optimal policy on the learned reward function RL and |V πE (b0 ; RL ) −V πL (b0 ; RL )| = 0. [sent-578, score-2.402]

70 The learned reward functions are compared to the true reward functions for the Tiger, 1d Maze, 5 × 5 Grid World, and Heaven/Hell problems, but the reward function in the Rock Sample[4,3] problem is omitted since it has too many elements to present. [sent-581, score-1.502]

71 Since our methods favor sparse reward functions, there is some degree of difference between the true and the learned reward functions, most notably for the listen action, where our methods assign a zero reward instead of -1 as in the true reward. [sent-585, score-1.478]

72 It is close to the true reward function and produces the optimal policy whose value is equal to the value of the expert’s policy when measured on the true reward function. [sent-589, score-1.86]

73 For the 1d Maze problem, the learned reward function is compared to the true reward function in the left panel of Figure 6 and the expert’s policy is presented in the right panel of Figure 6. [sent-590, score-1.485]

74 Figure 7: Comparison of the true and the learned reward functions and the expert’s policy in the 5 × 5 Grid World problems. [sent-593, score-0.99]

75 This causes the algorithms to assign a positive reward to moving left in the goal state, as in the true reward, but a zero reward to moving right in the goal state, unlike the true reward. [sent-600, score-1.052]

76 Consequently, the algorithms find the reward function that explains the behavior of the expert’s policy, and the optimal policy from the POMDP with respect to the learned reward function is the same as the expert’s policy. [sent-601, score-1.463]

77 The learned reward function is presented with the true reward function in the left panel of Figure 7. [sent-605, score-1.029]

78 The learned reward is presented in Figure 8 where the agent gets a +1 reward when moving in the direction of the arrow in each state. [sent-611, score-1.093]

79 As shown in Table 6, the learned reward function in the Heaven/Hell problem also yields the policy whose value is equal to that of the expert’s policy. [sent-613, score-0.959]

80 As in the previous section, we compare V πL (b0 ; R∗ ) at each iteration, which is the value of the policy πL from the learned reward function RL evaluated on the true reward function R∗ . [sent-640, score-1.451]

81 All the algorithms found the reward function that generates the policy. [sent-642, score-0.931]

82 Table 9: Configuration for each problem and the value of the expert’s policy measured on the true reward function. [sent-665, score-0.917]

83 However, these poor intermediate reward functions will effectively restrict the region of the reward functions for the final result. [sent-751, score-0.996]

84 As noted in the above, in most of the experiments, the algorithms eventually found the policy whose performance is the same as the expert’s, which means the algorithms found the reward function that successfully recovers the expert’s policy. [sent-753, score-0.931]

85 Besides the task of reward learning, IRL has gained interest in apprenticeship learning, where the task is to find the policy with possibly better performance than the one demonstrated by an expert. [sent-777, score-1.014]

86 They presented a sufficient and necessary condition on the reward functions which guarantees the optimality of the expert’s policy, and provided some heuristics to choose a reward function since the degenerate reward functions also satisfy the optimality condition. [sent-784, score-1.54]

87 The algorithm comes with a theoretical guarantee that the learned policy is similar to the expert’s policy when evaluated on the true reward function. [sent-812, score-1.384]

88 They formulated a QP problem to find the weight vector of the reward basis functions that maximizes the margin between the expert’s policy and all other policies. [sent-821, score-0.962]

89 The indirect method finds the policy using the learned reward function from IRL. [sent-825, score-0.959]

90 Since the loss functions are defined on the policy space, the algorithm uses natural gradients to map the gradients in the policy space to those in the weight vector space of reward functions. [sent-826, score-1.369]

91 This was achieved in a game-theoretic framework using a two-person zero-sum game, where the learner selects a policy that maximizes its performance relative to the expert’s and the environment adversarially selects a reward function that minimizes the performance of the learned policy. [sent-828, score-0.979]

92 Each algorithm was characterized by the distance function that measures the difference between the expert’s behavior data and the policy from the learned reward function, and the update step that computes new parameter values for the reward function. [sent-842, score-1.455]

93 First, we derived the constraints of the reward function to guarantee the optimality of the expert’s policy and built optimization problems to solve IRL for POMDP\R when the expert’s policy is explicitly given. [sent-849, score-1.403]

94 Thus, it is crucial to find a sufficient condition that can be efficiently computed in order to restrict the feasible region of the reward functions tightly so that the optimization problems can find the reward function that guarantees the optimality of the given expert’s policy. [sent-862, score-1.016]

95 In other words, the expert’s policy is another optimal policy for the learned reward and the learned reward still satisfies the optimality condition of the expert’s policy. [sent-865, score-1.935]

96 However, the optimal policy for the learned reward does not achieve the same value as the expert’s policy when the value is evaluated on the true reward. [sent-866, score-1.396]

97 It prefers the reward function that maximizes the sum of the differences between the value of the expert’s policy and the other policies while forcing the reward function to be as sparse as possible. [sent-869, score-1.549]

98 The Bayesian IRL prefers the reward function that induces a high probability of executing the actions in the given behavior data, and the maximum entropy IRL prefers the reward function that maximizes the entropy of the distribution over behaviors while matching the feature expectations. [sent-874, score-1.068]

99 Computing an optimal policy in the intermediate POMDP problem takes a much longer time than solving a usual POMDP problem, since an optimal policy of the intermediate POMDP problem is often complex due to the complex reward structure. [sent-884, score-1.408]

100 Policy invariance under reward transformations: Theory and application to reward shaping. [sent-985, score-0.956]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('reward', 0.478), ('policy', 0.439), ('irl', 0.428), ('expert', 0.297), ('pomdp', 0.263), ('nnew', 0.205), ('policies', 0.126), ('trajectories', 0.102), ('mdp', 0.101), ('apprenticeship', 0.097), ('fsc', 0.095), ('observable', 0.095), ('agent', 0.091), ('node', 0.08), ('rock', 0.077), ('action', 0.074), ('bservable', 0.074), ('nvironments', 0.074), ('witness', 0.068), ('mmv', 0.066), ('tiger', 0.065), ('russell', 0.062), ('ng', 0.058), ('beliefs', 0.057), ('hoi', 0.054), ('maze', 0.053), ('belief', 0.05), ('actions', 0.049), ('abbeel', 0.047), ('mmfe', 0.046), ('partially', 0.046), ('nodes', 0.043), ('prj', 0.043), ('ua', 0.041), ('dp', 0.038), ('rl', 0.035), ('im', 0.035), ('reinforcement', 0.034), ('optimality', 0.033), ('basis', 0.032), ('environments', 0.032), ('heaven', 0.031), ('pomdps', 0.03), ('dialogue', 0.03), ('state', 0.03), ('os', 0.028), ('learned', 0.028), ('deviating', 0.027), ('reachable', 0.024), ('howard', 0.024), ('sa', 0.023), ('atm', 0.023), ('occ', 0.023), ('rover', 0.023), ('rocks', 0.023), ('executing', 0.022), ('kaelbling', 0.022), ('st', 0.021), ('move', 0.021), ('executed', 0.021), ('sampled', 0.02), ('executes', 0.02), ('rmax', 0.02), ('btk', 0.02), ('neu', 0.02), ('syed', 0.02), ('environment', 0.02), ('btm', 0.019), ('ramachandran', 0.019), ('successor', 0.019), ('ziebart', 0.019), ('states', 0.019), ('bn', 0.019), ('update', 0.018), ('moving', 0.018), ('grid', 0.018), ('expectations', 0.017), ('panel', 0.017), ('deviate', 0.017), ('bm', 0.017), ('listen', 0.016), ('spaan', 0.016), ('maxiter', 0.016), ('east', 0.016), ('south', 0.016), ('newly', 0.016), ('qcp', 0.015), ('world', 0.015), ('tk', 0.015), ('planning', 0.014), ('observation', 0.014), ('intermediate', 0.014), ('return', 0.014), ('function', 0.014), ('trajectory', 0.014), ('korea', 0.013), ('functions', 0.013), ('behaviors', 0.013), ('penalty', 0.012), ('optimal', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.000001 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments

Author: Jaedeug Choi, Kee-Eung Kim

Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert’s behavior, namely the case in which the expert’s policy is explicitly given, and the case in which the expert’s trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming

2 0.25143251 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

Author: Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann

Abstract: Bayesian learning methods have recently been shown to provide an elegant solution to the explorationexploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent’s return improve as a function of experience. Keywords: processes reinforcement learning, Bayesian inference, partially observable Markov decision

3 0.22690473 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

Author: Marek Petrik, Shlomo Zilberstein

Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes

4 0.13709363 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling

Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier

Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics

5 0.12667267 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering

Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of modelfree methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, bestmatch learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse modelbased methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse

6 0.052442446 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

7 0.046799172 104 jmlr-2011-X-Armed Bandits

8 0.04243096 36 jmlr-2011-Generalized TD Learning

9 0.038939733 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

10 0.034427721 54 jmlr-2011-Learning Latent Tree Graphical Models

11 0.02680237 41 jmlr-2011-Improved Moves for Truncated Convex Models

12 0.022982555 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure

13 0.021813238 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

14 0.02022714 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

15 0.019517004 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

16 0.018626392 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

17 0.0170845 83 jmlr-2011-Scikit-learn: Machine Learning in Python

18 0.017008398 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

19 0.016423181 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

20 0.015874133 14 jmlr-2011-Better Algorithms for Benign Bandits


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.125), (1, 0.104), (2, -0.236), (3, 0.076), (4, -0.088), (5, 0.461), (6, -0.356), (7, -0.105), (8, 0.079), (9, -0.036), (10, 0.008), (11, 0.038), (12, -0.033), (13, -0.053), (14, -0.028), (15, -0.026), (16, -0.025), (17, 0.056), (18, -0.029), (19, -0.003), (20, -0.084), (21, 0.031), (22, -0.002), (23, 0.103), (24, -0.115), (25, 0.02), (26, 0.085), (27, 0.077), (28, -0.064), (29, 0.008), (30, -0.064), (31, 0.013), (32, -0.002), (33, 0.13), (34, -0.009), (35, -0.069), (36, -0.051), (37, -0.043), (38, -0.03), (39, -0.03), (40, -0.049), (41, -0.002), (42, 0.055), (43, -0.005), (44, 0.049), (45, 0.029), (46, -0.046), (47, 0.016), (48, -0.04), (49, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98629665 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments

Author: Jaedeug Choi, Kee-Eung Kim

Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert’s behavior, namely the case in which the expert’s policy is explicitly given, and the case in which the expert’s trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming

2 0.85270315 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

Author: Marek Petrik, Shlomo Zilberstein

Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes

3 0.58234972 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

Author: Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann

Abstract: Bayesian learning methods have recently been shown to provide an elegant solution to the explorationexploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent’s return improve as a function of experience. Keywords: processes reinforcement learning, Bayesian inference, partially observable Markov decision

4 0.50688124 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering

Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of modelfree methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, bestmatch learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse modelbased methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse

5 0.50601035 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling

Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier

Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics

6 0.19241399 104 jmlr-2011-X-Armed Bandits

7 0.18328065 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

8 0.14329749 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

9 0.13539843 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

10 0.12869315 36 jmlr-2011-Generalized TD Learning

11 0.1113361 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals

12 0.10809004 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

13 0.10710271 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

14 0.10562296 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure

15 0.10266617 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

16 0.10133012 54 jmlr-2011-Learning Latent Tree Graphical Models

17 0.094952025 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

18 0.083913073 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

19 0.082780153 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes

20 0.078940645 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.051), (9, 0.022), (10, 0.037), (24, 0.038), (31, 0.045), (32, 0.021), (41, 0.035), (60, 0.01), (71, 0.012), (73, 0.028), (76, 0.026), (78, 0.076), (85, 0.455), (90, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77907026 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments

Author: Jaedeug Choi, Kee-Eung Kim

Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert’s behavior, namely the case in which the expert’s policy is explicitly given, and the case in which the expert’s trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming

2 0.30254233 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

Author: Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann

Abstract: Bayesian learning methods have recently been shown to provide an elegant solution to the explorationexploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent’s return improve as a function of experience. Keywords: processes reinforcement learning, Bayesian inference, partially observable Markov decision

3 0.24919267 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

4 0.24471839 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory

5 0.24304584 59 jmlr-2011-Learning with Structured Sparsity

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

6 0.24292012 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

7 0.24278909 104 jmlr-2011-X-Armed Bandits

8 0.24133803 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

9 0.24052571 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

10 0.239866 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

11 0.23979297 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

12 0.23817781 16 jmlr-2011-Clustering Algorithms for Chains

13 0.2372608 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

14 0.23713014 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

15 0.23711728 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

16 0.23630902 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

17 0.23582691 12 jmlr-2011-Bayesian Co-Training

18 0.23560691 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

19 0.23549783 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

20 0.23534498 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms