jmlr jmlr2011 jmlr2011-81 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes
Reference: text
sentIndex sentText sentNum sentScore
1 We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. [sent-10, score-0.546]
2 The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. [sent-11, score-0.523]
3 We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. [sent-13, score-0.488]
4 The goal of approximate methods is to compute a policy that minimizes the policy loss—the difference between the returns of the computed policy and an optimal one. [sent-21, score-1.316]
5 The policy may be represented, for example, as a finite-state controller (Stanley and Miikkulainen, 2004) or as a greedy policy with respect to an approximate value function (Szita and Lorincz, 2006). [sent-24, score-0.967]
6 Our goal is to develop a more reliable method that is guaranteed to minimize bounds on the policy loss in various settings. [sent-45, score-0.516]
7 This paper presents new formulations for value function approximation that provably minimize bounds on policy loss using a global optimization framework; we consider both L∞ and weighted L1 error bounds. [sent-46, score-0.604]
8 To minimize the policy loss, we derive new bounds based on approximate value functions. [sent-47, score-0.578]
9 Then, in Section 4, we describe the proposed approximate bilinear programming (ABP) formulations. [sent-53, score-0.48]
10 A drawback of the bilinear formulation is that solving bilinear programs may require exponential time. [sent-55, score-0.886]
11 Section 7 shows that ABP is related to other approximate dynamic programming methods, such as approximate linear programming and policy iteration. [sent-60, score-0.637]
12 Definition 2 A deterministic stationary policy π : S → A assigns an action to each state of the Markov decision process. [sent-76, score-0.515]
13 A policy π ∈ Π together with the transition matrix induces a distribution over the state space S in every time step, resulting in random variables St for t = 0, 1, 2, . . . [sent-82, score-0.488]
14 Our objective is then maxπ∈Π ρ(π, α), for which the optimal solution is some deterministic policy π⋆ . [sent-87, score-0.484]
15 The transition matrix and reward function for a deterministic policy π are defined as: Pπ : (s, s′ ) → P(s, π(s), s′ ) and rπ : s → r(s, π(s)) . [sent-88, score-0.504]
16 A policy π is greedy with respect to a value function v when Lπ v = Lv. [sent-96, score-0.5]
17 In addition, a policy π induces a state occupancy frequency uπ : S → R, defined as follows: uπ = (I − γPπ^T)^{−1} α. [sent-108, score-0.508]
18 The return of a policy depends on the state-action occupancy frequencies and α^T vπ = rπ^T uπ. [sent-110, score-0.49]
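These quantities are easy to compute explicitly for a small MDP. The sketch below (plain NumPy on a hypothetical two-state, two-action MDP with γ = 0.9) builds Pπ and rπ for a fixed deterministic policy, solves for vπ and uπ, and verifies the identity α^T vπ = rπ^T uπ.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'] and r[s, a].
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # transitions under action 0
              [[0.1, 0.9], [0.6, 0.4]]])  # transitions under action 1
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                # r[s, a]
alpha = np.array([0.5, 0.5])              # initial state distribution

pi = np.array([0, 1])                     # deterministic policy: state -> action

# Transition matrix and reward vector induced by pi.
P_pi = np.array([P[pi[s], s, :] for s in range(2)])
r_pi = np.array([r[s, pi[s]] for s in range(2)])

# Value function and state occupancy frequency of pi.
I = np.eye(2)
v_pi = np.linalg.solve(I - gamma * P_pi, r_pi)        # v_pi = (I - gamma P_pi)^{-1} r_pi
u_pi = np.linalg.solve((I - gamma * P_pi).T, alpha)   # u_pi = (I - gamma P_pi^T)^{-1} alpha

# Return of the policy: alpha^T v_pi = r_pi^T u_pi.
assert np.isclose(alpha @ v_pi, r_pi @ u_pi)
print(alpha @ v_pi)
```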
19 a∈A To formulate approximate linear and bilinear programs, it is necessary to restrict the value functions so that their Bellman residuals are non-negative (or at least bounded from below). [sent-118, score-0.491]
20 In addition, if π is a greedy policy with respect to v it is also greedy with respect to v + c1. [sent-131, score-0.521]
21 The actual solution of the MDP is then the greedy policy π with respect to this value function ṽ. [sent-144, score-0.523]
22 The expected policy loss σe of π is defined as: σe(π) = ρ(π⋆, α) − ρ(π, α) = ||v⋆ − vπ||_{1,α} = α^T v⋆ − α^T vπ, where ||x||_{1,c} denotes the weighted L1 norm ||x||_{1,c} = ∑i |c(i)x(i)|. The robust policy loss σr of π is defined as: σr(π) = max_{α≥0, 1^T α=1} ||v⋆ − vπ||_{1,α}. [sent-147, score-0.928]
23 The expected policy loss captures the total loss of discounted reward when following the policy π instead of the optimal policy, given the initial state distribution. [sent-149, score-0.938]
24 The robust policy loss ignores the initial distribution and, in some sense, measures the difference for the worst-case initial distribution. [sent-150, score-0.49]
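Both losses are simple to evaluate once v⋆ and vπ are available; the sketch below (a hypothetical toy MDP, not from the paper) obtains v⋆ by value iteration and vπ by policy evaluation, then computes σe and σr.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])   # P[a, s, s'] of a toy MDP
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]
alpha = np.array([0.5, 0.5])               # initial state distribution

def q_values(v):
    # Q[s, a] = r(s, a) + gamma * sum_s' P(s, a, s') v(s')
    return r + gamma * np.einsum('ast,t->sa', P, v)

v_star = np.zeros(2)
for _ in range(2000):                      # value iteration: v <- Lv
    v_star = q_values(v_star).max(axis=1)

pi = np.array([0, 0])                      # some fixed (generally sub-optimal) policy
P_pi = np.array([P[pi[s], s, :] for s in range(2)])
r_pi = np.array([r[s, pi[s]] for s in range(2)])
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

sigma_e = alpha @ (v_star - v_pi)          # expected policy loss ||v* - v_pi||_{1,alpha}
sigma_r = np.max(v_star - v_pi)            # robust loss: worst-case initial distribution
print(sigma_e, sigma_r)
```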
25 The goal of value function approximation is not simply to obtain a good value function ṽ but a policy with a small policy loss. [sent-176, score-0.923]
26 Unfortunately, the policy loss of a greedy policy, as formulated in Theorem 9, depends non-trivially on the approximate value function v. [sent-177, score-0.579]
27 The following theorem states the most common bound on the robust policy loss. [sent-179, score-0.537]
28 Theorem 11 [Robust Policy Loss, Williams and Baird, 1994] Let π be a greedy policy with respect to a value function v. [sent-180, score-0.5]
29 We propose methods that minimize both the standard bounds in Theorem 11 and new tighter bounds on the expected policy loss in Theorem 12. [sent-184, score-0.57]
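Theorem 11 is not reproduced in full in this extract; the check below uses the standard Williams–Baird-style form ||v⋆ − vπ||∞ ≤ (2/(1−γ)) ||ṽ − Lṽ||∞ for a policy π greedy with respect to ṽ, which holds independently of the exact statement here. The approximate value function ṽ and the toy MDP are arbitrary choices for illustration.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])   # P[a, s, s'] of a toy MDP
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]

def q_values(v):
    return r + gamma * np.einsum('ast,t->sa', P, v)

def optimal_value(iters=2000):
    v = np.zeros(2)
    for _ in range(iters):
        v = q_values(v).max(axis=1)
    return v

def policy_value(pi):
    P_pi = np.array([P[pi[s], s, :] for s in range(2)])
    r_pi = np.array([r[s, pi[s]] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

v_tilde = np.array([3.0, 8.0])             # an arbitrary approximate value function
pi = q_values(v_tilde).argmax(axis=1)      # greedy policy with respect to v_tilde

residual = np.max(np.abs(v_tilde - q_values(v_tilde).max(axis=1)))   # ||v - Lv||_inf
loss = np.max(np.abs(optimal_value() - policy_value(pi)))            # ||v* - v_pi||_inf
assert loss <= 2.0 / (1.0 - gamma) * residual + 1e-6
```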
30 Theorem 12 [Expected Policy Loss] Let π ∈ Π be a greedy policy with respect to a value function v ∈ V and let the state occupancy frequencies of π be bounded as u ≤ uπ ≤ ū. [sent-187, score-0.614]
31 The optimal value function v⋆ is independent of the approximate value function ṽ, and the greedy policy π depends only on ṽ. [sent-195, score-0.587]
32 Algorithm 1: Approximate policy iteration, where Z(π) denotes a custom value function approximation for the policy π. [sent-196, score-0.89]
33 1), which state that for each ṽ ∈ K and a greedy policy π: ||v⋆ − vπ||_{1,α} ≤ (1/(1−γ)) ||v⋆ − ṽ||_{1,(1−γ)uπ}. [sent-198, score-0.504]
34 Hence, they can be categorized as approximate value iteration, approximate policy iteration, and approximate linear programming. [sent-206, score-0.608]
35 Below, we only discuss approximate policy iteration and approximate linear programming, because they are the methods most closely related to our approach. [sent-208, score-0.521]
36 The function Z (π) denotes the specific method used to approximate the value function for the policy π. [sent-210, score-0.5]
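Algorithm 1 itself is not reproduced in this extract, but the generic loop is short: evaluate the current policy with the custom approximator Z, switch to the greedy policy, and stop when the policy is stable. The least-squares choice of Z below is only one hypothetical instantiation; the L2- and L∞-based variants discussed here differ precisely in that step.

```python
import numpy as np

def approximate_policy_iteration(P, r, gamma, Z, iters=50):
    """Generic API loop: approximate the value of the current policy with a
    custom evaluator Z(pi) -> v_tilde, then switch to the greedy policy."""
    n_states = P.shape[1]
    pi = np.zeros(n_states, dtype=int)
    for _ in range(iters):
        v = Z(pi)                                    # approximate evaluation of pi
        q = r + gamma * np.einsum('ast,t->sa', P, v)
        new_pi = q.argmax(axis=1)                    # greedy improvement step
        if np.array_equal(new_pi, pi):               # policy is stable: stop
            break
        pi = new_pi
    return pi

def make_least_squares_Z(P, r, gamma, Phi):
    """One hypothetical Z: evaluate pi exactly, then project onto span(Phi)
    in the (unweighted) L2 sense."""
    def Z(pi):
        S = P.shape[1]
        P_pi = np.array([P[pi[s], s, :] for s in range(S)])
        r_pi = np.array([r[s, pi[s]] for s in range(S)])
        v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        x, *_ = np.linalg.lstsq(Phi, v_pi, rcond=None)
        return Phi @ x
    return Z
```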
37 We discuss the performance of API and how it relates to approximate bilinear programming in more detail in Section 7. [sent-222, score-0.48]
38 The difficulty with the solution of ALP is that it is hard to derive guarantees on the policy loss based on the bounds in terms of the L1 norm; it is possible when the objective function c represents ū, as Theorem 13 shows. [sent-255, score-0.535]
39 Bilinear programs are a generalization of linear programs that allows the objective function to include an additional bilinear term. [sent-262, score-0.541]
40 A separable bilinear program consists of two linear programs with independent constraints and is fairly easy to solve and analyze in comparison to non-separable bilinear programs. [sent-263, score-0.928]
41 The objective of the bilinear program (6) is denoted as f(w, x, y, z). [sent-267, score-0.505]
42 In this paper, we only use separable bilinear programs and refer to them simply as bilinear programs. [sent-269, score-0.864]
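Program (6) is not reproduced in this extract, so the sketch below fixes one plausible separable form: a linear term in each variable group plus a single bilinear coupling x^T C y, with independent constraint sets for (w, x) and (y, z). The point of separability is visible in the objective: fixing one group leaves a linear program in the other, which is what the simple approximate algorithm discussed later exploits.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SeparableBilinearProgram:
    """Objective data for  min  s1@w + r1@x + x@C@y + r2@y + s2@z  over
    independent linear constraint sets on (w, x) and on (y, z); the
    constraint matrices are omitted here for brevity."""
    s1: np.ndarray
    r1: np.ndarray
    C: np.ndarray
    r2: np.ndarray
    s2: np.ndarray

    def objective(self, w, x, y, z):
        # The only non-linear piece is the single coupling term x^T C y.
        return self.s1 @ w + self.r1 @ x + x @ self.C @ y + self.r2 @ y + self.s2 @ z

    def cost_in_first_group(self, y):
        # Fixing (y, z) leaves a linear program in (w, x) with these costs.
        return self.s1, self.r1 + self.C @ y
```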
43 The goal in approximate dynamic programming and value function approximation is to find a policy that is close to optimal. [sent-270, score-0.604]
44 We propose approximate bilinear formulations that minimize the following bounds on robust and expected policy loss. [sent-273, score-1.004]
45 Robust policy loss: Minimize ||v⋆ − vπ||∞ by minimizing the bounds in Theorem 11: min_{π∈Π} ||v⋆ − vπ||∞ ≤ min_{ṽ∈V ∩K} (1/(1−γ)) ||ṽ − Lṽ||∞. [sent-275, score-0.543]
46 Expected policy loss: Minimize ||v⋆ − vπ||_{1,α} by minimizing the bounds in Theorem 12: min_{π∈Π} ||v⋆ − vπ||_{1,α} ≤ α^T v⋆ + min_{ṽ∈V ∩K} ( −α^T ṽ + (1/(1−γ)) ||ṽ − Lṽ||∞ ). [sent-277, score-0.543]
47 While sampling in linear programs results simply in removal of constraints, in approximate bilinear programs it also leads to a reduction in the number of certain variables, as described in Section 6. [sent-284, score-0.602]
48 1 Robust Policy Loss The solution of the robust approximate bilinear program minimizes the L∞ norm of the Bellman residual ||v − Lv||∞ over the set of representable and transitive-feasible value functions. [sent-289, score-0.829]
49 It is important to note that the theorem states that solving the approximate bilinear program is equivalent to minimization over all representable value functions, not only the transitive-feasible ones. [sent-302, score-0.697]
50 Corollary 18 For any optimal solution ṽ of (7), the policy loss of the greedy policy π̃ is bounded by: (2/(1−γ)) min_{v∈V} ||v − Lv||∞. [sent-307, score-0.966]
51 In addition, inequality (8) becomes an equality for any deterministic policy π, and there is a deterministic optimal policy that satisfies equality (9). [sent-313, score-0.882]
52 Then, using the fact that the policy defines an action for every state, we get: f2(v) = min_{π∈Π} ||v − Lπ v||∞ = min_{π∈Π} max_{s∈S} (v − Lπ v)(s) = max_{s∈S} min_{π∈Π} (v − Lπ v)(s) = max_{s∈S} (v − max_{π∈Π} Lπ v)(s) = max_{s∈S} (v − Lv)(s) = ||v − Lv||∞. [sent-319, score-0.526]
53 The existence of an optimal deterministic solution then follows from the existence of a deterministic greedy policy with respect to a value function. [sent-320, score-0.579]
54 We are now ready to show that neither the set of greedy policies considered nor the policy loss bounds are affected by considering only transitive feasible functions in (7). [sent-333, score-0.67]
55 Note that the existence of an optimal deterministic policy in (7) follows from the existence of a deterministic optimal policy in f2. [sent-348, score-0.882]
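For a problem small enough to enumerate, the robust objective can be computed without a bilinear or MILP solver by exploiting exactly this property (an optimal deterministic policy exists, Theorem 17): for each deterministic policy, minimizing the L∞ residual over representable, transitive-feasible value functions is a small linear program. The sketch below does that on a hypothetical two-state MDP with a two-column feature matrix; it is a brute-force stand-in for formulation (7), not the paper's solver.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Toy MDP: P[a, s, s'], r[s, a]; the feature matrix Phi includes a constant
# column so that a transitive-feasible representable value function exists.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0]])
S, A, k = 2, 2, Phi.shape[1]

def lp_for_policy(pi):
    """min_{x, eps} eps  s.t.  Phi x is transitive-feasible and
    (Phi x - L_pi Phi x)(s) <= eps for every state s."""
    A_ub, b_ub = [], []
    for a in range(A):                        # Phi x >= L_a Phi x for every action
        M = Phi - gamma * P[a] @ Phi
        A_ub.append(np.hstack([-M, np.zeros((S, 1))]))
        b_ub.append(-r[:, a])
    P_pi = np.array([P[pi[s], s, :] for s in range(S)])
    r_pi = np.array([r[s, pi[s]] for s in range(S)])
    M = Phi - gamma * P_pi @ Phi              # residual of the fixed policy <= eps
    A_ub.append(np.hstack([M, -np.ones((S, 1))]))
    b_ub.append(r_pi)
    c = np.zeros(k + 1)
    c[-1] = 1.0                               # minimize eps
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * k + [(0, None)])
    return res.fun if res.success else np.inf

# Enumerate deterministic policies (tractable only for tiny problems).
best = min((lp_for_policy(np.array(pi)), pi)
           for pi in itertools.product(range(A), repeat=S))
print("minimal Bellman residual:", best[0], "policy:", best[1])
```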
56 2 Expected Policy Loss This section describes bilinear programs that minimize bounds on the expected policy loss for a given initial distribution α. [sent-351, score-0.968]
57 The expected policy loss can be minimized by solving the following bilinear formulation. [sent-355, score-0.839]
58 Notice that this formulation is identical to the bilinear program (7) with the exception of the term −(1 − γ)α^T v. [sent-359, score-0.48]
59 Corollary 24 There exists an optimal solution π̃ that is greedy with respect to ṽ for which the policy loss is bounded by: ||v⋆ − vπ̃||_{1,α} ≤ min_{ṽ∈V ∩K} [ (1/(1−γ)) ||Lṽ − ṽ||∞ − α^T (ṽ − v⋆) ] ≤ min_{ṽ∈V} [ (1/(1−γ)) ||Lṽ − ṽ||∞ + ||ṽ − v⋆||_{1,α} ]. [sent-363, score-0.591]
60 In addition, (11) holds with an equality for a deterministic policy π, and there is a deterministic optimal policy that satisfies (12). [sent-368, score-0.882]
61 The relaxation of the bilinear program is typically either a linear or semi-definite program (Caprara and Monaci, 2009). [sent-384, score-0.511]
62 The most common and simplest approximate algorithm for solving bilinear programs is Algorithm 2. [sent-389, score-0.524]
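Algorithm 2 is not reproduced in this extract; the sketch below shows the generic iterated-best-response idea it is based on, for a separable program with a single bilinear coupling term (the w and z blocks are dropped for brevity, and the instance is hypothetical). Fixing y makes the problem an LP in x and vice versa, so the objective never increases, but the iteration can stall at a local optimum, which is exactly why the global formulations above matter.

```python
import numpy as np
from scipy.optimize import linprog

def alternating_best_response(r1, C, r2, A1, b1, A2, b2, iters=100):
    """Iterated best response for  min  r1@x + x@C@y + r2@y  over
    {x >= 0 : A1 x <= b1} and {y >= 0 : A2 y <= b2} (assumed feasible
    and bounded).  Fixing y leaves an LP in x with cost r1 + C@y, and
    vice versa, so the objective never increases; in general the limit
    is only a local optimum."""
    y = linprog(r2, A_ub=A2, b_ub=b2).x                # any feasible starting point
    obj_prev = np.inf
    for _ in range(iters):
        x = linprog(r1 + C @ y, A_ub=A1, b_ub=b1).x    # best response in x
        y = linprog(r2 + C.T @ x, A_ub=A2, b_ub=b2).x  # best response in y
        obj = r1 @ x + x @ C @ y + r2 @ y
        if obj >= obj_prev - 1e-9:                     # no further improvement
            break
        obj_prev = obj
    return x, y, obj

# Tiny hypothetical instance with 0 <= x, y <= 1 box constraints.
r1 = np.array([0.0, 1.0])
r2 = np.array([1.0, 0.0])
C = np.array([[1.0, -2.0], [-2.0, 1.0]])
A_box, b_box = np.eye(2), np.ones(2)
print(alternating_best_response(r1, C, r2, A_box, b_box, A_box, b_box))
```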
63 As mentioned above, any separable bilinear program can also be formulated as a mixed integer linear program (Horst and Tuy, 1996). [sent-400, score-0.56]
64 Below, we present a more compact and structured mixed integer linear program formulation, which relies on the property of ABP that there is always a solution with an optimal deterministic policy (see Theorem 17). [sent-402, score-0.548]
65 We only show the formulation of the robust approximate bilinear program (7); the same approach applies to all other formulations that we propose. [sent-403, score-0.61]
66 This mixed integer linear program formulation is much simpler than a general MILP formulation of a bilinear program (Horst and Tuy, 1996). [sent-418, score-0.597]
67 These bounds do not rely on distributions over constraints and transform directly to bounds on the policy loss. [sent-463, score-0.521]
68 Then, the matrices used in the sampled approximate bilinear program (7) are defined as follows for all (si, aj) ∈ Σ̃. [sent-486, score-0.525]
69 The sampled version of the bilinear program (7) is then: min_{π, λ, λ′, x} π^T λ + λ′ s.t. ÃΦx − b̃ ≥ 0, λ + λ′1 ≥ ÃΦx − b̃, B̃π = 1. [sent-498, score-0.509]
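The exact definitions of Ã, b̃, and B̃ are not reproduced in this extract; the sketch below shows one common single-sample construction of the sampled constraint system ÃΦx − b̃ ≥ 0 from transitions (s, a, s′, r), with each sample contributing the row φ(s) − γφ(s′) and the entry r. The paper's construction may differ in how multiple samples per state-action pair are aggregated.

```python
import numpy as np

def sampled_constraint_matrices(samples, phi, gamma):
    """Build a sampled constraint system  A_phi @ x - b >= 0  from
    transitions (s, a, s_next, reward): each sample contributes the row
    phi(s) - gamma * phi(s_next) and the entry b = reward.  This is one
    common single-sample construction and may differ from the paper's."""
    A_phi = np.array([phi(s) - gamma * phi(s_next)
                      for (s, _a, s_next, _rew) in samples])
    b = np.array([rew for (_s, _a, _s_next, rew) in samples])
    return A_phi, b

# Hypothetical usage with two features per (integer) state.
phi = lambda s: np.array([1.0, float(s)])
samples = [(0, 0, 1, 1.0), (1, 1, 0, 2.0), (1, 0, 1, 0.0)]
A_phi, b = sampled_constraint_matrices(samples, phi, gamma=0.9)
# A representable v = Phi x is (approximately) transitive-feasible on the
# sampled state-action pairs when A_phi @ x - b >= 0 componentwise.
print(A_phi, b)
```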
70 Value functions v1 , v2 , v3 correspond to solutions of instances of robust approximate bilinear programs for the given samples. [sent-508, score-0.579]
71 These bounds show that it is possible to bound policy loss due to incomplete samples. [sent-511, score-0.492]
72 As mentioned above, existing bounds on constraint violation in approximate linear programming (de Farias and van Roy, 2004) typically do not easily lead to policy loss bounds. [sent-512, score-0.589]
73 Because they also rely on an approximation of the initial distribution and the policy loss, they require additional assumptions on the uniformity of state samples. [sent-514, score-0.481]
74 To summarize, this section identifies basic assumptions on the sampling behavior and shows that approximate bilinear programming scales well in the face of uncertainty caused by incomplete sampling. [sent-520, score-0.507]
75 Discussion and Related ADP Methods This section describes connections between approximate bilinear programming and two closely related approximate dynamic programming methods: ALP and L∞-API, which are commonly used to solve factored MDPs (Guestrin et al., 2003). [sent-524, score-0.607]
76 ALP provides value function bounds with respect to the L1 norm, which does not guarantee a small policy loss. [sent-528, score-0.5]
77 The downside of using approximate bilinear programming is, of course, the higher computational complexity. [sent-532, score-0.48]
78 Robust approximate bilinear program (7), on the other hand, has no such parameters. [sent-537, score-0.501]
79 In approximate bilinear programs, the Bellman residual is used in both the constraints and objective function. [sent-540, score-0.562]
80 There is an optimal solution of the bilinear program such that the solutions of the individual linear programs are basic feasible. [sent-547, score-0.56]
81 There are two possible cases we analyze: 1) all transitions of a greedy policy are to states with positive value, and 2) there is at least one transition to a state with a non-positive value. [sent-617, score-0.626]
82 Minimizing the L∞ approximation error is theoretically preferable, since it is compatible with the existing bounds on policy loss (Guestrin et al., 2003). [sent-655, score-0.523]
83 The bounds on value function approximation in API are typically (Munos, 2003): lim sup_{k→∞} ||v⋆ − v̂k||∞ ≤ (2γ/(1−γ)^2) lim sup_{k→∞} ||ṽk − v̂k||∞, where v̂k is the value function of policy πk, which is greedy with respect to ṽk. [sent-657, score-0.808]
84 2) OAPI, like API, uses only the current policy for the upper bound on the Bellman residual, but uses all the policies for the lower bound on the Bellman residual. [sent-671, score-0.485]
85 For the policy π to be a fixed point in OAPI, it needs to minimize the Bellman residual with respect to v′. [sent-685, score-0.542]
86 The convergence of OAPI is achieved because given a non-negative Bellman residual, the greedy policy also minimizes the Bellman residual. [sent-693, score-0.49]
87 In comparison, the greedy policy in L∞ -API does not minimize the Bellman residual, and therefore L∞ -API does not always reduce it. [sent-695, score-0.491]
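This distinction can be checked numerically: for a value function whose Bellman residual is non-negative for every action (a transitive-feasible v), the greedy policy attains min_π ||v − Lπ v||∞. The sketch below verifies this by enumeration on a hypothetical two-state MDP; the particular v is an arbitrary choice made transitive-feasible by hand.

```python
import itertools
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])   # P[a, s, s'] of a toy MDP
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]

def q_values(v):
    return r + gamma * np.einsum('ast,t->sa', P, v)

def residual_norm(v, pi):
    # || v - L_pi v ||_inf for a deterministic policy pi
    P_pi = np.array([P[pi[s], s, :] for s in range(2)])
    r_pi = np.array([r[s, pi[s]] for s in range(2)])
    return np.max(np.abs(v - (r_pi + gamma * P_pi @ v)))

v = np.array([32.0, 31.0])                  # chosen so that v >= L_a v for every action
assert np.all(v[:, None] - q_values(v) >= 0)

greedy = q_values(v).argmax(axis=1)
best = min(residual_norm(v, np.array(pi))
           for pi in itertools.product(range(2), repeat=2))
assert np.isclose(residual_norm(v, greedy), best)   # the greedy policy minimizes it
```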
88 1, we compare the policy loss of various approximate bilinear programming formulations with the policy loss of approximate policy iteration and approximate linear programming. [sent-704, score-1.901]
89 In the results, we use ABP to denote a close-to-optimal solution of robust ABP, ABPexp for the bilinear program (10), and ABPh for a formulation that minimizes the average of ABP and ABPexp. [sent-729, score-0.578]
90 API denotes approximate policy iteration that minimizes the L2 norm. [sent-730, score-0.49]
91 It clearly shows that the robust bilinear formulation most reliably minimizes the Bellman residual. [sent-732, score-0.491]
92 Note that API and ALP may achieve lower policy loss on this particular domain than the ABP formulations, even though their Bellman residual is significantly higher. [sent-739, score-0.543]
93 This is possible because ABP simply minimizes bounds on the policy loss. [sent-740, score-0.49]
94 The analysis of tightness of policy loss bounds is beyond the scope of this paper. [sent-741, score-0.492]
95 Figure 3: Expected policy loss for the chain problem. Figure 4: Robust policy loss ||v⋆ − vπ||∞ for the chain problem. Note that the state space in this problem is infinite. [sent-751, score-0.913]
96 Figure 5 compares the expected policy loss of ABP and RALP on the inverted pendulum benchmark as a function of the number of state transitions sampled. [sent-831, score-0.609]
97 The performance of the optimal policy was assumed to be 0, and a policy loss of 0 essentially corresponds to balancing the pole for 2500 steps. [sent-834, score-0.851]
98 Conclusion and Future Work We propose and analyze approximate bilinear programming, a new value-function approximation method, which provably minimizes bounds on policy loss. [sent-841, score-0.958]
99 We also show that there is no asymptotically simpler formulation, since finding the closest value function and solving a bilinear program are both NP-complete problems. [sent-843, score-0.498]
100 Note that, like LSPI for example, approximate bilinear programming is especially applicable to MDPs with discrete (and small) action spaces. [sent-848, score-0.517]
wordName wordTfidf (topN-words)
[('lv', 0.524), ('policy', 0.413), ('bilinear', 0.383), ('abp', 0.263), ('bellman', 0.256), ('alp', 0.152), ('oapi', 0.137), ('etrik', 0.123), ('ilberstein', 0.123), ('ilinear', 0.123), ('api', 0.106), ('pproximate', 0.105), ('rogramming', 0.105), ('residual', 0.105), ('farias', 0.082), ('representable', 0.073), ('policies', 0.072), ('programs', 0.069), ('petrik', 0.069), ('program', 0.064), ('occupancy', 0.058), ('actions', 0.057), ('lspi', 0.055), ('bounds', 0.054), ('sat', 0.054), ('approximate', 0.054), ('greedy', 0.054), ('robust', 0.052), ('transitions', 0.047), ('programming', 0.043), ('av', 0.042), ('marek', 0.041), ('shlomo', 0.041), ('mdp', 0.04), ('ra', 0.04), ('min', 0.038), ('vk', 0.038), ('transition', 0.038), ('action', 0.037), ('ut', 0.037), ('state', 0.037), ('minv', 0.037), ('states', 0.037), ('mdps', 0.036), ('li', 0.036), ('pendulum', 0.035), ('literal', 0.035), ('theorem', 0.035), ('ralp', 0.034), ('inverted', 0.034), ('value', 0.033), ('formulation', 0.033), ('ci', 0.032), ('approximation', 0.031), ('dynamic', 0.03), ('horst', 0.029), ('clause', 0.029), ('separable', 0.029), ('notice', 0.028), ('feasible', 0.028), ('pa', 0.028), ('deterministic', 0.028), ('sampling', 0.027), ('abpexp', 0.027), ('abph', 0.027), ('adp', 0.027), ('lagoudakis', 0.027), ('tuy', 0.027), ('guestrin', 0.027), ('reinforcement', 0.027), ('parr', 0.026), ('roy', 0.026), ('literals', 0.026), ('reward', 0.025), ('loss', 0.025), ('formulations', 0.024), ('minimize', 0.024), ('transitive', 0.024), ('sampled', 0.024), ('minimizes', 0.023), ('solution', 0.023), ('residuals', 0.021), ('solutions', 0.021), ('daniela', 0.021), ('maxs', 0.021), ('mins', 0.021), ('szita', 0.021), ('zilberstein', 0.021), ('mixed', 0.02), ('objective', 0.02), ('mangasarian', 0.019), ('frequencies', 0.019), ('norm', 0.019), ('samples', 0.019), ('bertsekas', 0.018), ('benchmark', 0.018), ('solving', 0.018), ('puterman', 0.018), ('holder', 0.018), ('rewards', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes
2 0.22690473 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments
Author: Jaedeug Choi, Kee-Eung Kim
Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert’s behavior, namely the case in which the expert’s policy is explicitly given, and the case in which the expert’s trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming
3 0.10460202 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier
Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics
4 0.077370085 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning
Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering
Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of modelfree methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, bestmatch learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse modelbased methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse
5 0.063276261 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
Author: Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann
Abstract: Bayesian learning methods have recently been shown to provide an elegant solution to the explorationexploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent’s return improve as a function of experience. Keywords: processes reinforcement learning, Bayesian inference, partially observable Markov decision
6 0.034461983 36 jmlr-2011-Generalized TD Learning
7 0.029481644 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
8 0.026931029 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
9 0.023851058 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
10 0.021256594 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
11 0.021053268 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
12 0.020025915 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
13 0.019682875 97 jmlr-2011-Union Support Recovery in Multi-task Learning
14 0.019389911 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
15 0.018744396 91 jmlr-2011-The Sample Complexity of Dictionary Learning
16 0.01871071 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
17 0.018422613 104 jmlr-2011-X-Armed Bandits
18 0.017988371 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
19 0.017734138 55 jmlr-2011-Learning Multi-modal Similarity
20 0.017245788 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
topicId topicWeight
[(0, 0.121), (1, 0.065), (2, -0.14), (3, 0.062), (4, -0.061), (5, 0.285), (6, -0.28), (7, -0.097), (8, 0.031), (9, -0.048), (10, 0.019), (11, 0.039), (12, -0.027), (13, -0.012), (14, -0.069), (15, 0.037), (16, 0.001), (17, 0.122), (18, -0.07), (19, 0.045), (20, -0.08), (21, 0.03), (22, -0.076), (23, 0.172), (24, -0.2), (25, -0.001), (26, 0.153), (27, 0.166), (28, -0.078), (29, 0.03), (30, -0.094), (31, 0.018), (32, 0.0), (33, 0.312), (34, -0.044), (35, -0.141), (36, -0.059), (37, -0.088), (38, -0.088), (39, -0.018), (40, -0.095), (41, 0.023), (42, 0.065), (43, 0.012), (44, 0.065), (45, 0.005), (46, -0.111), (47, 0.012), (48, -0.06), (49, -0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.96071267 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes
2 0.86174846 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments
Author: Jaedeug Choi, Kee-Eung Kim
Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert’s behavior, namely the case in which the expert’s policy is explicitly given, and the case in which the expert’s trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming
3 0.52931398 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier
Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics
4 0.20999087 6 jmlr-2011-A Simpler Approach to Matrix Completion
Author: Benjamin Recht
Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht (2009), Candès and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing
5 0.19111265 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
Author: Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann
Abstract: Bayesian learning methods have recently been shown to provide an elegant solution to the explorationexploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent’s return improve as a function of experience. Keywords: processes reinforcement learning, Bayesian inference, partially observable Markov decision
6 0.18517901 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
7 0.17472573 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
8 0.15211743 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
9 0.12669766 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
10 0.12599048 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
11 0.12464236 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
12 0.12229805 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
13 0.12203689 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
14 0.11802379 104 jmlr-2011-X-Armed Bandits
15 0.11528459 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
16 0.10887088 58 jmlr-2011-Learning from Partial Labels
17 0.10732159 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
18 0.1025044 55 jmlr-2011-Learning Multi-modal Similarity
19 0.10177702 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
20 0.098891914 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
topicId topicWeight
[(4, 0.045), (9, 0.038), (10, 0.03), (24, 0.031), (31, 0.064), (32, 0.016), (41, 0.049), (60, 0.019), (65, 0.013), (71, 0.01), (73, 0.033), (78, 0.101), (90, 0.017), (97, 0.429)]
simIndex simValue paperId paperTitle
same-paper 1 0.65552938 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes
2 0.31655347 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
3 0.31172061 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
4 0.31048861 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
5 0.31022933 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
6 0.30677503 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
7 0.30644134 104 jmlr-2011-X-Armed Bandits
8 0.30451375 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
9 0.30433166 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
10 0.30353031 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
11 0.30324003 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
12 0.30137172 22 jmlr-2011-Differentially Private Empirical Risk Minimization
13 0.30104455 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
14 0.29994845 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
15 0.29929742 16 jmlr-2011-Clustering Algorithms for Chains
16 0.29868907 12 jmlr-2011-Bayesian Co-Training
17 0.29852158 59 jmlr-2011-Learning with Structured Sparsity
18 0.29776028 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
19 0.29709956 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
20 0.29705808 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms