nips nips2003 nips2003-167 knowledge-graph by maker-knowledge-mining

167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices


Source: pdf

Author: Arnab Nilim, Laurent El Ghaoui

Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to realworld problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. [sent-5, score-0.345]

2 We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. [sent-8, score-0.547]

3 Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. [sent-9, score-0.195]

4 The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. [sent-10, score-0.225]

5 Hence, robustness can be added at practically no extra computing cost. [sent-11, score-0.15]

6 1 Introduction We consider a finite-state and finite-action Markov decision problem in which the transition probabilities themselves are uncertain, and seek a robust decision for it. [sent-12, score-0.663]

7 Our work is motivated by the fact that in many practical problems, the transition matrices have to be estimated from data. [sent-13, score-0.431]

8 This may be a difficult task and the estimation errors may have a huge impact on the solution, which is often quite sensitive to changes in the transition probabilities [3]. [sent-14, score-0.275]

9 A number of authors have addressed the issue of uncertainty in the transition matrices of an MDP. [sent-15, score-0.771]

10 A Bayesian approach such as described by [9] requires a perfect knowledge of the whole prior distribution on the transition matrix, making it difficult to apply in practice. [sent-16, score-0.333]

11 Other authors have considered the transition matrix to lie in a given set, most typically a polytope: see [8, 10, 5]. [sent-17, score-0.349]

12 Although our approach allows us to describe the uncertainty on the transition matrix by a polytope, we may argue against choosing such a model for the uncertainty. [sent-18, score-0.619]

13 First, a general polytope is often not a tractable way to address the robustness problem, as it incurs a significant additional computational effort to handle uncertainty. [sent-19, score-0.185]

14 In [1], the authors consider a problem dual to ours, and provide a general statement according to which the cost of solving their problem is polynomial in problem size, provided the uncertainty on the transition matrices is described by convex sets, without proposing any specific algorithm. [sent-21, score-1.12]

15 P > 0 or P ≥ 0 refers to the strict or non-strict componentwise inequality for matrices or vectors. [sent-24, score-0.284]

16 For a vector p > 0, log p refers to the componentwise operation. [sent-25, score-0.209]

17 The probability simplex in Rn is denoted ∆n = {p ∈ Rn : p ≥ 0, pT 1 = 1}, while Θn is the set of n × n transition matrices (componentwise non-negative matrices with rows summing to one). [sent-27, score-0.622]

18 2 The problem description We consider a finite horizon Markov decision process with finite decision horizon T = {0, 1, 2, . . . , N − 1}. [sent-29, score-0.42]

19 At each stage, the system occupies a state i ∈ X , where n = |X | is finite, and a decision maker is allowed to choose an action a deterministically from a finite set of allowable actions A = {a1 , . . . , am }. [sent-33, score-0.235]

20 The states make Markov transitions according to a collection of (possibly time-dependent) transition matrices τ := (Pta )a∈A,t∈T , where for every a ∈ A, t ∈ T , the n × n transition matrix Pta contains the probabilities of transition under action a at stage t. [sent-38, score-1.17]

21 Denote by π = (a0 , . . . , aN −1 ) a generic controller policy, where at (i) denotes the controller action when the system is in state i ∈ X at time t ∈ T . [sent-42, score-0.501]

22 Define by ct (i, a) the cost corresponding to state i ∈ X and action a ∈ A at time t ∈ T , and by cN the cost function at the terminal stage. [sent-44, score-0.672]

23 We assume that ct (i, a) is non-negative and finite for every i ∈ X and a ∈ A. [sent-45, score-0.181]

24 For a given set of transition matrices τ , we define the finite-horizon nominal problem by φN (Π, τ ) := min_{π∈Π} CN (π, τ ), (1) where CN (π, τ ) denotes the expected total cost under controller policy π and transitions τ : CN (π, τ ) := E[ Σ_{t=0}^{N−1} c_t(i_t, a_t(i_t)) + c_N(i_N) ]. (2) [sent-46, score-1.319]

25 A special case of interest is when the expected total cost function bears the form (2), where the terminal cost is zero, and ct (i, a) = ν^t c(i, a), with c(i, a) now a constant cost function, which we assume non-negative and finite everywhere, and ν ∈ (0, 1) is a discount factor. [sent-47, score-0.714]

26 We refer to this cost function as the discounted cost function, and denote by C∞ (π, τ ) the limit of the discounted cost (2) as N → ∞. [sent-48, score-0.702]

27 When the transition matrices are exactly known, the corresponding nominal problem can be solved via a dynamic programming algorithm, which has total complexity of nmN flops in the finite-horizon case. [sent-49, score-0.856]

28 In the infinite-horizon case with a discounted cost function, the cost of computing an ε-suboptimal policy via the Bellman recursion is O(nm log(1/ε)); see [7] for more details. [sent-50, score-0.716]
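
The nominal recursion referred to above can be written out as a short backward pass; the following is a minimal sketch (not code from the paper), with assumed array shapes and names (costs, P, terminal_cost) chosen purely for illustration.

```python
import numpy as np

def nominal_finite_horizon_dp(costs, P, terminal_cost):
    """Classical (nominal) backward Bellman recursion for the N-horizon problem.

    costs: array (N, m, n), costs[t, a, i] = c_t(i, a)
    P: array (N, m, n, n), P[t, a] = transition matrix of action a at stage t
    terminal_cost: array (n,), the terminal cost c_N
    Returns the stage-0 value function and a greedy policy table.
    """
    N, m, n = costs.shape
    v = terminal_cost.astype(float).copy()
    policy = np.zeros((N, n), dtype=int)
    for t in range(N - 1, -1, -1):
        # q[a, i] = c_t(i, a) + sum_j P_t^a(i, j) v_{t+1}(j)
        q = costs[t] + np.einsum("aij,j->ai", P[t], v)
        policy[t] = np.argmin(q, axis=0)
        v = q.min(axis=0)
    return v, policy
```

The robust algorithm of section 3 keeps exactly this backward structure and only replaces the expectation term by a worst-case (support-function) evaluation.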

29 1 The robust control problems At first we assume that for each action a and time t, the corresponding transition matrix Pta is only known to lie in some given subset P a . [sent-52, score-0.708]

30 Two models for transition matrix uncertainty are possible, leading to two possible forms of finite-horizon robust control problems. [sent-53, score-0.974]

31 In a first model, referred to as the stationary uncertainty model, the transition matrices are chosen by nature depending on the controller policy once and for all, and remain fixed thereafter. [sent-54, score-1.225]

32 In a second model, which we refer to as the time-varying uncertainty model, the transition matrices can vary arbitrarily with time, within their prescribed bounds. [sent-55, score-0.776]

33 Each problem leads to a game between the controller and nature, where the controller seeks to minimize the maximum expected cost, with nature being the maximizing player. [sent-56, score-0.563]

34 A policy of nature refers to a specific collection of time-dependent transition matrices τ = (Pta )a∈A,t∈T chosen by nature, and the set of admissible policies of nature is T := (⊗a∈A P a )N . [sent-58, score-0.86]

35 Denote by Ts the set of stationary admissible policies of nature: Ts = {τ = (P_t^a)_{a∈A,t∈T} ∈ T : P_t^a = P_s^a for every t, s ∈ T, a ∈ A}. [sent-59, score-0.256]

36 The stationary uncertainty model leads to the problem φN (Π, Ts ) := min_{π∈Π} max_{τ∈Ts} CN (π, τ ). (3) [sent-60, score-0.677]

37 In contrast, the time-varying uncertainty model leads to a relaxed version of the above: φN (Π, Ts ) ≤ φN (Π, T ) := min_{π∈Π} max_{τ∈T} CN (π, τ ). (4) [sent-61, score-0.49]

38 The first model is attractive for statistical reasons, as it is much easier to develop statistically accurate sets of confidence when the underlying process is time-invariant. [sent-62, score-0.124]

39 Unfortunately, the resulting game (3) seems to be hard to solve. [sent-63, score-0.152]

40 The second model is attractive as one can solve the corresponding game (4) using a variant of the dynamic programming algorithm seen later, but we are left with a difficult task, that of estimating a meaningful set of confidence for the time-varying matrices Pta . [sent-64, score-0.555]

41 In this paper we will use the first model of uncertainty in order to derive statistically meaningful sets of confidence for the transition matrices, based on likelihood or entropy bounds. [sent-65, score-0.721]

42 Then, instead of solving the corresponding difficult control problem (3), we use an approximation that is common in robust control, and solve the time-varying upper bound (4), using the uncertainty sets P a derived from a stationarity assumption about the transition matrices. [sent-66, score-0.975]

43 We will also consider a variant of the finite-horizon time-varying problem (4), where controller and nature play alternately, leading to a repeated game [sent-67, score-0.82]

44 φ_N^rep(Π, Q) := min_{a_0} max_{τ_0∈Q} min_{a_1} max_{τ_1∈Q} · · · min_{a_{N−1}} max_{τ_{N−1}∈Q} CN (π, τ ), (5) where the notation τ_t = (P_t^a)_{a∈A} denotes the collection of transition matrices at a given time t ∈ T , and Q := ⊗_{a∈A} P^a is the corresponding set of confidence. [sent-70, score-0.664]

45 In the sequel, for a given control policy π ∈ Π and subset S ⊆ T , the notation φN (π, S) := max_{τ∈S} CN (π, τ ) denotes the worst-case expected total cost for the finite-horizon problem, and φ∞ (π, S) is defined likewise. [sent-73, score-0.469]

46 First we provide a recursion, the “robust dynamic programming” algorithm, which solves the finite-horizon robust control problem (4). [sent-76, score-0.457]

47 We provide a simple proof in [2] of the optimality of the recursion, where the main ingredient is to show that perfect duality holds in the game (4). [sent-77, score-0.304]

48 As a corollary of this result, we obtain that the repeated game (5) is equivalent to its non-repeated counterpart (4). [sent-78, score-0.333]

49 Second, we provide similar results for the infinite-horizon problem with discounted cost function, (6). [sent-79, score-0.333]

50 Finally, we identify several classes of uncertainty models, which result in an algorithm that is both statistically accurate and numerically tractable. [sent-81, score-0.393]

51 We provide precise complexity results that imply that, with the proposed approach, robustness can be handled at practically no extra computing cost. [sent-82, score-0.205]

52 3 Finite-Horizon Robust MDP We consider the finite-horizon robust control problem defined in section 2. [sent-83, score-0.363]

53 For a given state i ∈ X , action a ∈ A, and P a ∈ P a , we denote by p_i^a the next-state distribution drawn from P a corresponding to state i ∈ X ; thus p_i^a is the i-th row of matrix P a . [sent-85, score-0.484]

54 Theorem 1 (robust dynamic programming) For the robust control problem (4), perfect duality holds: φN (Π, T ) = min_{π∈Π} max_{τ∈T} CN (π, τ ) = max_{τ∈T} min_{π∈Π} CN (π, τ ) := ψN (Π, T ). [sent-89, score-0.979]

55 The problem can be solved via the recursion v_t(i) = min_{a∈A} { c_t(i, a) + σ_{P_i^a}(v_{t+1}) }, i ∈ X , t ∈ T, (7) where σ_P(v) := sup{p^T v : p ∈ P} denotes the support function of a set P, and v_t(i) is the worst-case optimal value function in state i at stage t. [sent-90, score-0.92]

56 A corresponding optimal control policy π∗ = (a∗_0 , . . . , a∗_{N−1}) is obtained by setting [sent-91, score-0.322]

57 a∗_t(i) ∈ arg min_{a∈A} { c_t(i, a) + σ_{P_i^a}(v_{t+1}) }, i ∈ X . (8) [sent-94, score-0.323]

58 The effect of uncertainty on a given strategy π = (a0 , . . . , aN ) can be evaluated by the following recursion [sent-95, score-0.305]

59 v_t^π(i) = c_t(i, a_t(i)) + σ_{P_i^{a_t(i)}}(v_{t+1}^π), i ∈ X , (9) which provides the worst-case value function v^π for the strategy π. [sent-98, score-0.411]
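
To make the recursion (7)-(9) concrete, here is a sketch (ours, not the authors' implementation) of the robust backward pass with an abstract support-function oracle sigma(i, a, v) standing in for σ_{P_i^a}(v); the interval (box) oracle at the end is purely illustrative and is not one of the KL-based models of section 5.

```python
import numpy as np

def robust_finite_horizon_dp(costs, sigma, terminal_cost):
    """Robust recursion (7)-(8): v_t(i) = min_a { c_t(i, a) + sigma(i, a, v_{t+1}) }.

    costs: array (N, m, n), costs[t, a, i] = c_t(i, a)
    sigma: callable sigma(i, a, v) returning sup{ p @ v : p in P_i^a }
    terminal_cost: array (n,), the terminal cost c_N
    """
    N, m, n = costs.shape
    v = terminal_cost.astype(float).copy()
    policy = np.zeros((N, n), dtype=int)
    for t in range(N - 1, -1, -1):
        q = np.array([[costs[t, a, i] + sigma(i, a, v) for i in range(n)]
                      for a in range(m)])            # shape (m, n)
        policy[t] = np.argmin(q, axis=0)             # realizes (8)
        v = q.min(axis=0)                            # realizes (7)
    return v, policy

def make_box_sigma(p_nom, radius):
    """Illustrative oracle only: worst case over a box around nominal rows,
    intersected with the simplex, solved greedily."""
    def sigma(i, a, v):
        lo = np.maximum(p_nom[a, i] - radius, 0.0)
        hi = np.minimum(p_nom[a, i] + radius, 1.0)
        p, budget = lo.copy(), 1.0 - lo.sum()
        for j in np.argsort(-v):                     # spend the mass on large v_j first
            add = min(hi[j] - lo[j], budget)
            p[j] += add
            budget -= add
        return float(p @ v)
    return sigma
```

Evaluating a fixed strategy as in (9) amounts to the same loop with the minimum over actions replaced by the action a_t(i) prescribed by the strategy.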

60 The above result has a nice consequence for the repeated game (5): Corollary 2 The repeated game (5) is equivalent to the game (4): φ_N^rep(Π, Q) = φN (Π, T ), and the optimal strategies for φN (Π, T ) given in theorem 1 are optimal for φ_N^rep(Π, Q) as well. [sent-99, score-0.653]

61 The interpretation of the perfect duality result given in theorem 1, and its consequence given in corollary 2, is that it does not matter whether the controller or nature plays first, or whether they alternate; all these games are equivalent. [sent-100, score-0.486]

62 The complexity of the sets P_i^a for each i ∈ X and a ∈ A is a key component in the complexity of the robust dynamic programming algorithm. [sent-102, score-0.491]

63 Beyond numerical tractability, an additional criterion for the choice of a specific uncertainty model is of course that the sets P a should represent accurate (non-conservative) descriptions of the statistical uncertainty on the transition matrices. [sent-103, score-0.918]

64 In the finite-horizon case, we can bound vmax by O(N ). [sent-106, score-0.119]

65 Now consider the following algorithm, where the uncertainty is described in terms of one of the models described in section 5: Robust Finite Horizon Dynamic Programming Algorithm 1. [sent-107, score-0.347]

66 Repeat until t = 0: (a) For every state i ∈ X and action a ∈ A, compute, using the bisection algorithm given in [2], a value σ̂_i^a such that σ̂_i^a − ε/N ≤ σ_{P_i^a}(v̂_t) ≤ σ̂_i^a. [sent-111, score-0.246]

67 Set π = (a0 , . . . , aN −1 ), where a_t(i) = arg min_{a∈A} { c_{t−1}(i, a) + σ̂_i^a }, i ∈ X . [sent-118, score-0.129]

68 As shown in [2], the above algorithm provides an ε-suboptimal policy π that achieves the exact optimum with prescribed accuracy ε, with a required number of flops bounded above by O(mnN log(N/ε)). [sent-119, score-0.182]

69 This means that robustness is obtained at a relative increase of computational cost of only log(N/ε) with respect to the classical dynamic programming algorithm, which is small for moderate values of N . [sent-120, score-0.468]
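
One way to see why the per-stage tolerance ε/N in step (a) translates into overall accuracy ε (a sketch of the standard bookkeeping, not the proof from [2]): because every p ∈ P_i^a lies in the probability simplex, the support function satisfies σ_P(w) ≤ σ_P(v) + max_j (w_j − v_j) whenever v ≤ w, so the per-stage errors simply add up.

```latex
% Assume \hat v_N = v_N and, at each stage,
% \sigma_{P_i^a}(\hat v_{t+1}) \le \hat\sigma_i^a \le \sigma_{P_i^a}(\hat v_{t+1}) + \epsilon/N.
\begin{align*}
0 \le \hat v_{t+1} - v_{t+1} \le \tfrac{(N-t-1)\epsilon}{N}\,\mathbf{1}
  &\;\Longrightarrow\;
  \sigma_{P_i^a}(v_{t+1}) \le \sigma_{P_i^a}(\hat v_{t+1})
     \le \sigma_{P_i^a}(v_{t+1}) + \tfrac{(N-t-1)\epsilon}{N}, \\
\hat v_t(i) = \min_{a\in A}\bigl\{c_t(i,a) + \hat\sigma_i^a\bigr\}
  &\;\Longrightarrow\;
  v_t(i) \le \hat v_t(i) \le v_t(i) + \tfrac{(N-t)\epsilon}{N},
\end{align*}
% so that at t = 0 the computed value lies within \epsilon of \phi_N(\Pi, T).
```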

70 We begin with the infinite-horizon problem involving stationary control and nature policies defined in (6). [sent-123, score-0.472]

71 The optimal value is given by φ∞ (Πs , Ts ) = v(i0 ), where i0 is the initial state, and where the value function v satisfies the optimality conditions v(i) = min_{a∈A} { c(i, a) + ν σ_{P_i^a}(v) }, i ∈ X . (13) [sent-126, score-0.136]

72 A stationary, optimal control policy π = (a∗ , a∗ , . . . ) is obtained as [sent-130, score-0.322]

73 a∗ (i) ∈ arg min_{a∈A} { c(i, a) + ν σ_{P_i^a}(v) }, i ∈ X . (14) [sent-133, score-0.142]

74 Note that the problem of computing the dual quantity ψ∞ (Πs , Ts ) given in (11) has been addressed in [1], where the authors provide the recursion (13) without proof. [sent-134, score-0.207]

75 Corollary 4 In the infinite-horizon problem, we can without loss of generality assume that the control and nature policies are stationary, that is, φ∞ (Π, T ) = φ∞ (Πs , Ts ) = φ∞ (Πs , T ) = φ∞ (Π, Ts ). (15) [sent-136, score-0.285]

76 Furthermore, in the finite-horizon case, with a discounted cost function, the gap between the optimal values of the finite-horizon problems under stationary and time-varying uncertainty models, φN (Π, T ) − φN (Π, Ts ), goes to zero as the horizon length N goes to infinity, at a geometric rate ν. [sent-137, score-0.953]

77 Now consider the following algorithm, where we describe the uncertainty using one of the models of section 5. [sent-138, score-0.347]

78 Upon convergence, set π = (a∗ , a∗ , . . . ), where a∗ (i) = arg min_{a∈A} { c(i, a) + ν σ̂_i^a }, i ∈ X . [sent-151, score-0.129]

79 In [2], we establish that the above algorithm finds an ε-suboptimal robust policy in at most O(nm log(1/ε)^2) flops. [sent-152, score-0.312]

80 Thus, the extra computational cost incurred by robustness in the infinite-horizon case is only O(log(1/ε)). [sent-153, score-0.254]
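
The infinite-horizon scheme implied by (13)-(14) is a value iteration in which the expectation is again replaced by a support-function evaluation; here is a minimal sketch (ours, not the authors' code), with an assumed sup-norm stopping rule and illustrative names.

```python
import numpy as np

def robust_value_iteration(cost, sigma, nu, tol=1e-6, max_iter=100_000):
    """Iterate v(i) <- min_a { c(i, a) + nu * sigma(i, a, v) } to a (near) fixed point.

    cost: array (m, n), cost[a, i] = c(i, a)
    sigma: callable sigma(i, a, v) ~ sup{ p @ v : p in P_i^a }
    nu: discount factor in (0, 1)
    """
    m, n = cost.shape
    v = np.zeros(n)
    for _ in range(max_iter):
        q = np.array([[cost[a, i] + nu * sigma(i, a, v) for i in range(n)]
                      for a in range(m)])
        v_new = q.min(axis=0)
        # usual value-iteration stopping rule (an assumption here), relying on
        # the robust Bellman operator being a nu-contraction
        if np.max(np.abs(v_new - v)) < tol * (1.0 - nu) / (2.0 * nu):
            v = v_new
            break
        v = v_new
    q = np.array([[cost[a, i] + nu * sigma(i, a, v) for i in range(n)]
                  for a in range(m)])
    return v, q.argmin(axis=0)     # stationary policy as in (14)
```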

81 5 Kullback-Leibler Divergence Uncertainty Models We now address the inner problem (10) for a specific action a ∈ A and state i ∈ X . [sent-154, score-0.315]

82 Denote by D(p‖q) the Kullback-Leibler (KL) divergence (relative entropy) from the probability distribution q ∈ ∆n to the probability distribution p ∈ ∆n : D(p‖q) := Σ_j p(j) log( p(j) / q(j) ). [sent-155, score-0.203]

83 The above function provides a natural way to describe errors in (rows of) the transition matrices; examples of models based on this function are given below. [sent-156, score-0.317]

84 Likelihood Models: Our first uncertainty model is derived from a controlled experiment starting from state i = 1, 2, . . . , n. [sent-157, score-0.375]

85 We denote by F a the matrix of empirical frequencies of transition with control a in the experiment; denote by fia its ith row. [sent-161, score-0.587]

86 The “plug-in” estimate P^a = F^a is the solution to the maximum-likelihood problem max_P { Σ_{i,j} F^a(i, j) log P(i, j) : P ≥ 0, P1 = 1 }. (16) [sent-163, score-0.174]

87 The optimal log-likelihood is β^a_max = Σ_{i,j} F^a(i, j) log F^a(i, j). [sent-164, score-0.204]
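
As a small illustration of how the empirical quantities above could be computed from raw transition counts (a sketch with assumed inputs; F^a is read here as the row-normalized count matrix, so that the plug-in estimate P^a = F^a indeed solves (16) — that reading is our assumption):

```python
import numpy as np

def empirical_model(counts):
    """counts[i, j] = number of observed i -> j transitions under a fixed action a.

    Returns F^a (row-normalized empirical frequencies, equal to the plug-in
    estimate P^a) and the optimal log-likelihood
    beta_max^a = sum_{i,j} F^a(i, j) log F^a(i, j), with the convention 0 log 0 = 0.
    """
    counts = np.asarray(counts, dtype=float)
    F = counts / counts.sum(axis=1, keepdims=True)
    mask = F > 0
    beta_max = float(np.sum(F[mask] * np.log(F[mask])))
    return F, beta_max
```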

88 A classical description of uncertainty in a maximum-likelihood setting is via the “likelihood region” [6]: P^a = { P ∈ R^{n×n} : P ≥ 0, P1 = 1, Σ_{i,j} F^a(i, j) log P(i, j) ≥ β^a }, (17) where β^a < β^a_max is a pre-specified number, which represents the uncertainty level. [sent-165, score-0.764]

89 In practice, the designer specifies an uncertainty level β^a based on re-sampling methods, or on a large-sample Gaussian approximation, so as to ensure that the set above achieves a desired level of confidence. [sent-166, score-0.305]

90 With the above model, we note that the inner problem (10) only involves the set P_i^a := { p_i^a ∈ R^n : p_i^a ≥ 0, (p_i^a)^T 1 = 1, Σ_j F^a(i, j) log p_i^a(j) ≥ β_i^a }, where the parameter β_i^a := β^a − Σ_{k≠i} Σ_j F^a(k, j) log F^a(k, j). [sent-167, score-0.759]

91 If there exists prior information regarding the uncertainty on the i-th row of P^a, which can be described via a Dirichlet distribution [4] with parameter α_i, the resulting MAP estimation problem takes the form max_p { (f_i^a + α_i − 1)^T log p : p^T 1 = 1, p ≥ 0 }. [sent-171, score-0.603]

92 Thus, the MAP uncertainty model is equivalent to a Likelihood model, with the sample distribution f_i^a replaced by f_i^a + α_i − 1, where α_i is the prior corresponding to state i and action a. [sent-172, score-0.716]

93 We can also choose to describe uncertainty by exchanging the order of the arguments of the KL divergence. [sent-174, score-0.305]

94 Equipped with one of the above uncertainty models, we can address the inner problem (10). [sent-176, score-0.469]

95 As shown in [2], the inner problem can be converted, by convex duality, to a problem of minimizing a single-variable convex function. [sent-177, score-0.243]
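
To illustrate the one-dimensional reduction, here is a sketch under a plain KL-ball model P = {p ∈ ∆n : D(p‖q) ≤ β} (an assumption standing in for the likelihood/MAP sets above, not the exact sets of [2]); standard Lagrangian duality gives σ_P(v) = min_{μ>0} { μβ + μ log Σ_j q(j) e^{v(j)/μ} }, a single-variable convex problem that a bisection or any scalar minimizer can handle.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigma_kl_ball(v, q, beta):
    """sup{ p @ v : p in simplex, D(p || q) <= beta } via its scalar dual.

    Assumes q > 0 componentwise and beta > 0. The dual objective is
    g(mu) = mu*beta + mu*log(sum_j q_j exp(v_j/mu)), minimized over mu > 0;
    the worst-case distribution is p_j proportional to q_j exp(v_j/mu*).
    """
    v = np.asarray(v, dtype=float)
    q = np.asarray(q, dtype=float)

    def dual(log_mu):
        mu = np.exp(log_mu)
        z = np.log(q) + v / mu
        zmax = z.max()                      # stabilized log-sum-exp
        return mu * beta + mu * (zmax + np.log(np.exp(z - zmax).sum()))

    res = minimize_scalar(dual, bounds=(-20.0, 20.0), method="bounded")
    return float(res.fun)

# Illustrative usage (all numbers made up):
# q = np.array([0.5, 0.3, 0.2]); v = np.array([1.0, 2.0, 0.5])
# print(sigma_kl_ball(v, q, beta=0.1))
```

Plugging such an oracle into the robust recursions above is what keeps each robust Bellman backup within a logarithmic factor of its nominal cost.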

96 Remark: We can also use models where the uncertainty in the i-th row of the transition matrix P^a is described by a finite set of vectors, P_i^a = {p_i^{a,1}, . . . , p_i^{a,K}}. [sent-179, score-0.709]

97 In this case the complexity of the corresponding robust dynamic programming algorithm is increased by a relative factor of K with respect to its classical counterpart, which makes the approach attractive when the number of “scenarios” K is moderate. [sent-183, score-0.512]
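
For this scenario model the support function is just a maximum of K inner products, so each robust backup costs K dot products per state-action pair; a trivial sketch with illustrative names:

```python
import numpy as np

def sigma_scenarios(v, scenarios):
    """sup{ p @ v : p in {p^{a,1}, ..., p^{a,K}} } for a finite scenario set.

    scenarios: array (K, n), each row a candidate probability vector for row i of P^a.
    """
    return float(np.max(np.asarray(scenarios) @ np.asarray(v)))
```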

98 6 Concluding remarks We proposed a “robust dynamic programming” algorithm for solving finite-state and finite-action MDPs whose solutions are guaranteed to tolerate arbitrary changes of the transition probability matrices within given sets. [sent-184, score-0.557]

99 The resulting robust dynamic programming algorithm has almost the same computational cost as the classical dynamic programming algorithm: the relative increase to compute an ε-suboptimal policy is O(N log(1/ε)) in the N -horizon case, and O(log(1/ε)) for the infinite-horizon case. [sent-186, score-0.91]

100 Robust solution to Markov decision problems with uncertain transition matrices: proofs and complexity analysis. [sent-196, score-0.513]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('uncertainty', 0.305), ('transition', 0.275), ('pta', 0.239), ('pi', 0.202), ('cn', 0.19), ('ct', 0.181), ('robust', 0.17), ('ts', 0.162), ('matrices', 0.156), ('game', 0.152), ('controller', 0.151), ('discounted', 0.147), ('control', 0.143), ('policy', 0.142), ('stationary', 0.137), ('cost', 0.136), ('fia', 0.13), ('recursion', 0.122), ('vmax', 0.119), ('programming', 0.117), ('vt', 0.108), ('vk', 0.105), ('horizon', 0.101), ('uncertain', 0.099), ('min', 0.099), ('bisection', 0.095), ('duality', 0.094), ('dynamic', 0.094), ('pa', 0.088), ('max', 0.086), ('decision', 0.084), ('policies', 0.083), ('rn', 0.082), ('robustness', 0.081), ('action', 0.081), ('inner', 0.081), ('log', 0.081), ('rep', 0.078), ('componentwise', 0.078), ('corollary', 0.077), ('divergence', 0.074), ('eecs', 0.071), ('polytope', 0.071), ('kl', 0.071), ('state', 0.07), ('terminal', 0.068), ('mdps', 0.068), ('counterpart', 0.066), ('markov', 0.065), ('nilim', 0.06), ('nature', 0.059), ('perfect', 0.058), ('pt', 0.058), ('discount', 0.057), ('statistically', 0.055), ('complexity', 0.055), ('bellman', 0.054), ('ops', 0.052), ('problem', 0.05), ('refers', 0.05), ('denotes', 0.048), ('row', 0.048), ('theorem', 0.047), ('goes', 0.045), ('nominal', 0.044), ('entropy', 0.043), ('likelihood', 0.043), ('arg', 0.043), ('models', 0.042), ('classical', 0.04), ('berkeley', 0.04), ('prescribed', 0.04), ('matrix', 0.039), ('repeated', 0.038), ('extra', 0.037), ('transitions', 0.037), ('optimal', 0.037), ('dence', 0.036), ('attractive', 0.036), ('admissible', 0.036), ('nm', 0.036), ('authors', 0.035), ('nite', 0.035), ('everywhere', 0.035), ('simplex', 0.035), ('map', 0.034), ('go', 0.033), ('operations', 0.033), ('address', 0.033), ('accurate', 0.033), ('via', 0.033), ('processes', 0.032), ('solving', 0.032), ('stage', 0.032), ('proved', 0.032), ('solved', 0.032), ('mdp', 0.032), ('practically', 0.032), ('convex', 0.031), ('california', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

Author: Arnab Nilim, Laurent El Ghaoui

Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to realworld problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.

2 0.17086044 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

3 0.16991568 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

4 0.14194824 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual costfunction learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

5 0.13652255 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng

Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1

6 0.13524225 42 nips-2003-Bounded Finite State Controllers

7 0.12044165 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

8 0.1035067 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

9 0.10288753 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

10 0.091043361 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

11 0.090209052 62 nips-2003-Envelope-based Planning in Relational MDPs

12 0.088984743 55 nips-2003-Distributed Optimization in Adaptive Networks

13 0.08747422 68 nips-2003-Eye Movements for Reward Maximization

14 0.085608132 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles

15 0.085440412 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

16 0.081364155 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

17 0.081033297 171 nips-2003-Semi-Definite Programming by Perceptron Learning

18 0.080833763 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

19 0.079597026 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

20 0.078076363 169 nips-2003-Sample Propagation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.248), (1, 0.186), (2, -0.148), (3, 0.028), (4, 0.06), (5, 0.064), (6, -0.037), (7, -0.092), (8, 0.038), (9, 0.049), (10, -0.094), (11, -0.145), (12, -0.023), (13, 0.118), (14, -0.024), (15, -0.062), (16, -0.017), (17, 0.008), (18, 0.008), (19, -0.092), (20, 0.003), (21, -0.12), (22, -0.037), (23, -0.087), (24, -0.156), (25, 0.028), (26, 0.022), (27, -0.089), (28, 0.078), (29, -0.132), (30, -0.045), (31, -0.003), (32, -0.027), (33, -0.049), (34, -0.041), (35, 0.045), (36, -0.013), (37, -0.092), (38, -0.001), (39, -0.127), (40, -0.094), (41, 0.044), (42, 0.077), (43, -0.006), (44, 0.061), (45, -0.096), (46, 0.021), (47, 0.07), (48, -0.048), (49, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97730803 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

Author: Arnab Nilim, Laurent El Ghaoui

Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to realworld problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.

2 0.72234249 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

3 0.71753389 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

Author: Milos Hauskrecht, Branislav Kveton

Abstract: Approximate linear programming (ALP) has emerged recently as one of the most promising methods for solving complex factored MDPs with finite state spaces. In this work we show that ALP solutions are not limited only to MDPs with finite state spaces, but that they can also be applied successfully to factored continuous-state MDPs (CMDPs). We show how one can build an ALP-based approximation for such a model and contrast it to existing solution methods. We argue that this approach offers a robust alternative for solving high dimensional continuous-state space problems. The point is supported by experiments on three CMDP problems with 24-25 continuous state factors. 1

4 0.697285 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning

Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng

Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.

5 0.60676175 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual costfunction learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

6 0.59295636 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

7 0.58726346 78 nips-2003-Gaussian Processes in Reinforcement Learning

8 0.55903184 42 nips-2003-Bounded Finite State Controllers

9 0.52070314 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

10 0.47772095 62 nips-2003-Envelope-based Planning in Relational MDPs

11 0.47695354 123 nips-2003-Markov Models for Automated ECG Interval Analysis

12 0.45549011 75 nips-2003-From Algorithmic to Subjective Randomness

13 0.40728986 55 nips-2003-Distributed Optimization in Adaptive Networks

14 0.39905864 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

15 0.39903033 146 nips-2003-Online Learning of Non-stationary Sequences

16 0.3879891 68 nips-2003-Eye Movements for Reward Maximization

17 0.38722885 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

18 0.38363573 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

19 0.37746164 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles

20 0.37236115 48 nips-2003-Convex Methods for Transduction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.054), (30, 0.012), (35, 0.07), (53, 0.092), (69, 0.019), (71, 0.056), (76, 0.034), (85, 0.047), (91, 0.096), (99, 0.438)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91171986 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

Author: Arnab Nilim, Laurent El Ghaoui

Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to realworld problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.

2 0.83404189 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

Author: Michael I. Jordan, Martin J. Wainwright

Abstract: We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaussian entropy bound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problem that can be solved by efficient interior point methods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our variational problem is strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a substantial margin for certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e.g., 3]. 1

3 0.72363293 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus

Author: Zach Solan, David Horn, Eytan Ruppin, Shimon Edelman

Abstract: We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be “intermediate” for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children. 1

4 0.58539164 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

5 0.53147578 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual costfunction learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

6 0.50180137 78 nips-2003-Gaussian Processes in Reinforcement Learning

7 0.49819595 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

8 0.48469698 121 nips-2003-Log-Linear Models for Label Ranking

9 0.48460621 42 nips-2003-Bounded Finite State Controllers

10 0.48008037 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

11 0.47892606 30 nips-2003-Approximability of Probability Distributions

12 0.47828996 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

13 0.47664797 151 nips-2003-PAC-Bayesian Generic Chaining

14 0.47623992 189 nips-2003-Tree-structured Approximations by Expectation Propagation

15 0.46784002 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles

16 0.4578352 48 nips-2003-Convex Methods for Transduction

17 0.45748639 146 nips-2003-Online Learning of Non-stationary Sequences

18 0.45663077 55 nips-2003-Distributed Optimization in Adaptive Networks

19 0.45608252 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering

20 0.44893366 107 nips-2003-Learning Spectral Clustering