nips nips2013 nips2013-273 knowledge-graph by maker-knowledge-mining

273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes


Source: pdf

Author: Shiau Hong Lim, Huan Xu, Shie Mannor

Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. [sent-8, score-0.592]

2 We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. [sent-9, score-0.268]

3 We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. [sent-10, score-0.809]

4 1 Introduction Markov decision processes (MDPs) [Puterman, 1994] have been widely used to model and solve sequential decision problems in stochastic environments. [sent-11, score-0.333]

5 Given the parameters of an MDP, namely, the rewards and transition probabilities, an optimal policy can be computed. [sent-12, score-0.447]

6 Hence, the performance of the chosen policy may deteriorate significantly; see [Mannor et al. [sent-14, score-0.319]

7 The robust MDP framework has been proposed to address this issue of parameter uncertainty (e. [sent-16, score-0.18]

8 The robust MDP setting assumes that the true parameters fall within some uncertainty set U and seeks a policy that performs the best under the worst realization of the parameters. [sent-19, score-0.448]

9 Variants of robust MDP formulations have been proposed to mitigate the conservativeness when additional information on parameter distribution [Strens, 2000, Xu and Mannor, 2012] or coupling among the parameters [Mannor et al. [sent-21, score-0.207]

10 Since in practice it is often difficult to accurately quantify the uncertainty, the solutions to the robust MDP can be conservative if too large an uncertainty set is used. [sent-24, score-0.22]

11 We assume that some of the state-action pairs are adversarial in the sense that their parameters can change arbitrarily within U from one step to another. [sent-26, score-0.463]

12 1 In this setting, a traditional robust MDP approach would be equivalent to assuming that all parameters are adversarial and therefore would always execute the minimax policy. [sent-29, score-0.705]

13 We show that the cumulative reward obtained from this method is as good as that of the minimax policy that knows the true nature of each state-action pair. [sent-35, score-0.686]

14 Executing action a in state s results in a random transition according to a distribution ps,a (·), where ps,a (s′) gives the probability of transitioning to state s′, and accumulates an immediate reward r(s, a). [sent-44, score-0.428]

15 A robust MDP considers the case where the transition probability is determined in an adversarial way. [sent-45, score-0.59]

16 That is, when action a is taken at state s, the transition probability ps,a (·) can be an arbitrary element of the uncertainty set U(s, a). [sent-46, score-0.306]
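
As a rough illustration of the transition model just described, the sketch below (Python; all names, the dictionary-based uncertainty-set representation, and the adversary callback are invented for illustration and are not from the paper) samples one step: for pairs in the adversarial set F the next-state distribution may be any element of U(s, a), possibly chosen by an adversary, while all other pairs use a fixed stochastic distribution.

```python
import numpy as np

def step(s, a, p_nominal, uncertainty_set, adversarial_pairs, reward, rng, adversary=None):
    """Sample one transition of the partially adversarial MDP (illustrative sketch).

    p_nominal[(s, a)]       -- fixed next-state distribution for purely stochastic pairs
    uncertainty_set[(s, a)] -- list of admissible distributions, i.e. U(s, a), as numpy arrays
    adversarial_pairs       -- the set F, unknown to the learning algorithm
    adversary               -- optional callable picking an element of U(s, a)
    """
    if (s, a) in adversarial_pairs:
        candidates = uncertainty_set[(s, a)]
        # the adversary may pick any admissible distribution, possibly history dependent
        p = adversary(s, a, candidates) if adversary else candidates[rng.integers(len(candidates))]
    else:
        p = p_nominal[(s, a)]
    s_next = int(rng.choice(len(p), p=p))
    return s_next, reward[(s, a)]
```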

17 Here, the power of the adversary – the uncertainty set of the parameter – is precisely known, and the goal is to find the minimax policy – the policy with the best performance under the worst admissible parameters. [sent-50, score-0.966]

18 We ask the following question: suppose the power of the adversary (the extent to which it can affect the system) is not completely revealed to the decision maker; if we are allowed to play the MDP many times, can we still obtain an optimal policy as if we knew the true extent of its power? [sent-52, score-0.493]

19 Only a subset F ⊂ S × A is truly adversarial while all the other state-action pairs behave purely stochastically, i. [sent-56, score-0.604]

20 The adversarial behavior is not constrained, except that it must belong to the uncertainty set. [sent-63, score-0.499]

21 The R-Max algorithm deals with stochastic games where the opponent’s action-set for each state is known and the opponent’s actions are always observable. [sent-67, score-0.197]

22 Indeed, because the set F is unknown, even the action set of the adversary is unknown to the algorithm. [sent-72, score-0.245]

23 The second challenge is due to unconstrained adversarial behavior. [sent-73, score-0.417]

24 For state-action pairs (s, a) ∈ F, the opponent is free to choose any ps,a ∈ U(s, a) for each transition, possibly depending on the history and the strategy of the decision maker (i. [sent-74, score-0.24]

25 In particular, when considering the regret against the best stationary policy “in hindsight”, [Yu and Mannor, 2009] show that a small change in transition probabilities can cause large regret. [sent-78, score-0.574]

26 Even with additional constraints on the allowed adversarial behavior, they showed that the regret bound still does not vanish with respect to the number of steps. [sent-79, score-0.585]

27 Indeed, most results for adversarial MDPs [Even-Dar et al. [sent-80, score-0.434]

28 , 2012] only deal with adversarial rewards while the transitions are assumed stochastic and fixed, which is considerably simpler than our setting. [sent-85, score-0.671]

29 Since it is not possible to achieve vanishing regret against the best stationary policy in hindsight, we choose to measure the regret against the performance of a minimax policy that knows exactly which state-actions are adversarial (i. [sent-86, score-1.59]

30 Finally, given that we are competing against the minimax policy, one might ask whether we could simply apply existing algorithms such as UCRL2 [Jaksch et al. [sent-90, score-0.27]

31 The following example shows that ignoring any adversarial behavior may lead to large regret compared to the minimax policy. [sent-92, score-0.843]

32 Figure 1: Example MDP with adversarial transitions. [sent-93, score-0.383]

33 Suppose that a UCRL2-like algorithm is used, where all transitions are assumed purely stochastic. [sent-95, score-0.186]

34 Action a1 leads to the optimal minimax average reward of g ∗ . [sent-97, score-0.37]

35 State s1 has adversarial transition, where both s2 and s4 are possible next states. [sent-99, score-0.383]

36 In phase 1, the adversary behaves “benignly” by choosing all solid-line transitions. [sent-102, score-0.197]

37 In phase 2, the adversary chooses the dashed-line transitions in both s1 and s4 . [sent-104, score-0.304]

38 This means that the overall total regret is cT which is linear in T . [sent-110, score-0.241]

39 Note that in the above example, the expected value of a2 remains greater than the minimax value g ∗ throughout phase 2 and therefore the algorithm will continue to prefer a2 , even though the actual accumulated average value is already way below g ∗ . [sent-111, score-0.327]
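
To see numerically why an optimism-only average over both phases can stay above g∗ long after the adversary turns malicious, here is a small sketch (Python; the values of g∗, β, α, the phase lengths, and the bonus term are made up for illustration and are not taken from the paper):

```python
import math

g_star, beta, alpha = 1.0, 0.2, 0.5      # illustrative values only
n1, n2 = 10_000, 10_000                  # lengths of the benign and malicious phases

total, below_at = 0.0, None
for t in range(1, n1 + n2 + 1):
    total += (g_star + beta) if t <= n1 else (g_star - alpha)
    mean = total / t
    bonus = math.sqrt(math.log(t + 1) / t)   # schematic UCRL2-style optimism bonus
    if t > n1 and below_at is None and mean + bonus < g_star:
        below_at = t
print("optimistic estimate of a2 first drops below g* at step", below_at)
```

Even in this crude version the optimistic estimate of a2 stays above g∗ for a constant fraction of phase 2 (in the paper's construction, where the adversary also controls s4, it stays above throughout), so an algorithm that treats everything as stochastic keeps choosing a2 and suffers regret linear in T.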

40 3 Algorithm and main result In this section, we present our algorithm and the main result for the finite-horizon case with the total reward as the performance measure. [sent-113, score-0.19]

41 It is straightforward, by introducing additional states, to extend the algorithm and analysis to the case where the reward function is random, unknown and even adversarial. [sent-117, score-0.185]

42 After that, a new episode begins, with an arbitrarily chosen start state (it can simply be the last state of the previous episode). [sent-120, score-0.26]

43 Let π be a finite-horizon (non-stationary) policy where πt (s) gives the action to be executed in state s at step t in an episode, where t = 0, . . . , T − 1. [sent-122, score-0.561]

44 It is not hard to show that given state s, there exists a policy π with $V_0^\pi(s) = V_0^*(s)$, and we can compute such a minimax policy if the algorithm knows F and ps,a for all (s, a) ∈ F, from the robust MDP literature (e. [sent-143, score-0.954]

45 The main message of this paper is that we can determine a policy as good as the minimax policy without knowing either F or ps,a for (s, a) ∈ F. [sent-146, score-0.789]

46 To make this formal, we define the regret (against the minimax performance) in episode i, for i = 1, 2, . . . [sent-147, score-0.553]

47 as $\Delta_i = V_0^*(s_0^i) - \sum_{t=0}^{T-1} r(s_t^i, a_t^i)$, where $s_t^i$ and $a_t^i$ denote the actual state visited and action taken at step t of episode i. [sent-150, score-0.296]

48 The total regret for m episodes, which we want to minimize, is thus defined as $\Delta(m) = \sum_{i=1}^{m} \Delta_i$. [sent-151, score-0.241]
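
A direct transcription of these two definitions might look as follows (a minimal sketch; the arrays or dictionaries holding V0∗ and the per-step rewards are assumed to be recorded by the learner):

```python
def episode_regret(V0_star, s0, rewards):
    """Delta_i = V_0^*(s_0^i) - sum_t r(s_t^i, a_t^i) for one episode."""
    return V0_star[s0] - sum(rewards)

def total_regret(per_episode_regrets):
    """Delta(m) = sum of the per-episode regrets Delta_1, ..., Delta_m."""
    return sum(per_episode_regrets)
```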

49 , 2010] with an additional stochastic check to detect adversarial state-action pairs. [sent-154, score-0.642]

50 The key challenge, however, is to successfully identify the adversarial state-action pairs when they start to behave maliciously. [sent-158, score-0.478]

51 They show that it is possible to achieve near-optimal regret without knowing a priori whether a bandit is stochastic or adversarial. [sent-160, score-0.375]

52 In [Bubeck and Slivkins, 2012], the key is to check some consistency conditions that would be satisfied if the behavior is stochastic. [sent-161, score-0.192]

53 A policy is executed until either a new pair (s, a) fails the stochastic check, and is hence deemed to be adversarial, or some state-action pair has been executed too many times. [sent-169, score-0.829]

54 In either case, we need to re-compute the current optimistic policy (see Section 3. [sent-170, score-0.396]

55 Every time a new policy is computed we call it a new epoch. [sent-172, score-0.268]

56 While each episode has the same length (T ), each epoch can span multiple episodes, and an epoch can begin in the middle of an episode. [sent-173, score-0.252]
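
The epoch/episode structure can be summarized by the following control-flow skeleton (a rough sketch only: the environment interface, compute_optimistic_policy, stochastic_check_fails and the count-doubling criterion are placeholders standing in for Sections 3.1 and 3.2, not the paper's actual pseudocode):

```python
def olrm_skeleton(env, T, num_episodes, compute_optimistic_policy, stochastic_check_fails):
    """Run OLRM-style epochs: recompute the policy whenever a pair is newly deemed
    adversarial or some pair has been executed too many times."""
    F_hat = set()                 # pairs currently believed to be adversarial
    counts = {}                   # N(s, a): executions of each pair so far
    counts_at_epoch_start = {}

    policy = compute_optimistic_policy(F_hat, counts)        # epoch 1
    for _ in range(num_episodes):
        s = env.reset()
        for t in range(T):
            a = policy(s, t)
            s_next, r = env.step(s, a)
            counts[(s, a)] = counts.get((s, a), 0) + 1

            recompute = False
            if (s, a) not in F_hat and stochastic_check_fails(s, a):
                F_hat.add((s, a))                            # newly deemed adversarial
                recompute = True
            if counts[(s, a)] >= 2 * counts_at_epoch_start.get((s, a), 1):
                recompute = True                             # executed too many times

            if recompute:                                    # a new epoch begins here,
                counts_at_epoch_start = dict(counts)         # possibly mid-episode
                policy = compute_optimistic_policy(F_hat, counts)
            s = s_next
    return F_hat
```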

57 3.1 Computing an optimistic policy. Figure 3 shows the algorithm for computing the optimistic minimax policy, where we treat all state-action pairs in the set F̃ as adversarial, and (similar to UCRL2) use optimistic values for the other state-action pairs. [sent-175, score-0.966]

58 1 We provide high-probability regret bounds for any single trial, from which the expected regret can be readily derived, if desired. [sent-176, score-0.404]

59 Compute an optimistic policy π̃, assuming all state-action pairs in F̃ are adversarial (Section 3. [sent-182, score-0.827]

60 • The executed state-action pair (s, a) fails the stochastic check (Section 3. [sent-186, score-0.51]

61 We use Nk (s, a) to denote the total number of times the state-action pair (s, a) has been executed before epoch k. [sent-197, score-0.303]

62 If (s, a) has never been executed before epoch k, we set Nk (s, a) = 1 and let P̂k (·|s, a) be arbitrary. [sent-199, score-0.26]

63 ˜ Figure 3: Algorithm for computing an optimistic minimax policy. [sent-212, score-0.347]
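
The kind of backward recursion Figure 3 describes can be sketched as follows (a hedged reconstruction, not the paper's exact routine): pairs in F̃ are evaluated with the worst-case distribution in U(s, a), while the remaining pairs get an optimistic value built from the empirical distribution P̂k plus a confidence bonus whose exact form is assumed here.

```python
import numpy as np

def optimistic_minimax_values(S, A, T, r, P_hat, N, U, F_hat, conf):
    """Backward recursion for Q_t and V_t, t = T-1, ..., 0 (illustrative sketch).

    r[s, a]   -- known rewards            P_hat[s, a] -- empirical next-state distributions
    N[s, a]   -- visit counts             U[(s, a)]   -- list of admissible distributions
    F_hat     -- pairs currently treated as adversarial
    conf(n)   -- assumed L1 confidence radius after n observations
    """
    V = np.zeros((T + 1, S))
    Q = np.zeros((T, S, A))
    for t in range(T - 1, -1, -1):
        v_next = V[t + 1]
        for s in range(S):
            for a in range(A):
                if (s, a) in F_hat:
                    # adversarial pair: worst case over the uncertainty set U(s, a)
                    future = min(float(p @ v_next) for p in U[(s, a)])
                else:
                    # stochastic pair: empirical estimate plus an optimism bonus,
                    # capped by the best achievable next value
                    span = v_next.max() - v_next.min()
                    future = min(float(P_hat[s, a] @ v_next) + 0.5 * conf(N[s, a]) * span,
                                 v_next.max())
                Q[t, s, a] = r[s, a] + future
            V[t, s] = Q[t, s].max()
    return Q, V
```

The optimistic policy then simply plays the greedy action, π̃t(s) = argmax_a Q[t, s, a].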

64 3.2 Stochasticity check. Every time a state-action (s, a) ∉ F̃ is executed, the outcome is recorded and subjected to a “stochasticity check”. [sent-214, score-0.187]

65 Let n be the total number of times (s, a) has been executed (including the latest one) and s1 , . . . , sn the corresponding observed next states. [sent-215, score-0.207]

66 Let τ be the total number of steps executed by the algorithm (from the beginning) so far. [sent-229, score-0.207]

67 The stochastic check fails if: $\sum_{j=1}^{n} \left( \hat{P}_{k_j}(\cdot|s,a) \cdot \tilde{V}^{k_j}_{t_j+1}(\cdot) - \tilde{V}^{k_j}_{t_j+1}(s_j) \right) > 5T \sqrt{nS \log \frac{4SAT\tau^2}{\delta}}$. [sent-230, score-0.306]

68 The stochastic check follows the intuitive saying “if it is not broken, don’t fix it”, by checking whether the value of the actual transitions from (s, a) is below what is expected from the parameter estimation. [sent-231, score-0.363]

69 One can show that with high probability, all stochastic state-action pairs will always pass the stochastic check. [sent-232, score-0.26]
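
Rendered literally in code, one application of the check for a pair (s, a) might look like this (a sketch; the threshold follows the check as reconstructed above, and the bookkeeping of which epoch kj and step tj each past execution belongs to is assumed to be recorded elsewhere):

```python
import math
import numpy as np

def stochastic_check_fails(records, T, S, A, tau, delta):
    """records holds one entry per past execution j of (s, a): a tuple
    (p_hat, v_next, s_j) with the empirical distribution used in epoch k_j,
    the value vector V~^{k_j}_{t_j+1}(.), and the observed next state s_j."""
    n = len(records)
    gap = sum(float(np.dot(p_hat, v_next)) - v_next[s_j]
              for p_hat, v_next, s_j in records)
    threshold = 5 * T * math.sqrt(n * S * math.log(4 * S * A * T * tau ** 2 / delta))
    return gap > threshold
```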

70 Given δ, T, S, A, the total regret of OLRM is $\Delta(m) \leq \tilde{O}(S T^{3/2} \sqrt{Am})$ for all m, with probability at least 1 − δ. [sent-239, score-0.241]

71 Since the total number of steps is τ = mT, the regret bound in terms of τ is therefore $\tilde{O}(S T \sqrt{A\tau})$. [sent-241, score-0.202]

72 The result shows that even though the algorithm deals with unknown stochastic and potentially adversarial states, it achieves the same regret bound as in the fully stochastic case. [sent-245, score-0.874]

73 Using Lemma 1, we show the following lemma that with high probability, all purely stochastic state-action pairs will always pass the stochastic check. [sent-259, score-0.371]

74 Recall that the check fails if $\sum_{j=1}^{n} \left( \hat{P}_{k_j}(\cdot|s,a) \cdot \tilde{V}^{k_j}_{t_j+1}(\cdot) - \tilde{V}^{k_j}_{t_j+1}(s_j) \right) > 5T \sqrt{nS \log \frac{4SAT\tau^2}{\delta}}$. [sent-268, score-0.2]

75 We can derive a high-probability bound that satisfies the stochastic check by applying the Azuma-Hoeffding inequality to the martingale difference sequence $X_j = p_{s,a}(\cdot) \cdot \tilde{V}^{k_j}_{t_j+1}(\cdot) - \tilde{V}^{k_j}_{t_j+1}(s_j)$, followed by an application of Lemma 1. [sent-269, score-0.259]
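
For completeness, the concentration step being invoked is the standard Azuma-Hoeffding inequality for bounded martingale differences, which in this notation reads (a textbook statement, not a result specific to the paper; the bound c is of order T because the value functions take values in [0, T]):

```latex
% Azuma-Hoeffding: if (X_j) is a martingale difference sequence with |X_j| \le c, then
\Pr\!\left[\sum_{j=1}^{n} X_j \ge \epsilon\right]
  \le \exp\!\left(-\frac{\epsilon^{2}}{2\,n\,c^{2}}\right),
\qquad
X_j \;=\; p_{s,a}(\cdot)\cdot\tilde{V}^{k_j}_{t_j+1}(\cdot) \;-\; \tilde{V}^{k_j}_{t_j+1}(s_j).
```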

76 all past transitions passed the test) would have optimistic Q values. [sent-279, score-0.268]

77 We can therefore bound the regret by bounding the difference between $\tilde{V}_t^k$ and the actual rewards received by the algorithm. [sent-284, score-0.277]

78 For an adversarial state-action (s, a) ∈ F, we use the following facts to ensure the above: (i) If (s, a) has been added to F (i. [sent-286, score-0.416]

79 Given a (stationary) policy π, its average undiscounted reward (or “gain”) is defined as follows: $g_P^\pi(s) = \lim_{\tau \to \infty} \frac{1}{\tau} \mathbb{E}_P\left[ \sum_{t=1}^{\tau} r(s_t, \pi(s_t)) \right]$ where $s_1 = s$. [sent-290, score-0.419]

80 An optimal minimax policy π∗ is any policy whose gain satisfies $g^{\pi^*} = g^* = \max_\pi g^\pi$. [sent-295, score-0.755]

81 We define the regret after executing the MDP M for τ steps as $\Delta(\tau) = \tau g^* - \sum_{t=1}^{\tau} r(s_t, a_t)$. [sent-296, score-0.202]

82 The main difference is in computing the optimistic policy and the corresponding stochastic check. [sent-298, score-0.502]

83 The algorithms from [Tewari and Bartlett, 2007] can be used to compute an optimistic minimax policy. [sent-300, score-0.347]

84 In more general settings, such as communicating or weakly communicating MDPs, although the optimal policies (for a fixed P) always have constant gain, the optimal minimax policies (over all possible P) might have non-constant gain. [sent-303, score-0.395]

85 Additional assumptions on U, as well as a slight change in the definition of the regret are needed to deal with these cases. [sent-304, score-0.202]

86 Let $\tilde{P}_k(\cdot|s, \tilde{\pi}_k(s))$ be the minimax choice of transition functions for each s where the minimax gain $\tilde{g}^{\tilde{\pi}_k}$ is attained. [sent-306, score-0.542]

87 The stochastic check for the infinite-horizon case is mostly identical to the finite-horizon case, except that we replace T with the maximal span $\tilde{H}$ of the bias, defined as $\tilde{H} = \max_{k \in \{k_1, \ldots, k_n\}} \left( \max_s \tilde{h}_k(s) - \min_s \tilde{h}_k(s) \right)$. [sent-308, score-0.335]

88 The stochastic check fails if: $\sum_{j=1}^{n} \left( \tilde{P}_{k_j}(\cdot|s,a) \cdot \tilde{h}_{k_j}(\cdot) - \tilde{h}_{k_j}(s_j) \right) > 5\tilde{H} \sqrt{nS \log \frac{4SA\tau^2}{\delta}}$. [sent-312, score-0.359]

89 Let H be the maximal span of the bias of any optimal minimax policy. [sent-313, score-0.219]
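
Computing the span quantities used here is a one-liner per bias vector (a minimal sketch, assuming the bias vectors h̃k of the relevant epochs are available as arrays):

```python
import numpy as np

def bias_span(h):
    """Span of one bias vector: max_s h(s) - min_s h(s)."""
    return float(np.max(h) - np.min(h))

def max_bias_span(bias_vectors):
    """H~: the largest span among the supplied bias vectors h~_k."""
    return max(bias_span(h) for h in bias_vectors)
```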

90 Given δ, S, A, the total regret of OLRM2 is $\Delta(\tau) \leq \tilde{O}(S H \sqrt{A\tau})$ for all τ, with probability at least 1 − δ. [sent-317, score-0.241]

91 [Figure: total reward accumulated over time by OLRM2, UCRL2, a standard robust MDP, and the optimal minimax policy.] [sent-319]

92 It shows that UCRL2 accumulates smaller total rewards than the optimal minimax policy while our algorithm actually accumulates larger total rewards than the minimax policy. [sent-328, score-1.044]

93 We also include the result for a standard robust MDP that treats all state-action pairs as adversarial and therefore performs poorly. [sent-329, score-0.534]

94 We show that it achieves a regret bound similar to that of the fully stochastic case. [sent-332, score-0.308]

95 A natural extension is to allow the learning of the uncertainty sets in adversarial states, where the true uncertainty set is unknown. [sent-333, score-0.537]

96 Our preliminary results show that very similar regret bounds can be obtained for learning from a class of nested uncertainty sets. [sent-334, score-0.279]

97 The best of both worlds: Stochastic and adversarial bandits. [sent-347, score-0.383]

98 The adversarial stochastic shortest path problem with unknown transition probabilities. [sent-422, score-0.627]

99 Robust control of markov decision processes with uncertain transition matrices. [sent-435, score-0.298]

100 Bounded parameter markov decision processes with average reward criterion. [sent-452, score-0.345]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('adversarial', 0.383), ('policy', 0.268), ('mdp', 0.226), ('minimax', 0.219), ('mannor', 0.209), ('regret', 0.202), ('executed', 0.168), ('vtj', 0.159), ('check', 0.153), ('reward', 0.151), ('adversary', 0.134), ('episode', 0.132), ('optimistic', 0.128), ('neu', 0.121), ('vt', 0.121), ('mdps', 0.119), ('transitions', 0.107), ('stochastic', 0.106), ('olrm', 0.106), ('vtk', 0.106), ('transition', 0.104), ('robust', 0.103), ('jaksch', 0.101), ('puterman', 0.096), ('iyengar', 0.093), ('pk', 0.093), ('decision', 0.091), ('slivkins', 0.086), ('tennenholtz', 0.079), ('weissman', 0.079), ('purely', 0.079), ('uncertainty', 0.077), ('nilim', 0.077), ('action', 0.077), ('rewards', 0.075), ('singapore', 0.075), ('st', 0.072), ('brafman', 0.07), ('strens', 0.07), ('pkj', 0.07), ('episodes', 0.068), ('opponent', 0.064), ('bubeck', 0.064), ('nk', 0.064), ('ghaoui', 0.063), ('phase', 0.063), ('epoch', 0.06), ('qk', 0.058), ('markov', 0.058), ('tewari', 0.055), ('accumulates', 0.055), ('conservativeness', 0.053), ('hkj', 0.053), ('el', 0.052), ('et', 0.051), ('communicating', 0.05), ('nicely', 0.048), ('maxa', 0.048), ('state', 0.048), ('knows', 0.048), ('yu', 0.048), ('pairs', 0.048), ('behave', 0.047), ('fails', 0.047), ('stateaction', 0.047), ('mcdiarmid', 0.047), ('horizon', 0.046), ('processes', 0.045), ('accumulated', 0.045), ('ns', 0.044), ('sj', 0.043), ('nitehorizon', 0.043), ('xu', 0.043), ('deals', 0.043), ('gp', 0.041), ('reinforcement', 0.04), ('conservative', 0.04), ('plays', 0.04), ('si', 0.039), ('behavior', 0.039), ('total', 0.039), ('rgy', 0.038), ('shie', 0.038), ('policies', 0.038), ('hk', 0.038), ('maker', 0.037), ('pair', 0.036), ('outcome', 0.034), ('challenge', 0.034), ('knowing', 0.034), ('unknown', 0.034), ('bandit', 0.033), ('states', 0.033), ('hindsight', 0.033), ('passed', 0.033), ('added', 0.033), ('lemma', 0.032), ('gy', 0.032), ('arbitrarily', 0.032), ('bartlett', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

Author: Shiau Hong Lim, Huan Xu, Shie Mannor

Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1

2 0.39442518 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Author: Yasin Abbasi, Peter Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari

Abstract: We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves p O( T log |⇧| + log |⇧|) regret with respect to a comparison set of policies ⇧. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set ⇧ has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. 1

3 0.26790538 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

4 0.24746622 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

Author: Ian Osband, Dan Russo, Benjamin Van Roy

Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τ S √(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1

5 0.24513212 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1

6 0.2342755 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

7 0.21917239 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

8 0.21707577 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

9 0.21412706 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

10 0.20412378 347 nips-2013-Variational Planning for Graph-based MDPs

11 0.19728832 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

12 0.18726836 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

13 0.1804454 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

14 0.17437656 348 nips-2013-Variational Policy Search via Trajectory Optimization

15 0.17189975 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

16 0.16832037 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

17 0.16571064 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

18 0.15877421 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

19 0.15865284 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

20 0.15256977 79 nips-2013-DESPOT: Online POMDP Planning with Regularization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.275), (1, -0.444), (2, -0.001), (3, 0.018), (4, -0.032), (5, -0.056), (6, 0.024), (7, -0.003), (8, -0.041), (9, 0.012), (10, 0.036), (11, 0.002), (12, 0.054), (13, 0.052), (14, 0.092), (15, -0.019), (16, 0.091), (17, -0.089), (18, -0.01), (19, -0.053), (20, -0.008), (21, 0.01), (22, -0.049), (23, 0.036), (24, 0.064), (25, 0.045), (26, -0.092), (27, -0.054), (28, -0.03), (29, -0.001), (30, -0.064), (31, 0.073), (32, -0.033), (33, 0.111), (34, -0.05), (35, 0.008), (36, -0.057), (37, 0.038), (38, -0.033), (39, 0.036), (40, -0.027), (41, -0.047), (42, -0.054), (43, 0.065), (44, 0.024), (45, 0.045), (46, 0.06), (47, 0.031), (48, -0.017), (49, -0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96714568 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

Author: Shiau Hong Lim, Huan Xu, Shie Mannor

Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1

2 0.89026618 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

Author: Asrar Ahmed, Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet

Abstract: In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focussed on the computation of maximin policies which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed minimax regret as a suitable alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.

3 0.85054773 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

Author: Ian Osband, Dan Russo, Benjamin Van Roy

Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τ S √(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1

4 0.80826497 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

Author: Alexander Zimin, Gergely Neu

Abstract: We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2 L|X ||A|T log(|X ||A|/L) in the bandit setting and 2L T log(|X ||A|/L) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions. 1

5 0.80141884 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Author: Yasin Abbasi, Peter Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari

Abstract: We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves p O( T log |⇧| + log |⇧|) regret with respect to a comparison set of policies ⇧. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set ⇧ has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. 1

6 0.78503251 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

7 0.73671812 347 nips-2013-Variational Planning for Graph-based MDPs

8 0.72327119 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

9 0.70785469 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

10 0.70537513 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

11 0.66930395 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

12 0.62550682 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

13 0.62366521 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

14 0.60924184 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

15 0.59612578 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

16 0.58639359 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

17 0.5771085 280 nips-2013-Robust Data-Driven Dynamic Programming

18 0.57193696 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

19 0.56201583 241 nips-2013-Optimizing Instructional Policies

20 0.54656869 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.027), (2, 0.053), (10, 0.014), (16, 0.021), (33, 0.115), (34, 0.088), (41, 0.032), (49, 0.025), (56, 0.161), (70, 0.025), (79, 0.04), (85, 0.054), (89, 0.223), (93, 0.027), (95, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94346154 109 nips-2013-Estimating LASSO Risk and Noise Level

Author: Mohsen Bayati, Murat A. Erdogdu, Andrea Montanari

Abstract: We study the fundamental problems of variance and risk estimation in high dimensional statistical modeling. In particular, we consider the problem of learning a coefficient vector θ0 ∈ Rp from noisy linear observations y = Xθ0 + w ∈ Rn (p > n) and the popular estimation procedure of solving the 1 -penalized least squares objective known as the LASSO or Basis Pursuit DeNoising (BPDN). In this context, we develop new estimators for the 2 estimation risk θ − θ0 2 and the variance of the noise when distributions of θ0 and w are unknown. These can be used to select the regularization parameter optimally. Our approach combines Stein’s unbiased risk estimate [Ste81] and the recent results of [BM12a][BM12b] on the analysis of approximate message passing and the risk of LASSO. We establish high-dimensional consistency of our estimators for sequences of matrices X of increasing dimensions, with independent Gaussian entries. We establish validity for a broader class of Gaussian designs, conditional on a certain conjecture from statistical physics. To the best of our knowledge, this result is the first that provides an asymptotically consistent risk estimator for the LASSO solely based on data. In addition, we demonstrate through simulations that our variance estimation outperforms several existing methods in the literature. 1

2 0.9302454 91 nips-2013-Dirty Statistical Models

Author: Eunho Yang, Pradeep Ravikumar

Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1

3 0.92247385 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

Author: George H. Chen, Stanislav Nikolov, Devavrat Shah

Abstract: For classifying time series, a nearest-neighbor approach is widely used in practice with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren’t actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a “weighted majority voting” classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. We then use weighted majority to forecast which news topics on Twitter become trends, where we are able to detect such “trending topics” in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. 1

4 0.91666454 305 nips-2013-Spectral methods for neural characterization using generalized quadratic models

Author: Il M. Park, Evan W. Archer, Nicholas Priebe, Jonathan W. Pillow

Abstract: We describe a set of fast, tractable methods for characterizing neural responses to high-dimensional sensory stimuli using a model we refer to as the generalized quadratic model (GQM). The GQM consists of a low-rank quadratic function followed by a point nonlinearity and exponential-family noise. The quadratic function characterizes the neuron’s stimulus selectivity in terms of a set linear receptive fields followed by a quadratic combination rule, and the invertible nonlinearity maps this output to the desired response range. Special cases of the GQM include the 2nd-order Volterra model [1, 2] and the elliptical Linear-Nonlinear-Poisson model [3]. Here we show that for “canonical form” GQMs, spectral decomposition of the first two response-weighted moments yields approximate maximumlikelihood estimators via a quantity called the expected log-likelihood. The resulting theory generalizes moment-based estimators such as the spike-triggered covariance, and, in the Gaussian noise case, provides closed-form estimators under a large class of non-Gaussian stimulus distributions. We show that these estimators are fast and provide highly accurate estimates with far lower computational cost than full maximum likelihood. Moreover, the GQM provides a natural framework for combining multi-dimensional stimulus sensitivity and spike-history dependencies within a single model. We show applications to both analog and spiking data using intracellular recordings of V1 membrane potential and extracellular recordings of retinal spike trains. 1

same-paper 5 0.88261449 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

Author: Shiau Hong Lim, Huan Xu, Shie Mannor

Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1

6 0.8568576 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

7 0.8388322 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

8 0.82263434 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

9 0.8128171 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition

10 0.81271273 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

11 0.80125856 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

12 0.79887021 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

13 0.79585218 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

14 0.7896924 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization

15 0.788059 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

16 0.78311175 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

17 0.77925044 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation

18 0.77803755 147 nips-2013-Lasso Screening Rules via Dual Polytope Projection

19 0.77647555 66 nips-2013-Computing the Stationary Distribution Locally

20 0.77643794 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints