nips nips2012 nips2012-243 knowledge-graph by maker-knowledge-mining

243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method


Source: pdf

Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi

Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. [sent-8, score-0.381]

2 In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. [sent-9, score-0.321]

3 We accomplish this by developing a kernel-based mathematical program for ADP. [sent-10, score-0.135]

4 Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. [sent-11, score-0.309]

5 The central computational problem is then reduced to the computation of an optimal ‘cost-to-go’ function that encodes the cost incurred under an optimal policy starting from any given MDP state. [sent-13, score-0.259]

6 Many MDPs of practical interest suffer from the curse of dimensionality, where intractably large state spaces preclude exact computation of the cost-to-go function. [sent-14, score-0.104]

7 Approximate dynamic programming (ADP) is an umbrella term for algorithms designed to produce good approximation to this function, yielding a natural ‘greedy’ control policy. [sent-15, score-0.244]

8 ADP algorithms are, in large part, parametric in nature, requiring the user to provide an 'approximation architecture' (i.e., a set of basis functions). [sent-16, score-0.098]

9 The algorithm then produces an approximation in the span of this basis. [sent-19, score-0.101]

10 These results highlight the importance of selecting a ‘good’ approximation architecture, and remain somewhat dissatisfying in that additional sampling or computational effort cannot remedy a bad approximation architecture. [sent-21, score-0.271]

11 On the other hand, a non-parametric approach would, in principle, permit the user to select a rich, potentially full-dimensional architecture (e.g., one induced by a kernel). [sent-22, score-0.097]

12 1 The key computational step in approximate policy iteration methods is approximate policy evaluation. [sent-28, score-0.49]

13 Unfortunately, schemes such as approximate policy iteration have no convergence guarantees in parametric settings, and these difficulties remain in non-parametric variations. [sent-32, score-0.319]

14 Another idea has been to use kernel-based local averaging ideas to approximate the solution of an MDP with that of a simpler variation on a sampled state space [5, 6, 7]. [sent-33, score-0.146]
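
A minimal Nadaraya-Watson-style sketch of the local-averaging idea (an illustration of the general technique, not the specific algorithms of [5, 6, 7]; the state representation, bandwidth, and sampled values are assumptions):

```python
import numpy as np

def local_average_value(x, sampled_states, sampled_values, bandwidth=10.0):
    """Kernel-weighted average of cost-to-go estimates at sampled states."""
    x = np.asarray(x, dtype=float)
    sq_dist = np.sum((np.asarray(sampled_states, dtype=float) - x) ** 2, axis=1)
    w = np.exp(-sq_dist / bandwidth)          # Gaussian weights
    if w.sum() == 0.0:                        # fall back to a plain mean
        return float(np.mean(sampled_values))
    return float(np.dot(w, sampled_values) / w.sum())
```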

15 Closely related to our work, [9, 10] consider modifying the approximate linear program with an ℓ1 regularization term to encourage sparse approximations in the span of a large, but necessarily tractable, set of features. [sent-37, score-0.252]

16 In contrast to this line of work, our approach will allow for approximations in a potentially infinite-dimensional approximation architecture with a constraint on an appropriate ℓ2-norm of the weight vector. [sent-40, score-0.287]

17 The non-parametric ADP algorithm we develop enjoys non-trivial approximation and sample complexity guarantees. [sent-41, score-0.182]

18 The resulting mathematical program, which we dub the regularized smoothed approximate LP (RSALP), is distinct from simply substituting a kernel-based approximation in the SALP formulation. [sent-45, score-0.248]

19 We develop a companion active set method that is capable of solving this mathematical program rapidly and with limited memory requirements. [sent-46, score-0.161]

20 2 We establish a graceful approximation guarantee for our algorithm. [sent-48, score-0.242]

21 Our algorithm can be interpreted as solving an approximate linear program in an appropriate Hilbert space. [sent-49, score-0.182]

22 We provide, with high probability, an upper bound on the approximation error of the algorithm relative to the best possible approximation subject to a regularization constraint. [sent-50, score-0.25]

23 The sampling requirements for our method are, in fact, independent of the dimension of the approximation architecture. [sent-51, score-0.177]

24 Hence, the sampling requirements are a function of the complexity of the approximation, not of the dimension of the approximating architecture. [sent-53, score-0.099]

25 This result can be seen as the ‘right’ generalization of the prior parametric approximate LP approaches [13, 14, 12], where, in contrast, sample complexity grows with the dimension of the approximating architecture. [sent-54, score-0.205]

26 To study the efficacy of RSALP, we consider an MDP arising from a challenging queueing network scheduling problem. [sent-56, score-0.332]

27 We demonstrate that our RSALP method yields significant improvements over known heuristics and standard parametric ADP methods. [sent-57, score-0.097]

28 2 These guarantees come under assumption of being able to sample from a certain idealized distribution. [sent-60, score-0.095]

29 2 2 Formulation Consider a discrete time Markov decision process with finite state space S and finite action space A. [sent-62, score-0.091]

30 We assume time-homogeneous Markovian dynamics: conditioned on being at state x and taking action a, the system transitions to state x′ with probability p(x, x′, a) independent of the past. [sent-64, score-0.1]

31 A policy is a map µ : S → A, so that J^µ(x) ≜ E_{x,µ}[ Σ_{t=0}^∞ α^t g_{x_t, a_t} ] represents the expected (discounted, infinite horizon) cost-to-go under policy µ starting at state x. [sent-65, score-0.438]

32 Letting Π denote the set of all policies, our goal is to find an optimal policy µ∗ such that µ∗ ∈ argmin_{µ∈Π} J^µ(x) for all x ∈ S (it is well known that such a policy exists). [sent-66, score-0.499]

33 An optimal policy µ∗ can be recovered as a 'greedy' policy with respect to J∗: µ∗(x) ∈ argmin_{a∈A} { g_{x,a} + αE_{x,a}[J∗(X′)] }, where we define E_{x,a}[f(X′)] ≜ Σ_{x′∈S} p(x, x′, a) f(x′) for all f : S → R. [sent-68, score-0.431]
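
To make the greedy step concrete, here is a small sketch; the MDP interface (cost, transition probabilities, cost-to-go estimate) is a hypothetical stand-in rather than code from the paper:

```python
def greedy_action(x, actions, cost, trans_prob, J, alpha):
    """Return an action in argmin_a { g(x, a) + alpha * E_{x,a}[J(X')] }.

    cost(x, a)       -> immediate cost g_{x,a}
    trans_prob(x, a) -> dict mapping next states x' to p(x, x', a)
    J(x)             -> estimate of the cost-to-go at state x
    """
    def q_value(a):
        expected_next = sum(p * J(x_next) for x_next, p in trans_prob(x, a).items())
        return cost(x, a) + alpha * expected_next
    return min(actions, key=q_value)
```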

34 ADP algorithms are principally tasked with computing approximations to J∗ of the form J∗(x) ≈ z⊤Φ(x) ≜ J̃(x), where Φ : S → R^m is referred to as an 'approximation architecture' or a basis and must be provided as input to the ADP algorithm. [sent-70, score-0.111]

35 The ADP algorithm computes a 'weight' vector z; one then employs a policy that is greedy with respect to the corresponding approximation J̃. [sent-71, score-0.339]

36 1 Primal Formulation Motivated by the LP for exact dynamic programming, a series of ADP algorithms [15, 13, 12] have been proposed that compute a weight vector z by solving an appropriate modification of the exact LP for dynamic programming. [sent-73, score-0.228]

37 The resulting program (1) has constraints z⊤Φ(x) ≤ g_{a,x} + αE_{x,a}[z⊤Φ(X′)] + s_x for all x ∈ S and a ∈ A, with z ∈ R^m and s ∈ R^S_+. [sent-76, score-0.126]

38 In parsing the above program, notice that if one insisted that the slack variables s were precisely 0, one would be left with the ALP proposed by [15]. [sent-77, score-0.112]

39 [13] provided a pioneering analysis that loosely showed ‖J∗ − (z∗)⊤Φ‖_{1,ν} ≤ (2/(1 − α)) inf_z ‖J∗ − z⊤Φ‖_∞ for an optimal solution z∗ to the ALP; [12] showed that these bounds could be improved upon substantially by 'smoothing' the constraints of the ALP, i.e., by permitting slack as in (1). [sent-78, score-0.095]

40 The value function approximation in this case would be given by J̃_{z,b}(x) ≜ ⟨x, z⟩ + b = ⟨Φ(x), z⟩ + b, (2) where b is a scalar offset corresponding to a constant basis function. [sent-89, score-0.205]

41 The following generalization of (1) — which we dub the regularized SALP (RSALP) — then essentially suggests itself: max Σ_{x∈S} ν_x ⟨x, z⟩ + b − κ Σ_{x∈S} π_x s_x − (Γ/2) ⟨z, z⟩, (3) subject to [sent-90, score-0.173]

42 ⟨x, z⟩ + b ≤ g_{a,x} + αE_{x,a}[⟨X′, z⟩ + b] + s_x, ∀ x ∈ S, a ∈ A, with z ∈ H, b ∈ R, s ∈ R^S_+. [sent-92, score-0.126]

43 The only 'new' ingredient in the program above is the fact that we regularize z using the parameter Γ > 0. [sent-93, score-0.112]

44 Constraining ‖z‖_H ≜ √⟨z, z⟩ to lie within some ℓ2-ball anticipates that we will eventually resort to sampling in solving this program, and we cannot hope for a reasonable number of samples to provide a good solution to a problem where z was unconstrained. [sent-94, score-0.208]

45 This regularization, which plays a crucial role both in theory and practice, is easily missed if one directly 'plugs in' a local averaging approximation in place of z⊤Φ(x), as is the case in the earlier work of [5, 6, 7, 8] and others. [sent-95, score-0.124]

46 Since program (3) can be interpreted as a regularized stochastic optimization problem, one may hope to solve it via its sample average approximation. [sent-98, score-0.17]

47 To this end, define the likelihood ratio w_x ≜ ν_x/π_x, and let Ŝ ⊂ S be a set of N states sampled independently according to the distribution π. [sent-99, score-0.106]

48 The sample average approximation of (3) is then max (1/N) Σ_{x∈Ŝ} w_x ⟨x, z⟩ + b − (κ/N) Σ_{x∈Ŝ} s_x − (Γ/2) ⟨z, z⟩, (4) subject to [sent-100, score-0.324]

49 ⟨x, z⟩ + b ≤ g_{a,x} + αE_{x,a}[⟨X′, z⟩ + b] + s_x, ∀ x ∈ Ŝ, a ∈ A, with z ∈ H, b ∈ R, s ∈ R^Ŝ_+. [sent-102, score-0.126]

50 Even if |Ŝ| were small, it is still not clear that this program can be solved effectively. [sent-104, score-0.112]
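
As an illustration only, the sampled program (4) can be handed to an off-the-shelf convex solver once the Hilbert space is replaced by a finite-dimensional feature map; the sketch below does exactly that (the feature matrices, costs, and likelihood ratios are assumed inputs, and this is not the active set method the paper actually uses):

```python
import numpy as np
import cvxpy as cp

def solve_sampled_rsalp(Phi, Phi_next_exp, g, w, alpha, kappa, Gamma):
    """Sketch of (4) with a finite-dimensional feature map standing in for H.

    Phi          : (N, m) features of the N sampled states
    Phi_next_exp : (N, A, m) expected next-state features E_{x,a}[Phi(X')]
    g            : (N, A) one-step costs g_{x,a}
    w            : (N,) likelihood ratios w_x = nu_x / pi_x
    """
    N, m = Phi.shape
    A = g.shape[1]
    z, b = cp.Variable(m), cp.Variable()
    s = cp.Variable(N, nonneg=True)
    objective = cp.Maximize(
        (1.0 / N) * cp.sum(cp.multiply(w, Phi @ z)) + b
        - (kappa / N) * cp.sum(s)
        - (Gamma / 2.0) * cp.sum_squares(z)
    )
    constraints = [
        Phi[i] @ z + b <= g[i, a] + alpha * (Phi_next_exp[i, a] @ z + b) + s[i]
        for i in range(N) for a in range(A)
    ]
    cp.Problem(objective, constraints).solve()
    return z.value, b.value
```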

51 If the optimal solution to (7) is attained at some λ∗, then the optimal solution to (4) is attained at some (z∗, s∗, b∗) with [sent-118, score-0.1]

52 z∗ = (1/Γ) [ (1/N) Σ_{x∈Ŝ} w_x x − Σ_{x∈Ŝ, a∈A} λ∗_{x,a} ( x − αE_{x,a}[X′] ) ]. (8) Having solved this program, we may, using Proposition 1, recover our approximate cost-to-go function J̃(x) = ⟨z∗, x⟩ + b∗ as J̃(x) = (1/Γ) [ (1/N) Σ_{y∈Ŝ} w_y ⟨y, x⟩ − Σ_{y∈Ŝ, a∈A} λ∗_{y,a} ( ⟨y, x⟩ − αE_{y,a}[⟨X′, x⟩] ) ] + b∗. (9) [sent-119, score-0.107]

53 A policy greedy with respect to J̃ is not affected by constant translations, hence in (9) the value of b∗ can be set to zero arbitrarily. [sent-120, score-0.238]

54 Consequently, given a positive definite kernel, we simply replace every inner product ⟨x, x′⟩ in the definition of program (7) with the quantity K(x, x′), and similarly in the approximation (9). [sent-124, score-0.27]
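
A rough sketch of how the kernelized approximation (9) could be evaluated once the dual variables are in hand; the interfaces (dual dictionary, expected-kernel routine) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gaussian_kernel(x, y, h=100.0):
    # K(x, y) = exp(-||x - y||^2 / h); h = 100 is the bandwidth quoted later.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-np.sum((x - y) ** 2) / h))

def evaluate_value(x, sampled_states, w, lam, expected_kernel, alpha, Gamma,
                   kernel=gaussian_kernel, b=0.0):
    """Kernelized version of (9): inner products <y, x> become K(y, x).

    sampled_states  : list of sampled states y in S_hat
    w               : likelihood ratios w_y, same length as sampled_states
    lam             : dict {(j, a): lambda*_{y_j, a}} of optimal dual variables
    expected_kernel : function (j, a, x) -> E_{y_j, a}[K(X', x)]
    """
    N = len(sampled_states)
    total = sum(w[j] * kernel(y, x) for j, y in enumerate(sampled_states)) / N
    for (j, a), lam_ja in lam.items():
        total -= lam_ja * (kernel(sampled_states[j], x)
                           - alpha * expected_kernel(j, a, x))
    return total / Gamma + b    # b may be set to zero, as noted above
```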

55 We defer the details of the procedure as well as the theoretical analysis to the Online Supplement. (Section 3: Approximation Guarantees.) Recall that we are employing an approximation J̃_{z,b} of the form (2), parameterized by the weight vector z and the offset parameter b. [sent-130, score-0.178]

56 Now denoting by C the feasible region of the RSALP projected onto the z and b co-ordinates, the best possible approximation one may hope for among those permitted by the RSALP will have ℓ∞-approximation error inf_{(z,b)∈C} ‖J∗ − J̃_{z,b}‖_∞. [sent-131, score-0.162]

57 With C fixed, what approximation guarantee can be obtained for a solution to the RSALP? [sent-136, score-0.148]

58 This section will show that one can achieve a guarantee that is, in essence, within a certain constant multiple of the optimal approximation error using a number of samples that is independent of the size of the state space and the dimension of the approximation architecture. [sent-137, score-0.344]

59 Given the definition of J̃_{z,b} in (2), we consider the following sampled version of the RSALP: max ν⊤J̃_{z,b} − (2/((1 − α)N)) Σ_{x∈Ŝ} s_x, (10) subject to [sent-141, score-0.169]

60 ⟨x, z⟩ + b ≤ g_{a,x} + αE_{x,a}[⟨X′, z⟩ + b] + s_x, ∀ x ∈ Ŝ, a ∈ A, ‖z‖_H ≤ C, |b| ≤ B, z ∈ H, b ∈ R, s ∈ R^Ŝ_+. [sent-143, score-0.126]

61 We will assume that states are sampled according to an idealized distribution. [sent-144, score-0.104]

62 The idealized distribution is given by π_{µ∗,ν} ≜ (1 − α) Σ_{t=0}^∞ α^t ν⊤P^t_{µ∗}. (11) Here, P_{µ∗} is the transition matrix under the optimal policy µ∗. [sent-146, score-0.23]
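
A distribution of this discounted-occupancy form can be sampled by running the policy for a geometrically distributed number of steps; here is a minimal sketch (it requires the policy µ and a simulator, which is exactly why the experiments later substitute a simpler geometric distribution):

```python
import numpy as np

def sample_from_occupancy(rng, sample_initial_state, policy, step, alpha):
    """Draw one state from (1 - alpha) * sum_t alpha^t nu' P_mu^t.

    sample_initial_state(rng) -> x0 drawn from nu
    policy(x)                 -> action chosen by mu at state x
    step(rng, x, a)           -> next state drawn from p(x, ., a)
    """
    x = sample_initial_state(rng)
    # numpy's geometric is supported on {1, 2, ...}; shift so that T >= 0 and
    # P(T = t) = (1 - alpha) * alpha**t.
    T = rng.geometric(1.0 - alpha) - 1
    for _ in range(T):
        x = step(rng, x, policy(x))
    return x
```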

63 In addition, this program is somewhat distinct from the program presented earlier, (4): (1) As opposed to a ‘soft’ regularization term in the objective, we have a ‘hard’ regularization constraint, ￿z￿H ≤ C. [sent-148, score-0.367]

64 It is analogous to the approximation results produced in parametric settings, with the important distinction that one allows comparisons to approximations in potentially full-dimensional basis sets which might be substantially superior. [sent-162, score-0.308]

65 This quantity has no explicit dependence on the dimension of the approximation architecture. [sent-164, score-0.164]

66 Prior sample complexity bounds (e.g., [14, 12]) typically scale with the dimension of the approximation architecture. [sent-166, score-0.131]

67 This allows us to seek 'simple' (i.e., with ‖z‖_H small) approximations in a rich feature space as opposed to restricting us to some a-priori fixed, low-dimensional feature space. [sent-171, score-0.125]

68 In summary, increased sampling yields approximations of increasing quality, approaching an exact approximation. [sent-175, score-0.147]

69 If J∗ admits a good approximation with ‖z‖_H small, one can expect a good approximation with a reasonable number of samples. [sent-176, score-0.202]

70 Geometrically, the proof works loosely by translating the 'best' approximation given the regularization constraints (i.e., the infimum over ‖z‖_H ≤ C, b ∈ R) to one that is guaranteed to yield an approximation error no worse than that produced by the RSALP. [sent-182, score-0.279]

71 To establish a guarantee for the sampled RSALP, we first pose the RSALP as a stochastic optimization problem by setting s(z, b) ≜ (J̃_{z,b} − T J̃_{z,b})_+. [sent-183, score-0.137]

72 We must ensure that with high probability, the sample averages in the sampled program are close to the exact expectations, uniformly for all possible values of (z, b) with high accuracy. [sent-184, score-0.219]

73 In order to establish such a guarantee, we bound the Rademacher complexity of the class of functions given by F̄_{Ŝ,µ} ≜ { x ↦ (J̃_{z,b}(x) − T_µ J̃_{z,b}(x))_+ : ‖z‖_H ≤ C, |b| ≤ B }, [sent-185, score-0.332]

74 Figure 1: The queueing network example (four queues served by two servers; e.g., service rate µ2 = 0.12). [sent-191, score-0.526]

75 (Here T_µ is the Bellman operator associated with policy µ.) This yields the appropriate uniform large deviations bound. [sent-192, score-0.224]
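
For intuition only, empirical Rademacher complexity can be estimated by Monte Carlo over a finite set of candidate functions; the paper's argument is analytical and does not compute anything like this:

```python
import numpy as np

def empirical_rademacher(function_values, num_draws=200, seed=0):
    """Estimate E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ] for a finite class.

    function_values : (F, n) array; row k holds candidate f_k at the n samples
    """
    rng = np.random.default_rng(seed)
    _, n = function_values.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
        total += float(np.max(function_values @ sigma)) / n
    return total / num_draws
```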

76 Using this guarantee we show that the optimal solution to the sampled RSALP yields similar approximation guarantees as that with the exact RSALP; this proof is somewhat delicate as it appears difficult to directly show that the optimal solutions themselves are close. [sent-193, score-0.349]

77 There are two ‘flows’ in this network: the first through server 1 followed by server 2 (with buffering at queues 1 and 2, respectively), and the second through server 2 followed by server 1 (with buffering at queues 4 and 3, respectively). [sent-195, score-0.79]

78 In the equivalent discrete time problem, at most a single event can occur in a given epoch, corresponding either to the arrival of a job at queues 1 or 4, or the arrival of a service token for one of the four queues with probability proportional to the corresponding rates. [sent-203, score-0.376]

79 The state of the system is described by the number of jobs in each of the four queues, so that S ≜ Z^4_+, whereas the action space A consists of four potential actions, each corresponding to a matching between servers and queues. [sent-204, score-0.147]

80 We take the single period cost as the total number of jobs in the system, so that g_{x,a} = ‖x‖_1; note that minimizing the average number of jobs in the system is equivalent to minimizing average delay by Little's law. [sent-205, score-0.166]
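
A sketch of one epoch of this uniformized discrete-time chain follows; the arrival and service rates are placeholders (only µ2 = 0.12 is legible from the figure residue above), and the routing follows the two flows described in sentence 77:

```python
import numpy as np

# Placeholder rates; the actual values come from Figure 1 of the paper.
RATES = {"lam1": 0.08, "lam4": 0.08, "mu1": 0.12, "mu2": 0.12, "mu3": 0.28, "mu4": 0.28}

# An action matches server 1 to queue 1 or 3, and server 2 to queue 2 or 4.
ACTIONS = [(1, 2), (1, 4), (3, 2), (3, 4)]

def step(rng, x, action, rates=RATES):
    """At most one event per epoch, chosen with probability proportional to its
    rate: an arrival at queue 1 or 4, or a service token for one of the queues.
    Returns (next_state, single-period cost g_{x,a} = ||x||_1)."""
    x = list(x)
    cost = sum(x)                             # cost charged at the current state
    events = [("arr1", rates["lam1"]), ("arr4", rates["lam4"]),
              ("svc1", rates["mu1"]), ("svc2", rates["mu2"]),
              ("svc3", rates["mu3"]), ("svc4", rates["mu4"])]
    names = [e[0] for e in events]
    probs = np.array([e[1] for e in events])
    probs = probs / probs.sum()
    event = rng.choice(names, p=probs)
    if event == "arr1":
        x[0] += 1
    elif event == "arr4":
        x[3] += 1
    else:
        q = int(event[-1])                    # queue receiving the service token
        if q in action and x[q - 1] > 0:      # useful only if scheduled and non-empty
            x[q - 1] -= 1
            if q == 1:
                x[1] += 1                     # flow 1: queue 1 feeds queue 2
            elif q == 4:
                x[2] += 1                     # flow 2: queue 4 feeds queue 3
    return tuple(x), cost
```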

81 We solve (7) using the active set method outlined in the Online Supplement, taking as our kernel the standard Gaussian radial basis function kernel K(x, y) ≜ exp(−‖x − y‖²/h), with the bandwidth parameter h ≜ 100. [sent-210, score-0.155]

82 Since the idealized sampling distribution π_{µ∗,ν} is unavailable to us, we use in its place the geometric distribution π(x) ≜ (1 − ζ)^4 ζ^{‖x‖_1}, with the sampling parameter ζ set at 0. [sent-213, score-0.153]
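
Because π factorizes into independent geometric queue lengths, sampling states from it is immediate; a sketch (the exact value of ζ is cut off above, so the default below is only a placeholder):

```python
import numpy as np

def sample_state(rng, zeta=0.9, num_queues=4):
    """Draw x from pi(x) = (1 - zeta)^4 * zeta^{||x||_1}: each queue length is an
    independent geometric variable supported on {0, 1, 2, ...}."""
    # numpy's geometric lives on {1, 2, ...}; shift down by one.
    return tuple(rng.geometric(1.0 - zeta, size=num_queues) - 1)

rng = np.random.default_rng(0)
S_hat = [sample_state(rng) for _ in range(1000)]   # a sampled state set
```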

83 This choice was also made by [12]. [sent-218, score-0.201]

84 Table 1 reports policy performance for the Longest Queue and Max-Weight heuristics, SALP with the cubic basis, and RSALP with the Gaussian kernel, at several sample sizes. [sent-220, score-0.155]

85 Table 1: Performance results in the queueing example. [sent-236, score-0.235]

86 Thus, we solve the sample average approximation of this program using the same geometric sampling distribution and parameter κ. [sent-242, score-0.293]

87 Approximation architectures in which the basis functions are monomials of the queue lengths appear to be a popular choice for queueing control problems [13]. [sent-243, score-0.506]

88 We use all monomials with degree at most 3, which we will call the cubic basis, as our approximation architectures. [sent-244, score-0.167]
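
One plausible reading of the cubic basis is the set of all monomials of the queue lengths with total degree at most three (whether a constant term is included is not spelled out here); a sketch:

```python
from itertools import combinations_with_replacement
import numpy as np

def cubic_features(x):
    """All monomials of total degree <= 3 in the queue lengths, plus a constant."""
    x = np.asarray(x, dtype=float)
    feats = [1.0]                                    # degree-0 term
    for degree in (1, 2, 3):
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(float(np.prod(x[list(idx)])))
    return np.array(feats)

# For four queues this gives 1 + 4 + 10 + 20 = 35 features.
```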

89 This is a simple heuristic approach: at any given time, a server chooses to work on the longest queue from among those it can service. [sent-246, score-0.313]

90 Max-Weight is a well known scheduling heuristic for queueing networks. [sent-248, score-0.326]

91 The policy is obtained as the greedy policy with respect to a value function approximation of the form J̃_MW(x) ≜ Σ_{i=1}^4 |x_i|^{1+ε}, given a parameter ε > 0. [sent-249, score-0.54]
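
Minimal sketches of the two heuristics follow, using the server-to-queue assignment of sentence 77 (server 1 serves queues 1 and 3, server 2 serves queues 2 and 4); the Max-Weight rule shown is one common simplification of being greedy with respect to J̃_MW, and the rates and ε are placeholders:

```python
def longest_queue_action(x):
    """Each server works on the longest queue among those it can serve."""
    server1 = 1 if x[0] >= x[2] else 3
    server2 = 2 if x[1] >= x[3] else 4
    return (server1, server2)

def max_weight_action(x, mu=(0.12, 0.12, 0.28, 0.28), eps=0.1):
    """Each server serves its queue with the larger rate-weighted index
    mu_i * x_i**eps, a simplified Max-Weight rule; the exact greedy computation
    with respect to J_MW would also account for jobs routed downstream."""
    server1 = 1 if mu[0] * x[0] ** eps >= mu[2] * x[2] ** eps else 3
    server2 = 2 if mu[1] * x[1] ** eps >= mu[3] * x[3] ** eps else 4
    return (server1, server2)
```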

92 This policy has been extensively studied and shown to have a number of good properties, for example, being throughput optimal and offering good performance for critically loaded settings [21]. [sent-250, score-0.264]

93 The performance metric we report for each control policy is the long run average number of jobs in the system under that policy, Σ_{t=1}^T ‖x_t‖_1 / T, where we set T ≜ 10000. [sent-255, score-0.319]
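
A sketch of estimating this metric by simulation, reusing the placeholder step() simulator from the earlier queueing sketch:

```python
import numpy as np

def average_jobs(policy, T=10000, x0=(0, 0, 0, 0), seed=0):
    """Estimate the long-run average number of jobs, (1/T) * sum_t ||x_t||_1,
    under the given policy; relies on step() defined in the simulator sketch."""
    rng = np.random.default_rng(seed)
    x, total = x0, 0.0
    for _ in range(T):
        x, cost = step(rng, x, policy(x))
        total += cost
    return total / T

# Example: average_jobs(longest_queue_action)
```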

94 To understand the effect of the sample size on the resulting policy performance, the different sample sizes listed in Table 1 were used. [sent-258, score-0.269]

95 Since the policies generated involve randomness due to the sampled states, we further average performance over 10 sets of sampled states. [sent-259, score-0.154]

96 RSALP is less sensitive to state sampling: We notice from the standard deviation values in Table 1 that our approach gives policies whose performance varies significantly less across different sample sets of the same size. [sent-269, score-0.138]

97 In summary we view these results as indicative of the possibility that the RSALP may serve as a practical and viable alternative to state-of-the-art parametric ADP techniques. [sent-270, score-0.106]

98 Feature selection using regularization in approximate linear programs for Markov decision processes. [sent-334, score-0.119]

99 On constraint sampling in the linear programming approach to approximate dynamic programming. [sent-361, score-0.198]

100 Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. [sent-396, score-0.405]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rsalp', 0.588), ('adp', 0.424), ('queueing', 0.235), ('salp', 0.212), ('policy', 0.201), ('queue', 0.131), ('sx', 0.126), ('alp', 0.118), ('queues', 0.118), ('server', 0.115), ('rs', 0.112), ('program', 0.112), ('approximation', 0.101), ('farias', 0.094), ('bellman', 0.086), ('jobs', 0.083), ('parametric', 0.074), ('scheduling', 0.068), ('policies', 0.068), ('basis', 0.063), ('wx', 0.063), ('idealized', 0.061), ('programming', 0.055), ('arrival', 0.054), ('lp', 0.053), ('dynamic', 0.053), ('architecture', 0.051), ('approximations', 0.048), ('regularization', 0.048), ('hilbert', 0.047), ('guarantee', 0.047), ('buffering', 0.047), ('ciamac', 0.047), ('dub', 0.047), ('graceful', 0.047), ('moallemi', 0.047), ('establish', 0.047), ('sampling', 0.046), ('approximate', 0.044), ('longest', 0.044), ('sampled', 0.043), ('reinforcement', 0.042), ('monomials', 0.042), ('jz', 0.042), ('mdp', 0.041), ('offset', 0.041), ('ormoneit', 0.038), ('designer', 0.038), ('intractably', 0.038), ('greedy', 0.037), ('inf', 0.037), ('weight', 0.036), ('state', 0.036), ('control', 0.035), ('sample', 0.034), ('throughput', 0.034), ('supplement', 0.034), ('kernel', 0.034), ('quantity', 0.033), ('smoothed', 0.033), ('viable', 0.032), ('service', 0.032), ('permitted', 0.03), ('dimension', 0.03), ('exact', 0.03), ('polynomially', 0.03), ('network', 0.029), ('specifying', 0.029), ('optimal', 0.029), ('dimensional', 0.029), ('loosely', 0.029), ('graduate', 0.029), ('action', 0.028), ('aaai', 0.027), ('decision', 0.027), ('dual', 0.026), ('solving', 0.026), ('business', 0.025), ('columbia', 0.025), ('restricting', 0.024), ('cubic', 0.024), ('inner', 0.024), ('hope', 0.024), ('markovian', 0.024), ('enjoys', 0.024), ('appears', 0.024), ('bandwidth', 0.024), ('user', 0.024), ('opposed', 0.024), ('mathematical', 0.023), ('mdps', 0.023), ('somewhat', 0.023), ('yields', 0.023), ('heuristic', 0.023), ('averaging', 0.023), ('complexity', 0.023), ('school', 0.022), ('potentially', 0.022), ('attained', 0.021), ('bold', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi

Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1

2 0.15355588 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one 2γ can compute stationary policies that are (1−γ)2 -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for com2γ puting non-stationary policies that can be up to 1−γ -optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

3 0.15272558 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

4 0.12661156 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

5 0.12576041 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

6 0.11697298 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

7 0.10501302 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

8 0.10444671 344 nips-2012-Timely Object Recognition

9 0.10270942 160 nips-2012-Imitation Learning by Coaching

10 0.092668466 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

11 0.082611583 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

12 0.081049457 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

13 0.079043694 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

14 0.075691417 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

15 0.074637882 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

16 0.073195294 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

17 0.066911004 292 nips-2012-Regularized Off-Policy TD-Learning

18 0.062258434 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

19 0.062014744 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

20 0.059495021 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.168), (1, -0.203), (2, 0.005), (3, -0.058), (4, 0.005), (5, 0.037), (6, -0.007), (7, -0.002), (8, 0.01), (9, -0.038), (10, 0.0), (11, -0.016), (12, 0.02), (13, 0.058), (14, 0.036), (15, -0.037), (16, -0.011), (17, -0.027), (18, 0.026), (19, 0.018), (20, -0.006), (21, -0.003), (22, 0.032), (23, -0.032), (24, -0.027), (25, 0.039), (26, -0.016), (27, 0.07), (28, -0.013), (29, -0.0), (30, -0.028), (31, -0.01), (32, -0.043), (33, 0.022), (34, -0.006), (35, -0.015), (36, -0.074), (37, 0.024), (38, 0.029), (39, 0.052), (40, -0.006), (41, 0.046), (42, -0.015), (43, 0.012), (44, -0.007), (45, -0.039), (46, 0.057), (47, 0.098), (48, -0.001), (49, -0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89914924 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi

Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1

2 0.83650845 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one 2γ can compute stationary policies that are (1−γ)2 -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for com2γ puting non-stationary policies that can be up to 1−γ -optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

3 0.82655996 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

4 0.77726626 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

Author: Felipe Trevizan, Manuela Veloso

Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectorybased short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 1070 states. 1

5 0.75266397 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

6 0.74505985 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

7 0.73414135 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

8 0.73043346 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

9 0.72815639 160 nips-2012-Imitation Learning by Coaching

10 0.7187885 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

11 0.71114254 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

12 0.67078358 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

13 0.66785675 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

14 0.64844084 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

15 0.63616288 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

16 0.622419 358 nips-2012-Value Pursuit Iteration

17 0.61067772 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

18 0.59654814 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

19 0.57609069 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

20 0.54877776 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.029), (21, 0.011), (38, 0.13), (42, 0.406), (54, 0.044), (55, 0.02), (74, 0.039), (76, 0.102), (80, 0.093), (92, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8581565 289 nips-2012-Recognizing Activities by Attribute Dynamics

Author: Weixin Li, Nuno Vasconcelos

Abstract: In this work, we consider the problem of modeling the dynamic structure of human activities in the attributes space. A video sequence is Ä?Ĺš rst represented in a semantic feature space, where each feature encodes the probability of occurrence of an activity attribute at a given time. A generative model, denoted the binary dynamic system (BDS), is proposed to learn both the distribution and dynamics of different activities in this space. The BDS is a non-linear dynamic system, which extends both the binary principal component analysis (PCA) and classical linear dynamic systems (LDS), by combining binary observation variables with a hidden Gauss-Markov state process. In this way, it integrates the representation power of semantic modeling with the ability of dynamic systems to capture the temporal structure of time-varying processes. An algorithm for learning BDS parameters, inspired by a popular LDS learning method from dynamic textures, is proposed. A similarity measure between BDSs, which generalizes the BinetCauchy kernel for LDS, is then introduced and used to design activity classiÄ?Ĺš ers. The proposed method is shown to outperform similar classiÄ?Ĺš ers derived from the kernel dynamic system (KDS) and state-of-the-art approaches for dynamics-based or attribute-based action recognition. 1

2 0.83116609 242 nips-2012-Non-linear Metric Learning

Author: Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, Kilian Q. Weinberger

Abstract: In this paper, we introduce two novel metric learning algorithms, χ2 -LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy-to-use. The two approaches achieve this goal in fundamentally different ways: χ2 -LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear χ2 -distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient-boosting to learn non-linear mappings directly in function space and takes advantage of this approach’s robustness, speed, parallelizability and insensitivity towards the single additional hyperparameter. On various benchmark data sets, we demonstrate these methods not only match the current state-of-the-art in terms of kNN classification error, but in the case of χ2 -LMNN, obtain best results in 19 out of 20 learning settings. 1

3 0.82583702 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

Author: Charles Blundell, Jeff Beck, Katherine A. Heller

Abstract: We present a Bayesian nonparametric model that discovers implicit social structure from interaction time-series data. Social groups are often formed implicitly, through actions among members of groups. Yet many models of social networks use explicitly declared relationships to infer social structure. We consider a particular class of Hawkes processes, a doubly stochastic point process, that is able to model reciprocity between groups of individuals. We then extend the Infinite Relational Model by using these reciprocating Hawkes processes to parameterise its edges, making events associated with edges co-dependent through time. Our model outperforms general, unstructured Hawkes processes as well as structured Poisson process-based models at predicting verbal and email turn-taking, and military conflicts among nations. 1

same-paper 4 0.80682462 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi

Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1

5 0.79644346 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that √ gradient estimates based on random perturbations suffer a factor use of at most d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problemdependent quantities: they cannot be improved by more than constant factors. 1

6 0.63073957 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

7 0.60505432 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

8 0.60073256 275 nips-2012-Privacy Aware Learning

9 0.59673458 285 nips-2012-Query Complexity of Derivative-Free Optimization

10 0.58766448 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

11 0.58624625 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

12 0.58040625 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

13 0.57416725 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

14 0.57225412 358 nips-2012-Value Pursuit Iteration

15 0.56919706 292 nips-2012-Regularized Off-Policy TD-Learning

16 0.5654707 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

17 0.56442058 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

18 0.55789489 217 nips-2012-Mixability in Statistical Learning

19 0.55527675 9 nips-2012-A Geometric take on Metric Learning

20 0.55057639 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications