nips nips2013 nips2013-280 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Grani Adiwena Hanasusanto, Daniel Kuhn
Abstract: In stochastic optimal control the distribution of the exogenous noise is typically unknown and must be inferred from limited data before dynamic programming (DP)-based solution schemes can be applied. If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly as they determine the evaluation points of the cost-to-go functions. The resulting data-driven DP scheme is asymptotically consistent and admits an efficient computational solution when combined with parametric value function approximations. If training data is sparse, however, the estimated cost-to-go functions display a high variability and an optimistic bias, while the corresponding control policies perform poorly in out-of-sample tests. To mitigate these small sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. We show that the arising minmax problems in the DP recursions reduce to tractable conic programs. We also demonstrate that the proposed robust DP algorithm dominates various non-robust schemes in out-of-sample tests across several application domains. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In stochastic optimal control the distribution of the exogenous noise is typically unknown and must be inferred from limited data before dynamic programming (DP)-based solution schemes can be applied. [sent-7, score-0.416]
2 If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly as they determine the evaluation points of the cost-to-go functions. [sent-8, score-0.291]
3 To mitigate these small sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. [sent-11, score-0.21]
4 1 Introduction We consider a stochastic optimal control problem in discrete time with continuous state and action spaces. [sent-14, score-0.137]
5 The endogenous state st ∈ Rd1 captures all decision-dependent information, while the exogenous state ξt ∈ Rd2 captures the external random disturbances. [sent-16, score-0.839]
6 Conditional on (st , ξt ) the decision maker chooses a control action ut ∈ Ut ⊆ Rm and incurs a cost ct (st , ξt , ut ). [sent-17, score-0.892]
7 Without much loss of generality we assume that the endogenous state obeys the recursion st+1 = gt (st , ut , ξt+1 ), while the evolution of the exogenous state can be modeled by a Markov process. [sent-19, score-0.831]
8 Note that even if the exogenous state process has finite memory, it can be recast as an equivalent Markov process on a higher-dimensional space. [sent-20, score-0.242]
9 By Bellman’s principle of optimality, a decision maker aiming to minimize the expected cumulative costs solves the dynamic program [sent-22, score-0.55]
10 Vt(st, ξt) = min over ut ∈ Ut of ct(st, ξt, ut) + E[Vt+1(st+1, ξt+1) | ξt], subject to st+1 = gt(st, ut, ξt+1),   (1) solved backwards for t = T, . . . , 1. [sent-24, score-0.719]
11 There is often a natural distinction between endogenous and exogenous states. [sent-35, score-0.369]
12 For example, in inventory control the inventory level can naturally be interpreted as the endogenous state, while the uncertain demand represents the exogenous state. [sent-36, score-0.453]
13 In spite of their exceptional modeling power, dynamic programming problems of the above type suffer from two major shortcomings that limit their practical applicability. [sent-37, score-0.18]
14 Secondly, even if the dynamic programming recursions (1) could be computed efficiently, there is often substantial uncertainty about the conditional distribution of ξt+1 given ξt . [sent-39, score-0.349]
15 Indeed, the distribution of the exogenous states is typically unknown and must be inferred from historical observations. [sent-40, score-0.299]
16 In this paper, we assume that only a set of N sample trajectories of the exogenous state is given, and we use kernel regression in conjunction with parametric value function approximations to estimate the conditional expectation in (1). [sent-43, score-0.389]
17 Thus, we approximate the conditional distribution of ξt+1 given ξt by a discrete distribution whose discretization points are given by the historical samples, while the corresponding conditional probabilities are expressed in terms of a normalized Nadaraya-Watson (NW) kernel function. [sent-44, score-0.285]
18 This data-driven dynamic programming (DDP) approach is conceptually appealing and avoids an artificial separation of estimation and optimization steps. [sent-45, score-0.18]
19 Instead, the historical samples are used directly in the dynamic programming recursions. [sent-46, score-0.287]
20 Moreover, DDP computes the value functions only on the N sample trajectories of the exogenous state, thereby mitigating one of the intractabilities of classical dynamic programming. [sent-48, score-0.392]
21 However, it also displays a downward bias caused by the minimization over ut . [sent-52, score-0.348]
22 As estimation errors in the cost-to-go functions are propagated through the dynamic programming recursions, the bias grows over time and thus incentivizes poor control decisions in the early time periods. [sent-54, score-0.27]
23 In this paper we propose a robust data-driven dynamic programming (RDDP) approach that replaces the expectation in (1) by a worst-case expectation over a set of distributions close to the nominal estimate in view of the χ2 -distance. [sent-56, score-0.367]
24 Robust value iteration methods have recently been studied in robust Markov decision process (MDP) theory [6, 7, 8, 9]. [sent-59, score-0.138]
25 However, these algorithms are not fundamentally data-driven as their primitives are uncertainty sets for the transition kernels instead of historical observations. [sent-60, score-0.141]
26 The robust value iterations in RDDP are facilitated by combining convex parametric function approximation methods (to model the dependence on the endogenous state) with nonparametric kernel regression techniques (for the dependence on the exogenous state). [sent-63, score-0.488]
27 Due to the convexity in the endogenous state, RDDP further benefits from mathematical programming techniques to optimize over high-dimensional continuous action spaces without requiring any form of discretization. [sent-65, score-0.287]
28 2 Data-driven dynamic programming Assume from now on that the distribution of the exogenous states is unknown and that we are only given N observation histories {ξ^i_t , t = 1, . . . , T } for i = 1, . . . , N. [sent-72, score-0.392]
29 For a large bandwidth, the conditional probabilities qti(ξt) converge to 1/N, in which case (2) reduces to the (unconditional) sample average. [sent-80, score-0.17]
30 In the following we set the bandwidth matrix H to its best estimate assuming that the historical observations {ξ^i_t , i = 1, . . . , N } follow a Gaussian distribution; see [19]. [sent-82, score-0.142]
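To make the weighting concrete, the following sketch (not the authors' code) computes normalized NW weights qti(ξt) with a Gaussian kernel; the function name and the use of the bandwidth matrix H as an inverse-covariance-style smoother are illustrative assumptions.

```python
# Minimal sketch of the Nadaraya-Watson weights q_t^i(xi_t) used to approximate the
# conditional distribution of xi_{t+1} given xi_t. Normalization makes any
# multiplicative kernel constants irrelevant.
import numpy as np

def nw_weights(xi_t, xi_hist, H):
    """xi_t: (d2,) current exogenous state; xi_hist: (N, d2) historical states xi_t^i;
    H: (d2, d2) bandwidth matrix. Returns the normalized weights q of shape (N,)."""
    Hinv = np.linalg.inv(H)
    diffs = xi_hist - xi_t                              # (N, d2)
    quad = np.einsum('nd,de,ne->n', diffs, Hinv, diffs)
    kernel_vals = np.exp(-0.5 * quad)                   # unnormalized Gaussian kernel
    return kernel_vals / kernel_vals.sum()
```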
31 Substituting (2) into (1) results in the data-driven dynamic programming (DDP) formulation [sent-83, score-0.471]
32 Vtd(st, ξt) = min over ut ∈ Ut of ct(st, ξt, ut) + Σ_{i=1}^{N} qti(ξt) V^d_{t+1}(s^i_{t+1}, ξ^i_{t+1}), subject to s^i_{t+1} = gt(st, ut, ξ^i_{t+1}) for all i,   (4) with terminal condition V^d_{T+1} ≡ 0. [sent-85, score-0.819]
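As an illustration of one DDP backward step (4), the sketch below evaluates the data-driven cost-to-go at a point (st, ξt); the cost function, dynamics, next-stage value function and the encoding of Ut are user-supplied placeholders, not part of the paper's code.

```python
# Sketch of one evaluation of (4): minimize the stage cost plus the NW-weighted
# average of the next-stage value function over the historical successors.
import numpy as np
from scipy.optimize import minimize

def ddp_step(s_t, xi_t, xi_next, q, cost, dynamics, V_next, u0, bounds=None):
    """xi_next: (N, d2) historical successor states xi_{t+1}^i; q: (N,) NW weights."""
    def objective(u):
        cont = sum(q[i] * V_next(dynamics(s_t, u, xi_next[i]), xi_next[i])
                   for i in range(len(q)))
        return cost(s_t, xi_t, u) + cont
    res = minimize(objective, u0, bounds=bounds)   # U_t represented by simple bounds here
    return res.fun, res.x
```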
33 Moreover, DDP evaluates the cost-to-go function of the next period only at the sample points and thus requires no a-priori discretization of the exogenous state space, thus mitigating one of the intractabilities of classical dynamic programming. [sent-90, score-0.404]
34 3 Robust data-driven dynamic programming If the training data is sparse, the NW estimate (2) of the conditional expectation in (4) typically exhibits a small bias and a high variability. [sent-91, score-0.287]
35 Assume that d1 = 1, d2 = m = 5, ct(st, ξt, ut) = 0, gt(st, ut, ξt+1) = ξ⊤t+1 ut, Ut = {u ∈ Rm : 1⊤u = 1} and Vt+1(st+1, ξt+1) = (1/10) s²t+1 − st+1. [sent-97, score-1.01]
36 By permutation symmetry, the optimal decision under full distributional knowledge is u∗ = (1/5)·1. [sent-102, score-0.116]
37 To showcase the high variability of NW estimation, we fix the decision u∗ and use (2) to estimate its expected cost conditional on ξt = 1. [sent-110, score-0.192]
38 Next, we use (4) to estimate Vtd (st , 1), that is, the expected cost of the best decision obtained without distributional information. [sent-112, score-0.15]
39 Figure 1 (middle) shows that this cost estimator is even more noisy than the one for a fixed decision, exhibits a significant downward bias and converges slowly as N grows. [sent-113, score-0.115]
40 Setting V^d_{t+1} ≡ Vt+1, we find Vt(st, ξt) = min over ut ∈ Ut of ct(st, ξt, ut) + E[Vt+1(gt(st, ut, ξt+1), ξt+1) | ξt] ≈ min over ut ∈ Ut of ct(st, ξt, ut) + E[ Σ_{i=1}^{N} qti(ξt) V^d_{t+1}(gt(st, ut, ξ^i_{t+1}), ξ^i_{t+1}) | ξt ] [sent-130, score-2.325]
41 ≥ E[ min over ut ∈ Ut of ct(st, ξt, ut) + Σ_{i=1}^{N} qti(ξt) V^d_{t+1}(gt(st, ut, ξ^i_{t+1}), ξ^i_{t+1}) | ξt ]. The relation in the second line uses our observation that the NW estimator of the expected cost associated with any fixed decision ut is approximately unbiased, while the final inequality holds because moving the minimization inside the expectation can only decrease the value; the last expression is (approximately) the conditional expectation of the DDP estimate Vtd(st, ξt), which explains its downward bias. [sent-131, score-1.069]
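A tiny numerical illustration (not from the paper) of this optimization bias: when several decisions have identical true costs, the minimum of noisy cost estimates systematically undershoots the true minimum.

```python
# Averaging the min over noisy estimates demonstrates the downward bias E[min] <= min E.
import numpy as np

rng = np.random.default_rng(0)
true_costs = np.ones(3)            # three decisions, all with true cost 1.0
estimated_minima = [np.min(true_costs + rng.normal(0.0, 0.5, size=3))
                    for _ in range(10_000)]
print(np.mean(estimated_minima))   # noticeably below the true minimum of 1.0
```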
42 We emphasize that all systematic estimation errors of this type accumulate as they are propagated through the dynamic programming recursions. [sent-136, score-0.18]
43 To mitigate the detrimental overfitting effects, we propose a regularization that reduces the decision maker’s overconfidence in the weights qt(ξt) = [qt1(ξt), . . . , qtN(ξt)]⊤. [sent-137, score-0.171]
44 Thus, we allow the conditional probabilities used in (4) to deviate from their nominal values qt (ξt ) up to a certain degree. [sent-141, score-0.244]
45 Allowing the conditional probabilities in (4) to range over the uncertainty set ∆(qt(ξt)) results in the robust data-driven dynamic programming (RDDP) formulation [sent-150, score-0.663]
46 Vtr(st, ξt) = min over ut ∈ Ut of ct(st, ξt, ut) + max over p ∈ ∆(qt(ξt)) of Σ_{i=1}^{N} pi V^r_{t+1}(s^i_{t+1}, ξ^i_{t+1}), subject to s^i_{t+1} = gt(st, ut, ξ^i_{t+1}) for all i,   (6) with terminal condition V^r_{T+1} ≡ 0. [sent-152, score-0.781]
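For intuition, the inner maximization in (6) over the χ²-ball around qt(ξt) can be prototyped with a general-purpose solver, as in the sketch below; the paper instead dualizes it into the conic program (7), and the function name and the choice of SLSQP are illustrative assumptions.

```python
# Sketch of the worst-case expectation max_{p in Delta(q)} sum_i p_i v_i, where
# Delta(q) = {p >= 0, 1'p = 1, sum_i (p_i - q_i)^2 / q_i <= gamma} (chi-squared ball).
import numpy as np
from scipy.optimize import minimize

def worst_case_expectation(v, q, gamma):
    """v: (N,) next-stage values; q: (N,) nominal NW weights (strictly positive)."""
    constraints = [
        {'type': 'eq',   'fun': lambda p: np.sum(p) - 1.0},
        {'type': 'ineq', 'fun': lambda p: gamma - np.sum((p - q) ** 2 / q)},
    ]
    res = minimize(lambda p: -p @ v, q, bounds=[(0.0, 1.0)] * len(q),
                   constraints=constraints, method='SLSQP')
    return -res.fun, res.x   # worst-case expectation and the attaining weights
```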
47 Thus, each RDDP recursion involves the solution of a robust optimization problem [4], which can be viewed as a game against ‘nature’ (or a malicious adversary): for every action ut chosen by the decision maker, nature selects the corresponding worst-case weight vector from within p ∈ ∆ (qt (ξt )). [sent-153, score-0.472]
48 By anticipating nature’s moves, the decision maker is forced to select more conservative decisions that are less susceptible to amplifying estimation errors in the nominal weights qt (ξt ). [sent-154, score-0.319]
49 We suggest to choose γ large enough such that the envelope of all conditional CDFs of ξt+1 implied by the weight vectors in ∆(qt (ξt )) covers the true conditional CDF with high confidence (Figure 2). [sent-156, score-0.173]
50 RDDP seeks the worst-case probabilities of N historical samples of the exogenous state, using the NW weights as nominal probabilities. [sent-168, score-0.405]
51 Worst-case probabilities are then assigned to the bins, whose nominal probabilities are given by the empirical frequencies. [sent-170, score-0.137]
52 4 Computational solution procedure In this section we demonstrate that RDDP is computationally tractable under a convexity assumption and if we approximate the dependence of the cost-to-go functions on the endogenous state through a piecewise linear or quadratic approximation architecture. [sent-180, score-0.29]
53 For all t = 1, . . . , T, the cost function ct is convex quadratic in (st, ut), the transition function gt is affine in (st, ut), and the feasible set Ut is second-order conic representable. [sent-186, score-0.837]
54 Suppose that Assumption 4.1 holds and that the cost-to-go function Vt+1 is convex in the endogenous state. [sent-191, score-0.205]
55 Vtr(st, ξt) = min  ct(st, ξt, ut) + λγ − µ − 2 qt(ξt)⊤y + 2λ qt(ξt)⊤1 [sent-193, score-0.357]
56 subject to ut ∈ Ut, µ ∈ R, λ ∈ R+, z, y ∈ RN, with V^r_{t+1}(gt(st, ut, ξ^i_{t+1}), ξ^i_{t+1}) ≤ zi for all i, 4y²i + (zi + µ)² ≤ (2λ − zi − µ)² for all i, and zi + µ ≤ λ for all i.   (7) [sent-195, score-0.582]
57 Corollary: if Assumption 4.1 holds, then RDDP preserves convexity in the exogenous state. [sent-197, score-0.192]
58 Thus, Vtr(st, ξt) is convex in st whenever V^r_{t+1}(st+1, ξt+1) is convex in st+1. [sent-198, score-0.426]
59 Note that problem (7) becomes a tractable second-order cone program if V^r_{t+1} is convex piecewise linear or convex quadratic in st+1. [sent-199, score-0.173]
60 Algorithm 1: Robust data-driven dynamic programming. Inputs: sample trajectories {s^k_t , t = 1, . . . , T } for k = 1, . . . , K. [sent-201, score-0.218]
61 Initially, we collect historical observation trajectories of the exogenous state {ξ^i_t , t = 1, . . . , T } for i = 1, . . . , N, [sent-231, score-0.387]
62 and generate sample trajectories of the endogenous state {s^k_t , t = 1, . . . , T } for k = 1, . . . , K [sent-234, score-0.265]
63 by simulating the evolution of st under a prescribed control policy along randomly selected exogenous state trajectories. [sent-237, score-0.697]
64 The core of the algorithm computes approximate value functions V̂tr, which are piecewise linear or quadratic in st, by backward induction on t: each backward step takes V̂^r_{t+1} as an input and computes the optimal value of the corresponding robust recursion. [sent-240, score-0.476]
65 If the endogenous state is univariate (d1 = 1), the following piecewise linear approximation is used. [sent-243, score-0.262]
66 min over Mi ∈ Sd1 with Mi ⪰ 0, mi ∈ Rd1 and m0i ∈ R of Σ_{k=1}^{K} ( (s^k_t)⊤ Mi s^k_t + 2 mi⊤ s^k_t + m0i − V̂^r_{t,k,i} )².   (8b) [sent-246, score-0.213]
67 Quadratic approximation architectures of the above type first emerged in approximate dynamic programming [5]. [sent-248, score-0.203]
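A sketch of fitting the convex quadratic architecture of (8b), V(s) ≈ s⊤Ms + 2m⊤s + m0, to the regressed values at the sample states with M constrained to be positive semidefinite; CVXPY and the least-squares fitting criterion are illustrative choices here, not the paper's own solver setup.

```python
# Fit a convex quadratic value-function approximation to target values v at states S.
import cvxpy as cp
import numpy as np

def fit_convex_quadratic(S, v):
    """S: (K, d1) sample endogenous states s_t^k; v: (K,) regressed target values."""
    K, d = S.shape
    M = cp.Variable((d, d), PSD=True)     # PSD constraint keeps the fit convex in s
    m = cp.Variable(d)
    m0 = cp.Variable()
    residuals = cp.hstack([S[k] @ M @ S[k] + 2 * m @ S[k] + m0 - v[k] for k in range(K)])
    cp.Problem(cp.Minimize(cp.sum_squares(residuals))).solve()
    return M.value, m.value, m0.value
```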
68 The RDDP algorithm remains valid if the feasible set Ut depends on the state (st, ξt) or if the control action ut includes components that are of the ‘here-and-now’ type (i.e., components that must be chosen before ξt+1 is revealed). [sent-255, score-0.428]
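Putting the pieces together, a bare-bones version of the backward pass of Algorithm 1 could look as follows; every helper referenced here (the NW weights, the robust step, the architecture fit) is a placeholder for the corresponding component sketched above, and the joint fit over (s, ξ) pairs is a simplification of the per-sample architectures used in the paper.

```python
# Skeleton of the backward pass: evaluate the robust recursion at the sample points
# and fit a convex value-function approximation, working backwards from t = T.
def rddp_backward_pass(S, Xi, T, robust_step, fit_value_function, nw_weights_fn, H, gamma):
    """S[t]: iterable of sample endogenous states at stage t; Xi[t]: iterable of
    historical exogenous states at stage t (both indexed 1, ..., T+1)."""
    V_hat = {T + 1: lambda s, xi: 0.0}                   # terminal condition V_{T+1} = 0
    for t in range(T, 0, -1):
        points, targets = [], []
        for s in S[t]:
            for xi in Xi[t]:
                q = nw_weights_fn(xi, Xi[t], H)          # condition the weights on xi
                val, _ = robust_step(s, xi, Xi[t + 1], q, gamma, V_hat[t + 1])
                points.append((s, xi))
                targets.append(val)
        V_hat[t] = fit_value_function(points, targets)   # returns a callable V_hat_t(s, xi)
    return V_hat
```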
69 5 Experimental results We evaluate the RDDP algorithm of § 4 in the context of an index tracking and a wind energy commitment application. [sent-261, score-0.489]
70 1 Index tracking The objective of index tracking is to match the performance of a stock index as closely as possible with a portfolio of other financial instruments. [sent-264, score-0.237]
71 Figure 3: Cumulative distribution function of the sum of squared tracking errors (in ‰) for LSPI, DDP and RDDP. [sent-285, score-0.146]
72 Let st ∈ R+ be the value of the current tracking portfolio relative to the value of the S&P 500 on day t, while ξt ∈ R5 denotes the vector of the total index returns (price relatives) from day t − 1 to day t. [sent-288, score-0.737]
73 The objective of index tracking is to maintain st close to 1 in a least-squares sense throughout the planning horizon, which gives rise to the following dynamic program with terminal condition VT +1 ≡ 0. [sent-290, score-0.647]
74 Vt(st, ξt) = min (1 − st)² + E[Vt+1(st+1, ξt+1) | ξt] [sent-291, score-0.37]
75 subject to u ∈ R5+, 1⊤u = st, u1 = 0, st+1 = ξ⊤t+1 u / ξt+1,1.   (9) Here, ui/st can be interpreted as the portion of the tracking portfolio that is invested in index i on day t. [sent-293, score-0.583]
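For concreteness, the following sketch simulates the evolution of the tracking state st in (9) under a simple fixed-fraction rebalancing policy; it only clarifies the state dynamics and does not reproduce any of the reported strategies, and the return data and weights are placeholders.

```python
# Simulate s_{t+1} = xi_{t+1}' u / xi_{t+1,1} with 1'u = s_t and u_1 = 0
# (no position in the tracked index itself), accumulating the errors (1 - s_t)^2.
import numpy as np

def tracking_error(returns, fractions):
    """returns: (T, 5) price relatives, column 0 = tracked index; fractions: (4,)
    fixed portfolio fractions across the four tracking instruments, summing to one."""
    s, sq_errors = 1.0, []
    for xi in returns:
        u = np.concatenate(([0.0], s * fractions))   # rebalance the portfolio each day
        s = xi @ u / xi[0]
        sq_errors.append((1.0 - s) ** 2)
    return float(np.sum(sq_errors))
```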
76 Our computational experiment is based on historical returns of the indices over 5440 days from 26-Aug-1991 to 8-Mar-2013 (272 trading months). [sent-294, score-0.132]
77 As the endogenous state is univariate, DDP and RDDP employ the piecewise linear approximation architecture (8a). [sent-298, score-0.262]
78 2 Wind energy commitment Next, we apply RDDP to the wind energy commitment problem proposed in [28, 29]. [sent-306, score-0.596]
79 On every day t, a wind energy producer chooses the energy commitment levels xt ∈ R24+ for the next 24 hours. (Table: out-of-sample statistics — Mean, Std. — for the Persistence, DDP and RDDP strategies.) [sent-307, score-0.613]
80 However, the hourly amounts of wind energy ωt+1 ∈ R24+ generated over the day are uncertain. [sent-345, score-0.368]
81 If the actual production falls short of the commitment levels, there is a penalty of twice the respective day-ahead price for each unit of unsatisfied demand. [sent-346, score-0.133]
82 The wind energy producer also operates three storage devices indexed by l ∈ {1, 2, 3}, each of which can have a different capacity s̄^l, hourly leakage ρ^l, charging efficiency ρ^l_c and discharging efficiency ρ^l_d. [sent-347, score-0.468]
83 We denote by s^l_{t+1} ∈ R24+ the hourly filling levels of storage l over the next 24 hours. [sent-348, score-0.182]
84 The wind producer’s objective is to maximize the expected profit over a short-term planning horizon of T = 7 days. [sent-349, score-0.18]
85 The endogenous state is given by the storage levels at the end of day t, st = {s^l_{t,24}}_{l=1}^{3} ∈ R3+, while the exogenous state comprises the day-ahead prices πt ∈ R24+ and the wind energy production levels ωt ∈ R24+ of day t − 1, which are revealed to the producer on day t. [sent-350, score-1.507]
86 The best bidding and storage strategy can be found by solving the dynamic program Vt(st, ξt) = max π⊤t xt − 2π⊤t E[e^u_{t+1} | ξt] + E[Vt+1(st+1, ξt+1) | ξt] [sent-352, score-0.198]
87 subject to xt, e^c_{t+1}, e^w_{t+1}, e^u_{t+1} ∈ R24+, e^{+,l}_{t+1}, e^{−,l}_{t+1}, s^l_{t+1} ∈ R24+ for all l, ωt+1,h = e^c_{t+1,h} + e^{+,1}_{t+1,h} + e^{+,2}_{t+1,h} + e^{+,3}_{t+1,h} + e^w_{t+1,h} for all h, xt,h = e^c_{t+1,h} + e^{−,1}_{t+1,h} + e^{−,2}_{t+1,h} + e^{−,3}_{t+1,h} + e^u_{t+1,h} for all h, s^l_{t+1,h} = ρ^l s^l_{t+1,h−1} + ρ^l_c e^{+,l}_{t+1,h} − (1/ρ^l_d) e^{−,l}_{t+1,h} for all h, l, and s^l_{t+1,h} ≤ s̄^l for all h, l,   (10) with terminal condition VT+1 ≡ 0. [sent-354, score-0.251]
88 Besides the usual here-and-now decisions xt, the decision vector ut now also includes wait-and-see decisions that are chosen after ξt+1 has been revealed (see the remark in § 4). [sent-356, score-0.409]
89 Our computational experiment is based on day-ahead prices for the PJM market and wind speed data for North Carolina (33. [sent-358, score-0.251]
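The hourly storage dynamics appearing in (10) amount to a one-line update per storage device; the sketch below spells it out with illustrative parameter names.

```python
# Filling level of storage l at hour h: leakage applied to the previous level, plus
# charged energy scaled by the charging efficiency, minus discharged energy inflated
# by the discharging efficiency.
def storage_level(s_prev, e_charge, e_discharge, rho, rho_c, rho_d):
    return rho * s_prev + rho_c * e_charge - e_discharge / rho_d
```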
90 As ξt is a 48 dimensional vector with high correlations between its components, we perform principal component analysis to obtain a 6 dimensional subspace that explains more than 90% of the variability of the historical observations. [sent-363, score-0.132]
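A minimal sketch of the dimensionality reduction described here, projecting the 48-dimensional exogenous observations onto the smallest number of principal components that explains the target share of variance (a plain numpy SVD, not the authors' code).

```python
# Project historical exogenous observations onto the leading principal components.
import numpy as np

def pca_project(X, var_target=0.90):
    """X: (num_days, 48) matrix of day-ahead prices and wind production levels."""
    Xc = X - X.mean(axis=0)
    _, svals, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(svals ** 2) / np.sum(svals ** 2)
    k = int(np.searchsorted(explained, var_target)) + 1   # smallest k reaching the target
    return Xc @ Vt[:k].T, Vt[:k]                           # scores and component loadings
```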
91 The conditional probabilities qt (ξt ) are subsequently estimated using the projected data points. [sent-364, score-0.169]
92 We solve the wind energy commitment problem using the DDP and RDDP algorithms (i.e., the algorithm of § 4 with γ = 0 and γ = 1, respectively) [sent-366, score-0.388]
93 as well as a persistence heuristic that naively pledges the wind generation of the previous day by setting xt = ωt. [sent-368, score-0.329]
94 Note that problem (10) is beyond the scope of traditional reinforcement learning algorithms due to the high dimensionality of the action spaces and the seasonalities in the wind and price data. [sent-370, score-0.263]
95 We train DDP and RDDP on the first 260 weeks and test the resulting commitment strategies as well as the persistence heuristic on the last 260 weeks of the data set. [sent-371, score-0.273]
96 Concluding remarks The proposed RDDP algorithm combines ideas from robust optimization, reinforcement learning and approximate dynamic programming. [sent-377, score-0.219]
97 Assumption 4.1 could be relaxed to allow ct and gt to display a general nonlinear dependence on st. [sent-380, score-0.507]
98 If one is willing to accept a potentially larger mismatch between the true nonconvex cost-to-go function and its convex approximation architecture, then Algorithm 1 can even be applied to specific motor control, vehicle control or other nonlinear control problems. [sent-384, score-0.116]
99 Robust control of Markov decision processes with uncertain transition matrices. [sent-415, score-0.116]
100 Operation and configuration of a storage portfolio via convex optimization. [sent-554, score-0.125]
wordName wordTfidf (topN-words)
[('rddp', 0.503), ('st', 0.37), ('ut', 0.291), ('ddp', 0.277), ('vt', 0.234), ('exogenous', 0.192), ('wind', 0.18), ('endogenous', 0.177), ('vtr', 0.17), ('commitment', 0.133), ('dynamic', 0.113), ('historical', 0.107), ('nw', 0.09), ('sk', 0.084), ('vtd', 0.078), ('qti', 0.078), ('qt', 0.077), ('day', 0.077), ('energy', 0.075), ('nominal', 0.075), ('recursions', 0.074), ('tracking', 0.073), ('decision', 0.072), ('persistence', 0.072), ('gt', 0.071), ('programming', 0.067), ('robust', 0.066), ('ct', 0.066), ('lspi', 0.065), ('sl', 0.063), ('storage', 0.062), ('conditional', 0.061), ('producer', 0.052), ('maker', 0.051), ('state', 0.05), ('mi', 0.045), ('distributional', 0.044), ('control', 0.044), ('dp', 0.043), ('action', 0.043), ('policy', 0.041), ('terminal', 0.04), ('reinforcement', 0.04), ('percentiles', 0.039), ('trajectories', 0.038), ('dominates', 0.036), ('hourly', 0.036), ('portfolio', 0.035), ('piecewise', 0.035), ('bandwidth', 0.035), ('cost', 0.034), ('uncertainty', 0.034), ('downward', 0.034), ('weeks', 0.034), ('kh', 0.032), ('cdfs', 0.032), ('ec', 0.032), ('probabilities', 0.031), ('pro', 0.031), ('cone', 0.031), ('sedumi', 0.03), ('yalmip', 0.03), ('intractability', 0.029), ('quadratic', 0.028), ('convex', 0.028), ('conic', 0.028), ('index', 0.028), ('prices', 0.026), ('envelope', 0.026), ('intractabilities', 0.026), ('implied', 0.025), ('kernel', 0.025), ('trading', 0.025), ('variability', 0.025), ('expectations', 0.024), ('carolina', 0.024), ('interpolation', 0.024), ('cdf', 0.024), ('tting', 0.024), ('estimator', 0.024), ('bias', 0.023), ('decisions', 0.023), ('program', 0.023), ('expectation', 0.023), ('hannah', 0.023), ('mitigating', 0.023), ('lausanne', 0.023), ('backward', 0.022), ('si', 0.022), ('mitigate', 0.022), ('ew', 0.021), ('wins', 0.021), ('induction', 0.021), ('levels', 0.021), ('susceptible', 0.021), ('kuhn', 0.021), ('inherits', 0.021), ('median', 0.02), ('inventory', 0.02), ('histories', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 280 nips-2013-Robust Data-Driven Dynamic Programming
Author: Grani Adiwena Hanasusanto, Daniel Kuhn
Abstract: In stochastic optimal control the distribution of the exogenous noise is typically unknown and must be inferred from limited data before dynamic programming (DP)-based solution schemes can be applied. If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly as they determine the evaluation points of the cost-to-go functions. The resulting data-driven DP scheme is asymptotically consistent and admits an efficient computational solution when combined with parametric value function approximations. If training data is sparse, however, the estimated cost-to-go functions display a high variability and an optimistic bias, while the corresponding control policies perform poorly in out-of-sample tests. To mitigate these small sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. We show that the arising minmax problems in the DP recursions reduce to tractable conic programs. We also demonstrate that the proposed robust DP algorithm dominates various non-robust schemes in out-of-sample tests across several application domains. 1
2 0.28525385 348 nips-2013-Variational Policy Search via Trajectory Optimization
Author: Sergey Levine, Vladlen Koltun
Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1
3 0.14655691 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
4 0.12422918 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
5 0.11768712 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
Author: David J. Weiss, Ben Taskar
Abstract: Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to finetune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task. 1
6 0.11312157 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
7 0.10584508 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
8 0.10287109 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
9 0.095326439 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
10 0.086042553 26 nips-2013-Adaptive Market Making via Online Learning
11 0.085458525 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
12 0.084824339 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking
13 0.079880223 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
14 0.078970894 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
15 0.07643196 67 nips-2013-Conditional Random Fields via Univariate Exponential Families
16 0.072917409 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
17 0.071878962 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
18 0.071472563 230 nips-2013-Online Learning with Costly Features and Labels
19 0.06979543 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
20 0.068326823 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
topicId topicWeight
[(0, 0.176), (1, -0.126), (2, -0.003), (3, 0.026), (4, -0.044), (5, 0.009), (6, -0.018), (7, 0.082), (8, 0.003), (9, -0.045), (10, -0.028), (11, -0.111), (12, 0.023), (13, 0.055), (14, -0.019), (15, 0.021), (16, 0.031), (17, -0.05), (18, -0.01), (19, 0.0), (20, -0.032), (21, -0.009), (22, -0.002), (23, -0.015), (24, 0.02), (25, -0.02), (26, -0.02), (27, 0.003), (28, 0.005), (29, -0.025), (30, 0.095), (31, 0.037), (32, 0.012), (33, 0.057), (34, -0.081), (35, 0.021), (36, -0.127), (37, -0.036), (38, -0.067), (39, 0.066), (40, 0.004), (41, -0.038), (42, -0.03), (43, 0.03), (44, -0.081), (45, 0.138), (46, 0.114), (47, -0.096), (48, 0.168), (49, 0.104)]
simIndex simValue paperId paperTitle
same-paper 1 0.93278664 280 nips-2013-Robust Data-Driven Dynamic Programming
Author: Grani Adiwena Hanasusanto, Daniel Kuhn
Abstract: In stochastic optimal control the distribution of the exogenous noise is typically unknown and must be inferred from limited data before dynamic programming (DP)-based solution schemes can be applied. If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly as they determine the evaluation points of the cost-to-go functions. The resulting data-driven DP scheme is asymptotically consistent and admits an efficient computational solution when combined with parametric value function approximations. If training data is sparse, however, the estimated cost-to-go functions display a high variability and an optimistic bias, while the corresponding control policies perform poorly in out-of-sample tests. To mitigate these small sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. We show that the arising minmax problems in the DP recursions reduce to tractable conic programs. We also demonstrate that the proposed robust DP algorithm dominates various non-robust schemes in out-of-sample tests across several application domains. 1
2 0.57104158 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
3 0.5332846 348 nips-2013-Variational Policy Search via Trajectory Optimization
Author: Sergey Levine, Vladlen Koltun
Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1
4 0.51923215 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers
Author: Kareem Amin, Afshin Rostamizadeh, Umar Syed
Abstract: Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer’s value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller’s long-term revenue. We define the natural notion of strategic regret — the lost revenue as measured against a truthful (non-strategic) buyer. We present seller algorithms that are no(strategic)-regret when the buyer discounts her future surplus — i.e. the buyer prefers showing advertisements to users sooner rather than later. We also give a lower bound on strategic regret that increases as the buyer’s discounting weakens and shows, in particular, that any seller algorithm will suffer linear strategic regret if there is no discounting. 1
5 0.51576847 26 nips-2013-Adaptive Market Making via Online Learning
Author: Jacob Abernethy, Satyen Kale
Abstract: We consider the design of strategies for market making in an exchange. A market maker generally seeks to profit from the difference between the buy and sell price of an asset, yet the market maker also takes exposure risk in the event of large price movements. Profit guarantees for market making strategies have typically required certain stochastic assumptions on the price fluctuations of the asset in question; for example, assuming a model in which the price process is mean reverting. We propose a class of “spread-based” market making strategies whose performance can be controlled even under worst-case (adversarial) settings. We prove structural properties of these strategies which allows us to design a master algorithm which obtains low regret relative to the best such strategy in hindsight. We run a set of experiments showing favorable performance on recent real-world stock price data. 1
6 0.5043996 139 nips-2013-How to Hedge an Option Against an Adversary: Black-Scholes Pricing is Minimax Optimal
7 0.49174637 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
8 0.46622327 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
9 0.45490876 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations
10 0.45264673 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations
11 0.43509948 67 nips-2013-Conditional Random Fields via Univariate Exponential Families
12 0.42444667 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
13 0.41559899 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
14 0.41492459 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
15 0.41251323 347 nips-2013-Variational Planning for Graph-based MDPs
16 0.40941226 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
17 0.40157381 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
18 0.40154675 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
19 0.39088851 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
20 0.38527426 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
topicId topicWeight
[(2, 0.019), (16, 0.033), (33, 0.12), (34, 0.119), (36, 0.011), (41, 0.038), (49, 0.031), (56, 0.102), (66, 0.015), (70, 0.02), (85, 0.04), (89, 0.053), (92, 0.247), (93, 0.059), (95, 0.017)]
simIndex simValue paperId paperTitle
1 0.82180148 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport
Author: Marco Cuturi
Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms’ dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximumentropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn’s matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.
same-paper 2 0.78753614 280 nips-2013-Robust Data-Driven Dynamic Programming
Author: Grani Adiwena Hanasusanto, Daniel Kuhn
Abstract: In stochastic optimal control the distribution of the exogenous noise is typically unknown and must be inferred from limited data before dynamic programming (DP)-based solution schemes can be applied. If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly as they determine the evaluation points of the cost-to-go functions. The resulting data-driven DP scheme is asymptotically consistent and admits an efficient computational solution when combined with parametric value function approximations. If training data is sparse, however, the estimated cost-to-go functions display a high variability and an optimistic bias, while the corresponding control policies perform poorly in out-of-sample tests. To mitigate these small sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. We show that the arising minmax problems in the DP recursions reduce to tractable conic programs. We also demonstrate that the proposed robust DP algorithm dominates various non-robust schemes in out-of-sample tests across several application domains. 1
3 0.76419199 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
Author: Marcelo Fiori, Pablo Sprechmann, Joshua Vogelstein, Pablo Muse, Guillermo Sapiro
Abstract: Graph matching is a challenging problem with very important applications in a wide range of fields, from image and video analysis to biological and biomedical problems. We propose a robust graph matching algorithm inspired in sparsityrelated techniques. We cast the problem, resembling group or collaborative sparsity formulations, as a non-smooth convex optimization problem that can be efficiently solved using augmented Lagrangian techniques. The method can deal with weighted or unweighted graphs, as well as multimodal data, where different graphs represent different types of data. The proposed approach is also naturally integrated with collaborative graph inference techniques, solving general network inference problems where the observed variables, possibly coming from different modalities, are not in correspondence. The algorithm is tested and compared with state-of-the-art graph matching techniques in both synthetic and real graphs. We also present results on multimodal graphs and applications to collaborative inference of brain connectivity from alignment-free functional magnetic resonance imaging (fMRI) data. The code is publicly available. 1
4 0.68774909 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
5 0.65209931 99 nips-2013-Dropout Training as Adaptive Regularization
Author: Stefan Wager, Sida Wang, Percy Liang
Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1
6 0.65016878 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
7 0.64869702 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
8 0.64767873 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
9 0.64753032 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
10 0.64738148 173 nips-2013-Least Informative Dimensions
11 0.64682961 5 nips-2013-A Deep Architecture for Matching Short Texts
12 0.64602363 201 nips-2013-Multi-Task Bayesian Optimization
14 0.64463848 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
15 0.64416492 69 nips-2013-Context-sensitive active sensing in humans
16 0.64291805 350 nips-2013-Wavelets on Graphs via Deep Learning
17 0.64285487 318 nips-2013-Structured Learning via Logistic Regression
18 0.64234543 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
19 0.64209926 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
20 0.64116865 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.