nips nips2012 nips2012-110 knowledge-graph by maker-knowledge-mining

110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems


Source: pdf

Author: Morteza Ibrahimi, Adel Javanmard, Benjamin Van Roy

Abstract: We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, for the average cost LQ problem, a regret bound of O(√T) was shown, apart from logarithmic factors. However, this bound scales exponentially with p, the dimension of the state space. In this work we consider the case where the matrices describing the dynamics of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that achieves a regret bound of O(p√T), apart from logarithmic factors. In particular, our algorithm has an average cost of (1 + ε) times the optimum cost after T = polylog(p)·O(1/ε²). This is in comparison to previous work on the dense dynamics, where the algorithm requires time that scales exponentially with dimension in order to achieve regret of ε times the optimal cost. We believe that our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. [sent-4, score-0.286]

2 Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. [sent-5, score-0.461]

3 More recently, for the average cost LQ problem, a regret bound of O(√T) was shown, apart from logarithmic factors. [sent-6, score-0.355]

4 In this work we consider the case where the matrices describing the dynamics of the LQ system are sparse and their dimensions are large. [sent-8, score-0.192]

5 We present an adaptive control scheme that achieves a regret bound of O(p√T), apart from logarithmic factors. [sent-9, score-0.476]

6 In particular, our algorithm has an average cost of (1 + ε) times the optimum cost after T = polylog(p)·O(1/ε²). [sent-10, score-0.19]

7 This is in comparison to previous work on the dense dynamics, where the algorithm requires time that scales exponentially with dimension in order to achieve regret of ε times the optimal cost. [sent-11, score-0.324]

8 We believe that our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks. [sent-12, score-0.4]

9 1 Introduction In this paper we address the problem of adaptive control of a high dimensional linear quadratic (LQ) system. [sent-13, score-0.286]

10 The matrices Q ∈ Rp×p and R ∈ Rr×r are positive semi-definite (PSD) matrices that determine the cost at each step. [sent-18, score-0.136]

11 The evolution of the system is described through the matrices A0 ∈ Rp×p and B 0 ∈ Rp×r . [sent-19, score-0.118]

12 Finally, by a high dimensional system we mean the case where p, r ≫ 1. [sent-20, score-0.138]

13 A celebrated fundamental theorem in control theory asserts that the above LQ system can be optimally controlled by a simple linear feedback if the pair (A0, B0) is controllable and the pair (A0, Q1/2) is observable. [sent-21, score-0.413]

14 The optimal controller can be explicitly computed from the matrices describing the dynamics and the cost. [sent-22, score-0.335]

15 Throughout this paper we assume that controllability and observability conditions hold. [sent-23, score-0.083]

16 When the matrix Θ0 ≡ [A0 , B 0 ] is unknown, the task is that of adaptive control, where the system is to be learned and controlled at the same time. [sent-24, score-0.249]

17 Early works on the adaptive control of LQ systems relied on the certainty equivalence principle [2]. [sent-25, score-0.267]

18 In this scheme, at each time t the unknown parameter Θ0 is estimated based on the observations collected so far, and the optimal controller for the estimated system is applied. [sent-26, score-0.434]

19 Such controllers are shown to converge to an optimal controller in the case of minimum variance cost; however, in general they may converge to a suboptimal controller [11]. [sent-27, score-0.466]

20 Subsequently, it has been shown that introducing random exploration by adding noise to the control signal, e.g. [sent-28, score-0.136]

21 All the aforementioned works have been concerned with the asymptotic convergence of the controller to an optimal controller. [sent-31, score-0.281]

22 In order to achieve regret bounds, cost-biased parameter estimation [12, 8, 1], in particular the optimism in the face of uncertainty (OFU) principle [13], has been shown to be effective. [sent-32, score-0.276]

23 The system is then controlled using the most optimistic parameter estimates, i.e., the parameter in the confidence set with the smallest expected cost. [sent-34, score-0.182]

24 The asymptotic convergence of the average cost of OFU for the LQR problem was shown in [6]. [sent-37, score-0.116]

25 This asymptotic result was extended in [1] by providing a bound for the cumulative regret. [sent-38, score-0.151]

26 Assume x(0) = 0 and for a control policy π define the average cost $J_\pi = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} \mathbb{E}[c_t]$ (Eq. 2). [sent-39, score-0.218]

27 Further, define the cumulative regret as $R(T) = \sum_{t=0}^{T} \big(c_\pi(t) - J_*\big)$ (Eq. 3), where $c_\pi(t)$ is the cost of control policy π at time t and $J_* = J(\Theta^0)$ is the optimal average cost. [sent-40, score-0.542]
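
To make the two performance measures concrete, the following is a minimal sketch (not from the paper; the synthetic cost trace and the value of J_* below are illustrative assumptions) of how the empirical average cost and the cumulative regret of Eq. (3) would be computed from a sequence of per-step costs.

```python
import numpy as np

def average_cost(costs):
    # Empirical analogue of J_pi = limsup_{T -> inf} (1/T) sum_t E[c_t] (Eq. 2).
    return float(np.mean(costs))

def cumulative_regret(costs, j_star):
    # R(T) = sum_{t=0}^{T} (c_pi(t) - J_*), per Eq. (3).
    return float(np.sum(np.asarray(costs) - j_star))

# Illustrative usage with a synthetic cost trace.
rng = np.random.default_rng(0)
costs = 1.0 + 0.1 * rng.standard_normal(10_000)   # hypothetical per-step costs
print(average_cost(costs), cumulative_regret(costs, j_star=0.95))
```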

28 The algorithm proposed in [1] is shown to have cumulative regret Õ(√T), where Õ hides the logarithmic factors. [sent-41, score-0.375]

29 While no lower bound was provided for the regret, comparison with the multi-armed bandit problem, where a lower bound of O(√T) was shown for the general case [9], suggests that this scaling with time for the cumulative regret is optimal. [sent-42, score-0.376]

30 The focus of [1] was on scaling of the regret with time horizon T . [sent-43, score-0.222]

31 However, the regret of the proposed algorithm scales poorly with dimension. [sent-44, score-0.199]

32 More specifically, the analysis in [1] proves a regret bound of $R(T) < C\,p^{p+r+2}\sqrt{T}$. [sent-45, score-0.232]

33 The current paper focuses on (many) applications where the state and control dimensions are much larger than the time horizon of interest. [sent-46, score-0.201]

34 A powerful reinforcement learning algorithm for these applications should have regret which depends gracefully on dimension. [sent-47, score-0.226]

35 In particular, [3] proved that under suitable conditions the unknown parameters of a noise driven system (i.e. [sent-52, score-0.206]

36 , no control) whose dynamics are modeled by linear stochastic differential equations can be estimated accurately with as few as O(log(p)) samples. [sent-54, score-0.145]

37 Furthermore, system dynamics would be correlated with past observations through the estimated gain matrix L. [sent-56, score-0.232]

38 Finally, there is no notion of cost in [3] while here we have to obtain bounds on cost and its scaling with p. [sent-57, score-0.189]

39 In this work we extend the result of [3] by showing that under suitable conditions, unknown parameters of sparse high dimensional LQ systems can be accurately estimated with as few as O(log(p + r)) observations. [sent-58, score-0.136]

40 Equipped with this efficient learning method, we show that sparse high dimensional LQ systems can be adaptively controlled with regret Õ(p√T). [sent-59, score-0.364]

41 Therefore, the cumulative cost at time T is bounded as Ω(pT ). [sent-61, score-0.166]

42 Comparing this to our regret bound, we see that for T = polylog(p)·O(1/ε²), the cumulative cost of our algorithm is bounded by (1 + ε) times the optimum cumulative cost. [sent-62, score-0.475]

43 This is in contrast with the result of [1], where the algorithm needs $\Omega(p^{2p})$ steps in order to achieve regret of ε times the optimal cost. [sent-64, score-0.24]

44 Here we are particularly motivated by an emerging field of applications in marketing and advertising. [sent-66, score-0.098]

45 The use of dynamical optimal control models in advertising has a history of at least four decades, cf. [sent-67, score-0.389]

46 In these models, often a partial differential equation is used to describe how advertising expenditure over time translates into sales. [sent-69, score-0.295]

47 The basic problem is to find the advertising expenditure that maximizes the net profit. [sent-70, score-0.264]

48 The focus of these works is to model the temporal dynamics of the advertising expenditure (the control variable) and the variables of interest (sales, goodwill level, etc. [sent-71, score-0.484]

49 There also exists a rich literature studying the spatial interdependence of consumers’ and firms’ behavior to devise marketing schemes [7]. [sent-73, score-0.164]

50 Combination of spatial interdependence and temporal dynamics models for optimal advertising was also considered [16, 15]. [sent-75, score-0.404]

51 A simple temporal dynamics model is extended in [15] by allowing state and control variables to have spatial dependence and introducing a diffusive component in the controlled PDE which describes the spatial dynamics. [sent-76, score-0.435]

52 The controlled PDE is then shown to be equivalent to an abstract linear control system of the form $\frac{dx(t)}{dt} = Ax(t) + Bu(t)$ (Eq. 4). [sent-77, score-0.318]

53 Both [15] and [7] are concerned with the optimal control, and the interactions are either dictated by the model or assumed known. [sent-78, score-0.229]

54 Our work deals with a discrete and noisy version of (4) where the dynamics is to be estimated but is known to be sparse. [sent-79, score-0.114]

55 In the model considered in [15] the state variable x lives in an infinite dimensional space. [sent-80, score-0.089]

56 Spatial models in marketing [7] usually consider state variables which have a large number of dimensions, e.g. [sent-81, score-0.114]

57 High dimensional state space and control is a recurring theme in these applications. [sent-84, score-0.225]

58 At the beginning of episode i the algorithm constructs a confidence set Ω(i) which is guaranteed to include the unknown parameter Θ0 with high probability. [sent-99, score-0.212]

59 The algorithm then chooses Θ(i) ∈ Ω(i) that has the smallest expected cost as the estimated parameter for episode i and applies the optimal control for the estimated parameter during episode i. [sent-100, score-0.617]

60 The confidence set is constructed using observations from the last episode only, but the lengths of the episodes are chosen to increase geometrically, allowing for more accurate estimates and shrinkage of the confidence set by a constant factor at each episode. [sent-101, score-0.176]
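
As a small illustration of the geometric episode schedule just described (the base length and growth factor below are illustrative choices, not the paper's constants):

```python
import math

def episode_starts(num_episodes, base_length=100, growth=2.0):
    # tau_0 = 0 and tau_{i+1} = tau_i + ceil(base_length * growth**i),
    # so episode lengths grow geometrically and the confidence set can
    # shrink by a constant factor per episode.
    taus = [0]
    for i in range(num_episodes):
        taus.append(taus[-1] + math.ceil(base_length * growth ** i))
    return taus

print(episode_starts(5))  # [0, 100, 300, 700, 1500, 3100]
```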

61 Constructing the confidence set: Define τi to be the start of episode i, with τ0 = 0. [sent-103, score-0.149]

62 Let L(i) be the controller that has been chosen for episode i. [sent-104, score-0.332]

63 For t ∈ [τi , τi+1 ) the system is controlled by u(t) = −L(i) x(t) and the system dynamics can be written as x(t + 1) = (A0 − B 0 L(i) )x(t) + w(t + 1). [sent-105, score-0.357]
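
A minimal simulation sketch of this closed-loop dynamics follows (the system matrices, noise scale, and episode length are made-up toy values, not anything prescribed by the paper):

```python
import numpy as np

def run_episode(A0, B0, L, x0, length, rng, noise_std=1.0):
    # Simulate x(t+1) = (A0 - B0 L) x(t) + w(t+1) and record the state trace.
    A_cl = A0 - B0 @ L                     # closed-loop transition matrix
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(length):
        w = noise_std * rng.standard_normal(A0.shape[0])
        xs.append(A_cl @ xs[-1] + w)
    return np.stack(xs)                    # shape: (length + 1, p)

# Toy instance: a stable 3-state, 1-input system.
rng = np.random.default_rng(1)
A0 = 0.5 * np.eye(3)
B0 = np.ones((3, 1))
L = np.zeros((1, 3))                       # regulator chosen for this episode
trace = run_episode(A0, B0, L, x0=np.zeros(3), length=100, rng=rng)
```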

64 do: Apply the control u(t) = −L(i)x(t) until τi+1 − 1 and observe the trace {x(t)} for τi ≤ t < τi+1. [sent-112, score-0.192]

65 The first term in the cost function is the normalized negative log likelihood, which measures the fidelity to the observations, while the second term imposes the sparsity constraint on Θu. [sent-117, score-0.154]
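
One way to realize this row-wise ℓ1-regularized least squares estimator in code is via an off-the-shelf Lasso solver; the sketch below uses scikit-learn and regresses each state coordinate x_u(t+1) on the regressor z(t) = (x(t), u(t)). The regularization weight lam is a free assumption here, not the paper's prescribed schedule.

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_theta(Z, X_next, lam):
    # Row-wise l1-regularized least squares: for each state coordinate u,
    # minimize (1/2n) ||x_u(t+1) - Theta_u z(t)||_2^2 + lam * ||Theta_u||_1,
    # where Z stacks the regressors z(t) and X_next stacks x(t+1).
    p, q = X_next.shape[1], Z.shape[1]
    Theta_hat = np.zeros((p, q))
    for u in range(p):
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000)
        model.fit(Z, X_next[:, u])
        Theta_hat[u] = model.coef_
    return Theta_hat
```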

66 For Θ(1), Θ(2) ∈ Rp×q, define the distance d(Θ(1), Θ(2)) as $d(\Theta^{(1)}, \Theta^{(2)}) = \max_{u \in [p]} \|\Theta^{(1)}_u - \Theta^{(2)}_u\|_2$ (Eq. 7), where Θu is the u-th row of the matrix Θ. [sent-119, score-0.1]

67 Having the estimator Θ(i), the algorithm constructs the confidence set for episode i as $\Omega^{(i)} = \{\Theta \in \mathbb{R}^{p \times q} \mid d(\Theta, \widehat{\Theta}^{(i)}) \le \epsilon\, 2^{-i}\}$ (Eq. 8), where ε > 0 is an input parameter to the algorithm. [sent-122, score-0.149]
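
A direct transcription of the distance of Eq. (7) and the membership test of Eq. (8) (eps and i are the algorithm's inputs; nothing beyond the two displayed formulas is assumed):

```python
import numpy as np

def dist(theta1, theta2):
    # d(Theta1, Theta2) = max over rows u of ||Theta1_u - Theta2_u||_2 (Eq. 7).
    return float(np.max(np.linalg.norm(theta1 - theta2, axis=1)))

def in_confidence_set(theta, theta_hat, eps, i):
    # Omega^(i) = { Theta : d(Theta, Theta_hat^(i)) <= eps * 2**(-i) } (Eq. 8).
    return dist(theta, theta_hat) <= eps * 2.0 ** (-i)
```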

68 Design of the controller: Let J(Θ) be the minimum expected cost if the interaction matrix is Θ = [A, B] and denote by L(Θ) the optimal controller that achieves the expected cost J(Θ). [sent-126, score-0.388]

69 The algorithm implements the OFU principle by choosing, at the beginning of episode i, an estimate $\widetilde{\Theta}^{(i)} \in \arg\min_{\Theta \in \Omega^{(i)}} J(\Theta)$ (Eq. 9). [sent-127, score-0.212]

70 The optimal control corresponding to Θ(i) is then applied during episode i, i.e., u(t) = −L(Θ(i))x(t) for t ∈ [τi, τi+1). [sent-128, score-0.326]

71 Recall that for Θ = [A, B], the optimal controller is given through the following relations: $K(\Theta) = Q + A^\top K(\Theta) A - A^\top K(\Theta) B \,(B^\top K(\Theta) B + R)^{-1} B^\top K(\Theta) A$ (the Riccati equation) and $L(\Theta) = (B^\top K(\Theta) B + R)^{-1} B^\top K(\Theta) A$. [sent-131, score-0.224]
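
These relations suggest a simple fixed-point computation of K(Θ) and the gain L(Θ); the sketch below iterates the Riccati map to convergence. The initialization at Q and the stopping tolerance are implementation choices, and scipy.linalg.solve_discrete_are offers an alternative solver.

```python
import numpy as np

def riccati_gain(A, B, Q, R, iters=1000, tol=1e-10):
    # Iterate K <- Q + A'KA - A'KB (B'KB + R)^{-1} B'KA to a fixed point,
    # then return the gain L = (B'KB + R)^{-1} B'KA, matching the relations above.
    K = Q.copy()
    for _ in range(iters):
        M = B.T @ K @ B + R
        K_next = Q + A.T @ K @ A - A.T @ K @ B @ np.linalg.solve(M, B.T @ K @ A)
        if np.max(np.abs(K_next - K)) < tol:
            K = K_next
            break
        K = K_next
    L = np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)
    return K, L
```

In an OFU step (Eq. 9), one could then score candidates Θ = [A, B] in the confidence set via trace(K(Θ)) (the average LQ cost under unit noise covariance, a standard LQR identity) and keep the minimizer.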

72 3 Main Results In this section we present performance guarantees in terms of cumulative regret and learning accuracy for the presented algorithm. [sent-133, score-0.283]

73 If the closed loop system (A0 − B0L) is stable, then the solution to the above equation (Eq. 10) exists and the state vector x(t) has a normal stationary distribution with covariance Λ. [sent-136, score-0.166]
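
The stationary covariance is the solution of the discrete Lyapunov equation Λ = (A0 − B0L) Λ (A0 − B0L)ᵀ + W; a sketch using SciPy follows, with the noise covariance W = I taken as an illustrative assumption.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def stationary_covariance(A0, B0, L, W=None):
    # Solve Lambda = A_cl Lambda A_cl' + W for the stationary state covariance;
    # the closed loop A_cl = A0 - B0 L must be stable for a solution to exist.
    A_cl = A0 - B0 @ L
    if W is None:
        W = np.eye(A0.shape[0])   # assumed unit noise covariance
    return solve_discrete_lyapunov(A_cl, W)
```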

74 Define L to be (ρ, Cmin, α)-identifiable (with respect to Θ0) if it satisfies the following conditions for all S ⊆ [q], |S| ≤ k. [sent-142, score-0.083]

75 The first condition simply states that if the system is controlled using the regulator L then the closed loop autonomous system is asymptotically stable. [sent-144, score-0.413]

76 The second and third conditions are similar to what is referred to in the sparse signal recovery literature as the mutual incoherence or irrepresentable conditions. [sent-145, score-0.143]

77 Various examples and results exist for the matrix families that satisfy these conditions [18]. [sent-146, score-0.083]

78 There exists a vast literature on the applicability of these conditions and scenarios in which they are known to hold. [sent-156, score-0.083]

79 These conditions are almost necessary for successful recovery by ℓ1 relaxation. [sent-157, score-0.116]

80 For a discussion on these and other similar conditions imposed for sparse signal recovery we refer the reader to [19] and [20] and the references therein. [sent-158, score-0.143]

81 Our first result states that the system can be learned efficiently from its trajectory observations when it is controlled by an identifiable regulator. [sent-160, score-0.237]

82 Our second result states that equipped with an efficient learning algorithm, the LQ system of Eq. [sent-170, score-0.145]

83 (1) can be controlled with regret $\widetilde{O}(p\sqrt{T}\log^{3/2}(1/\delta))$ under suitable assumptions. [sent-171, score-0.335]

84 For a neighborhood $\mathcal{N}_\epsilon(\Theta^0)$ of Θ0, define $\sigma_K(\Theta^0, \epsilon) = \sup_{\Theta \in \mathcal{N}_\epsilon(\Theta^0)} \|K(\Theta)\|_2 \le C$ and $\sigma_L(\Theta^0, \epsilon) = \sup_{\Theta \in \mathcal{N}_\epsilon(\Theta^0)} \|L(\Theta)\|_2$. Also define $\sup_{\Theta \in \mathcal{N}_\epsilon(\Theta^0)} \max(1, \max_j \|L_j(\Theta)\|_2)$. [sent-177, score-0.103]

85 Then, with probability at least 1 − δ the cumulative regret of ALGORITHM (cf. [sent-186, score-0.308]

86 the table) is bounded as $R(T) \le \widetilde{O}\big(p\sqrt{T}\log^{3/2}(1/\delta)\big)$ (Eq. 12), where $\widetilde{O}$ hides the logarithmic factors. [sent-187, score-0.137]

87 We first state a set of sufficient conditions for the solution of the ℓ1-regularized least squares to be within some distance, as defined by d(·, ·), of the true parameter. [sent-191, score-0.183]

88 Subsequently, we prove that these conditions hold with high probability. [sent-192, score-0.108]

89 The following proposition, a proof of which can be found in [20], provides a set of sufficient conditions for the accuracy of the ℓ1-regularized least squares solution. [sent-202, score-0.172]

90 For any 0 < ε < Θmin, if the conditions $\|G\|_\infty \le \frac{\lambda\alpha}{3}$, $\|G_S\|_\infty \le \frac{\epsilon\, C_{\min}}{4k} - \lambda$, $\|H_{S^C S} - H^0_{S^C S}\|_\infty \le \frac{\alpha\, C_{\min}}{12\sqrt{k}}$, and $\|H_{SS} - H^0_{SS}\|_\infty \le \frac{\alpha\, C_{\min}}{12\sqrt{k}}$ hold (Eqs. 15–16), then the ℓ1-regularized least squares solution (5) satisfies $d(\widehat{\Theta}_u, \Theta^0_u) \le \epsilon$. [sent-208, score-0.133]

91 First, we give a decomposition for the gap between the cost obtained by the algorithm and the optimal cost. [sent-255, score-0.149]

92 Notice that the left-hand side is the average cost incurred with initial state x(t) [5, 4]. [sent-260, score-0.124]

93 Technical lemmas: The following lemmas establish upper bounds on C1, C2, C3. [sent-270, score-0.093]

94 Regret bounds for the adaptive control of linear quadratic systems. [sent-304, score-0.264]

95 Adaptive control of linear time invariant systems: the bet on the best principle. [sent-329, score-0.178]

96 Achieving optimality in adaptive control: the bet on the best approach. [sent-346, score-0.109]

97 The Åström-Wittenmark self-tuning regulator revisited and ELS-based adaptive trackers. [sent-365, score-0.146]

98 A new family of optimal adaptive controllers for markov chains. [sent-370, score-0.167]

99 Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. [sent-380, score-0.216]

100 Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). [sent-403, score-0.096]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cmin', 0.616), ('lq', 0.295), ('regret', 0.199), ('advertising', 0.187), ('controller', 0.183), ('hjs', 0.154), ('episode', 0.149), ('qx', 0.146), ('ru', 0.138), ('control', 0.136), ('rp', 0.134), ('hij', 0.107), ('bt', 0.103), ('hss', 0.103), ('ofu', 0.103), ('ft', 0.099), ('system', 0.091), ('controlled', 0.091), ('dynamics', 0.084), ('cumulative', 0.084), ('conditions', 0.083), ('cost', 0.082), ('regulator', 0.079), ('expenditure', 0.077), ('ibrahimi', 0.077), ('stanford', 0.075), ('marketing', 0.072), ('lt', 0.068), ('polylog', 0.068), ('adaptive', 0.067), ('lj', 0.064), ('hs', 0.063), ('identi', 0.06), ('dence', 0.06), ('controllers', 0.059), ('hiding', 0.051), ('interdependence', 0.051), ('uth', 0.051), ('maxj', 0.05), ('rq', 0.05), ('lemma', 0.047), ('dimensional', 0.047), ('dynamic', 0.047), ('event', 0.046), ('lgorithm', 0.045), ('optimism', 0.045), ('log', 0.045), ('state', 0.042), ('bet', 0.042), ('rr', 0.042), ('spatial', 0.041), ('logarithmic', 0.041), ('optimal', 0.041), ('pde', 0.039), ('quadratic', 0.036), ('theorem', 0.035), ('lemmas', 0.034), ('asserts', 0.034), ('lai', 0.034), ('asymptotic', 0.034), ('proposition', 0.034), ('squares', 0.033), ('bound', 0.033), ('recovery', 0.033), ('loop', 0.033), ('unknown', 0.032), ('certainty', 0.032), ('principle', 0.032), ('beginning', 0.031), ('proof', 0.031), ('corollary', 0.031), ('differential', 0.031), ('gs', 0.03), ('estimated', 0.03), ('uenced', 0.029), ('interactions', 0.029), ('pseudo', 0.028), ('states', 0.028), ('reinforcement', 0.027), ('programming', 0.027), ('con', 0.027), ('observations', 0.027), ('matrices', 0.027), ('bandit', 0.027), ('sparse', 0.027), ('optimum', 0.026), ('emerging', 0.026), ('sup', 0.026), ('equipped', 0.026), ('feedback', 0.026), ('decomposition', 0.026), ('bounds', 0.025), ('hold', 0.025), ('max', 0.025), ('least', 0.025), ('row', 0.024), ('automatic', 0.024), ('horizon', 0.023), ('pt', 0.023), ('concerned', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

Author: Morteza Ibrahimi, Adel Javanmard, Benjamin Van Roy

Abstract: We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, for the average cost LQ problem, a regret bound of O(√T) was shown, apart from logarithmic factors. However, this bound scales exponentially with p, the dimension of the state space. In this work we consider the case where the matrices describing the dynamics of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that achieves a regret bound of O(p√T), apart from logarithmic factors. In particular, our algorithm has an average cost of (1 + ε) times the optimum cost after T = polylog(p)·O(1/ε²). This is in comparison to previous work on the dense dynamics, where the algorithm requires time that scales exponentially with dimension in order to achieve regret of ε times the optimal cost. We believe that our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks.

2 0.11855543 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

Author: Ronald Ortner, Daniil Ryabko

Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.

3 0.11362346 295 nips-2012-Risk-Aversion in Multi-armed Bandits

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

4 0.10180202 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is Rn. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting.

5 0.093673281 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1

6 0.088407278 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

7 0.086445473 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

8 0.076267324 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

9 0.075057492 293 nips-2012-Relax and Randomize : From Value to Algorithms

10 0.074598476 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

11 0.074079663 102 nips-2012-Distributed Non-Stochastic Experts

12 0.072309747 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

13 0.068754949 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

14 0.065932624 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

15 0.062879279 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

16 0.062334139 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

17 0.060090851 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

18 0.059849005 344 nips-2012-Timely Object Recognition

19 0.05818747 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

20 0.057883658 254 nips-2012-On the Sample Complexity of Robust PCA


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.167), (1, -0.077), (2, 0.078), (3, 0.026), (4, 0.043), (5, 0.045), (6, -0.005), (7, -0.0), (8, -0.024), (9, 0.016), (10, 0.068), (11, 0.032), (12, -0.106), (13, -0.101), (14, -0.106), (15, 0.036), (16, -0.006), (17, 0.011), (18, -0.044), (19, -0.045), (20, 0.063), (21, 0.002), (22, -0.064), (23, -0.037), (24, 0.035), (25, -0.051), (26, -0.02), (27, -0.012), (28, 0.043), (29, -0.051), (30, 0.025), (31, -0.005), (32, 0.031), (33, 0.045), (34, 0.019), (35, 0.017), (36, -0.012), (37, -0.035), (38, 0.015), (39, 0.004), (40, 0.04), (41, 0.027), (42, 0.01), (43, -0.104), (44, 0.028), (45, 0.011), (46, -0.045), (47, -0.112), (48, -0.051), (49, 0.087)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8904143 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

Author: Morteza Ibrahimi, Adel Javanmard, Benjamin Van Roy

Abstract: We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, for the average cost LQ problem, a regret bound of O(√T) was shown, apart from logarithmic factors. However, this bound scales exponentially with p, the dimension of the state space. In this work we consider the case where the matrices describing the dynamics of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that achieves a regret bound of O(p√T), apart from logarithmic factors. In particular, our algorithm has an average cost of (1 + ε) times the optimum cost after T = polylog(p)·O(1/ε²). This is in comparison to previous work on the dense dynamics, where the algorithm requires time that scales exponentially with dimension in order to achieve regret of ε times the optimal cost. We believe that our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks.

2 0.67779648 295 nips-2012-Risk-Aversion in Multi-armed Bandits

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

3 0.6693157 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

Author: Ronald Ortner, Daniil Ryabko

Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.

4 0.66320199 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric

Abstract: We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms. 1

5 0.63031483 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is Rn. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting.

6 0.58058697 102 nips-2012-Distributed Non-Stochastic Experts

7 0.56633508 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

8 0.56330913 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

9 0.54184473 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

10 0.51876366 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

11 0.49964747 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

12 0.49291447 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

13 0.47940692 161 nips-2012-Interpreting prediction markets: a stochastic approach

14 0.47544712 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

15 0.47330233 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

16 0.46992871 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

17 0.46367773 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

18 0.45353776 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

19 0.45322642 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

20 0.44854259 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.023), (21, 0.031), (36, 0.013), (38, 0.12), (42, 0.03), (54, 0.016), (74, 0.019), (76, 0.13), (80, 0.055), (92, 0.472)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88702404 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy

Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion.

2 0.88118553 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng

Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1

3 0.87186533 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik

Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1

4 0.86826468 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

Author: Yichuan Zhang, Zoubin Ghahramani, Amos J. Storkey, Charles A. Sutton

Abstract: Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems. 1

5 0.83703959 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

Author: Lucas Theis, Jascha Sohl-dickstein, Matthias Bethge

Abstract: We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models. 1

same-paper 6 0.83591712 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

7 0.62501013 329 nips-2012-Super-Bit Locality-Sensitive Hashing

8 0.62415421 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

9 0.59758151 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

10 0.59570444 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

11 0.58752024 94 nips-2012-Delay Compensation with Dynamical Synapses

12 0.58188516 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

13 0.58160359 260 nips-2012-Online Sum-Product Computation Over Trees

14 0.58128321 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

15 0.57940638 65 nips-2012-Cardinality Restricted Boltzmann Machines

16 0.57823479 163 nips-2012-Isotropic Hashing

17 0.57331085 102 nips-2012-Distributed Non-Stochastic Experts

18 0.56313396 148 nips-2012-Hamming Distance Metric Learning

19 0.56215978 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines

20 0.56038511 197 nips-2012-Learning with Recursive Perceptual Representations