
nips-2012-21: A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes


Source: pdf

Author: Thomas Furmston, David Barber

Abstract: Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being popular methods in this field. In this article we provide a unifying perspective of these two algorithms by showing that their search-directions in the parameter space are closely related to the search-direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative optimisation method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton's method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent.

1 Markov Decision Processes

Markov Decision Processes (MDPs) are the most commonly used model for the description of sequential decision making processes in a fully observable environment, see e.g. [5]. An MDP is described by the tuple {S, A, H, p_1, p, π, R}, where S and A are sets known respectively as the state and action space, H ∈ N is the planning horizon, which can be either finite or infinite, and {p_1, p, π, R} are functions referred to as the initial state distribution, transition dynamics, policy (or controller) and the reward function. In general the state and action spaces can be arbitrary sets, but we restrict our attention to either discrete sets or subsets of R^n, where n ∈ N. We use boldface notation to represent a vector and also use the notation z = (s, a) to denote a state-action pair. Given an MDP the trajectory of the agent is determined by the following recursive procedure: given the agent's state, s_t, at a given time-point, t ∈ N_H, an action is selected according to the policy, a_t ∼ π(·|s_t); the agent then transitions to a new state according to the transition dynamics, s_{t+1} ∼ p(·|a_t, s_t); this process is iterated sequentially through all of the time-points in the planning horizon, where the state of the initial time-point is determined by the initial state distribution s_1 ∼ p_1(·). At each time-point the agent receives a (scalar) reward that is determined by the reward function, where this function depends on the current action and state of the environment. Typically the reward function is assumed to be bounded, but as the objective is linear in the reward function we assume w.l.o.g. that it is non-negative.

The most widely used objective in the MDP framework is to maximise the total expected reward of the agent over the course of the planning horizon. This objective can take various forms, including an infinite planning horizon, with either discounted or average rewards, or a finite planning horizon. The theoretical contributions of this paper are applicable to all three frameworks, but for notational ease and for reasons of space we concern ourselves with the infinite horizon framework with discounted rewards. In this framework the boundedness of the objective function is ensured by the introduction of a discount factor, γ ∈ [0, 1), which scales the rewards of the various time-points in a geometric manner.
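To make the trajectory-generation procedure and the discounted objective concrete, here is a minimal sketch that rolls out a small tabular MDP and accumulates the discounted return. The two-state transition matrix, policy and rewards are illustrative placeholders, not taken from the paper, and the infinite horizon is truncated at a finite number of steps.

```python
import numpy as np

def sample_discounted_return(p1, P, pi, R, gamma, T=200, rng=None):
    """Roll out one trajectory of a tabular MDP and return its discounted reward.

    p1 : (S,) initial state distribution
    P  : (S, A, S') transition dynamics, P[s, a, s'] = p(s' | a, s)
    pi : (S, A) policy, pi[s, a] = pi(a | s)
    R  : (S, A) reward function R(a, s), non-negative as assumed in the text
    """
    rng = np.random.default_rng() if rng is None else rng
    s = rng.choice(len(p1), p=p1)                  # s_1 ~ p_1(.)
    total, discount = 0.0, 1.0
    for _ in range(T):                             # truncation of the infinite horizon
        a = rng.choice(pi.shape[1], p=pi[s])       # a_t ~ pi(. | s_t)
        total += discount * R[s, a]                # add gamma^{t-1} R(a_t, s_t)
        s = rng.choice(P.shape[2], p=P[s, a])      # s_{t+1} ~ p(. | a_t, s_t)
        discount *= gamma
    return total

# Toy two-state, two-action MDP with made-up numbers.
p1 = np.array([1.0, 0.0])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
print(sample_discounted_return(p1, P, pi, R, gamma=0.95))
```

Averaging such returns over many trajectories gives a Monte-Carlo estimate of the objective defined formally below.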
Writing the objective function and trajectory distribution directly in terms of the parameter vector then, for any w ∈ W, the objective function takes the form

    U(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(a,s;w)} \left[ \gamma^{t-1} R(a,s) \right],    (1)

where we have denoted the parameter space by W and have used the notation p_t(a, s; w) to represent the marginal p(s_t = s, a_t = a; w) of the joint state-action trajectory distribution

    p(a_{1:H}, s_{1:H}; w) = \pi(a_H|s_H; w) \Big[ \prod_{t=1}^{H-1} p(s_{t+1}|a_t, s_t)\, \pi(a_t|s_t; w) \Big] p_1(s_1), \quad H \in \mathbb{N}.    (2)

Note that the policy is now written in terms of its parametric representation, π(a|s; w).

It is well known that the global optimum of (1) can be obtained through dynamic programming, see e.g. [5]. However, due to various issues, such as prohibitively large state-action spaces or highly non-linear transition dynamics, it is not possible to find the global optimum of (1) in most real-world problems of interest. Instead most research in this area focuses on obtaining approximate solutions, for which there exist numerous techniques, such as approximate dynamic programming methods [6], Monte-Carlo tree search methods [19] and policy search methods, both parametric [27, 21, 16, 18] and non-parametric [2, 25]. This work is focused solely on parametric policy search methods, by which we mean gradient-based methods, such as steepest and natural gradient ascent [23, 1], along with Expectation Maximisation [11], which is a bound optimisation technique from the statistics literature. Since their introduction [14, 31, 10, 16] these methods have been the centre of a large amount of research, with much of it focusing on gradient estimation [21, 4], variance reduction techniques [30, 15], function approximation techniques [27, 8, 20] and real-world applications [18, 26]. While steepest gradient ascent has enjoyed some success, it is known to suffer from substantial issues that often make it unattractive in practice, such as slow convergence and susceptibility to poor scaling of the objective function [23]. Various optimisation methods have been introduced as alternatives, most notably natural gradient ascent [16, 24, 3] and Expectation Maximisation [18, 28], which are currently the methods of choice among parametric policy search algorithms. In this paper our primary focus is on the search-direction (in the parameter space) of these two methods.

2 Search Direction Analysis

In this section we will perform a novel analysis of the search-direction of both natural gradient ascent and Expectation Maximisation. In gradient-based algorithms for Markov Decision Processes the update of the policy parameters takes the form

    w_{\text{new}} = w + \alpha M(w) \nabla_w U(w),    (3)

where α ∈ R^+ is the step-size parameter and M(w) is some positive-definite matrix that possibly depends on w. It is well known that such an update will increase the total expected reward provided that α is sufficiently small, and this process will converge to a local optimum of (1) provided the step-size sequence is appropriately selected. While EM does not have an update of the form given in (3), we shall see that the algorithm is closely related to such an update. It is convenient for later reference to note that the gradient ∇_w U(w) can be written in the following form

    \nabla_w U(w) = \mathbb{E}_{p_\gamma(z;w) Q(z;w)} \left[ \nabla_w \log \pi(a|s; w) \right],    (4)

where we use the expectation notation E[·] to denote the integral/summation w.r.t. a non-negative function.
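As an illustration of (3) and (4), the sketch below forms a standard likelihood-ratio (Monte-Carlo) estimate of the gradient from sampled trajectories and applies one update of the form (3). The discounted reward-to-go is used as a sample of the weighting p_γ(z; w) Q(z; w), whose two factors are defined in the paragraph that follows; the trajectory format and the grad_log_pi callback are assumptions of this sketch, not notation from the paper.

```python
import numpy as np

def gradient_estimate(trajs, grad_log_pi, gamma):
    """Monte-Carlo estimate of the gradient in (4).

    trajs       : list of trajectories, each a list of (s, a, r) tuples
    grad_log_pi : callable (s, a) -> grad_w log pi(a | s; w), a 1-d array
    Each grad log pi(a_t|s_t; w) is weighted by gamma^{t-1} times the
    discounted reward-to-go from time t, which serves as a sample of Q.
    """
    g = None
    for traj in trajs:
        rewards = np.array([r for (_, _, r) in traj])
        T = len(traj)
        q = np.array([np.sum(rewards[t:] * gamma ** np.arange(T - t)) for t in range(T)])
        for t, (s, a, _) in enumerate(traj):
            contrib = (gamma ** t) * q[t] * grad_log_pi(s, a)   # gamma^{t-1} with 1-based t
            g = contrib if g is None else g + contrib
    return g / len(trajs)

def parameter_update(w, grad_U, M, alpha):
    """One update of the form (3): w_new = w + alpha * M(w) grad_w U(w)."""
    return w + alpha * (M @ grad_U)
```

Steepest gradient ascent corresponds to taking M as the identity; the methods analysed below differ only in their choice of M.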
The term p_γ(z; w) is a geometric weighted average of state-action occupancy marginals given by

    p_\gamma(z; w) = \sum_{t=1}^{\infty} \gamma^{t-1} p_t(z; w),

while the term Q(z; w) is referred to as the state-action value function and is equal to the total expected future reward from the current time-point onwards, given the current state-action pair, z, and parameter vector, w, i.e.

    Q(z; w) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(z'; w)} \left[ \gamma^{t-1} R(z') \,\middle|\, z_1 = z \right].

This is a standard result and due to reasons of space we have omitted the details, but see e.g. [27] or section(6.1) of the supplementary material for more details.

An immediate issue concerning updates of the form (3) is the selection of the 'optimal' matrix M(w), which clearly depends on the sense in which optimality is defined. There are numerous reasonable properties that are desirable of such an update, including the numerical stability and computational complexity of the parameter update, as well as the rate of convergence of the overall algorithm resulting from these updates. While these are all reasonable criteria, the rate of convergence is of such importance in an optimisation algorithm that it is a logical starting point in our analysis. For this reason we concern ourselves with relating these two parametric policy search algorithms to the Newton method, which has the highly desirable property of a quadratic rate of convergence in the vicinity of a local optimum. The Newton method is well known to suffer from problems that make it either infeasible or unattractive in practice, but in terms of forming a basis for theoretical comparisons it is a logical starting point. We shall discuss some of the issues with the Newton method in more detail in section(3).

In the Newton method the matrix M(w) is set to the negative inverse Hessian, i.e. M(w) = -H^{-1}(w), where H(w) = ∇_w ∇_w^T U(w) denotes the Hessian. Using methods similar to those used to calculate the gradient, it can be shown that the Hessian takes the form

    H(w) = H_1(w) + H_2(w),    (5)

where

    H_1(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p(z_{1:t}; w)} \left[ \gamma^{t-1} R(z_t)\, \nabla_w \log p(z_{1:t}; w)\, \nabla_w^T \log p(z_{1:t}; w) \right],    (6)

    H_2(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p(z_{1:t}; w)} \left[ \gamma^{t-1} R(z_t)\, \nabla_w \nabla_w^T \log p(z_{1:t}; w) \right].    (7)

We have omitted the details of the derivation, but these can be found in section(6.2) of the supplementary material, with a similar derivation of a sample-based estimate of the Hessian given in [4].
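The decomposition (5)-(7) can be estimated from sampled trajectories in much the same way as the gradient. Because of the Markovian form of (2), ∇_w log p(z_{1:t}; w) = Σ_{τ≤t} ∇_w log π(a_τ|s_τ; w), and similarly for the Hessian, so both terms can be accumulated from per-step derivatives of the log-policy. The sketch below assumes the same trajectory format and callbacks as before and is illustrative only, not the estimator used in the paper's experiments.

```python
import numpy as np

def hessian_term_estimates(trajs, grad_log_pi, hess_log_pi, gamma):
    """Sample estimates of H1(w) and H2(w) in (6) and (7)."""
    H1 = H2 = None
    for traj in trajs:
        G = None       # running grad_w log p(z_{1:t}; w)
        Hlog = None    # running hess_w log p(z_{1:t}; w)
        for t, (s, a, r) in enumerate(traj):
            g_t, h_t = grad_log_pi(s, a), hess_log_pi(s, a)
            G = g_t if G is None else G + g_t
            Hlog = h_t if Hlog is None else Hlog + h_t
            weight = (gamma ** t) * r                   # gamma^{t-1} R(z_t), 1-based t
            h1_c, h2_c = weight * np.outer(G, G), weight * Hlog
            H1 = h1_c if H1 is None else H1 + h1_c
            H2 = h2_c if H2 is None else H2 + h2_c
    return H1 / len(trajs), H2 / len(trajs)
```

Note that H1 is built from outer products of the accumulated log-trajectory gradient, which is why it is typically dense, whereas H2 inherits any structure of the log-policy Hessian; this distinction reappears in section(3).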
2.1 Natural Gradient Ascent

To overcome some of the issues that can hinder steepest gradient ascent an alternative, natural gradient, was introduced in [16]. Natural gradient ascent techniques originated in the neural network and blind source separation literature, see e.g. [1], and take the perspective that the parameter space has a Riemannian manifold structure, as opposed to a Euclidean structure. Deriving the steepest ascent direction of U(w) w.r.t. a local norm defined on this parameter manifold (as opposed to w.r.t. the Euclidean norm, which is the case in steepest gradient ascent) results in natural gradient ascent. We denote the quadratic form that induces this local norm on the parameter manifold by G(w), i.e. d(w)^2 = w^T G(w) w. The derivation of natural gradient ascent is well known, see e.g. [1], and its application to the objective (1) results in a parameter update of the form

    w_{k+1} = w_k + \alpha_k G^{-1}(w_k) \nabla_w U(w_k).

In terms of (3) this corresponds to M(w) = G^{-1}(w). In the case of MDPs the most commonly used local norm is given by the Fisher information matrix of the trajectory distribution, see e.g. [3, 24], and due to the Markovian structure of the dynamics it is given by

    G(w) = -\mathbb{E}_{p_\gamma(z;w)} \left[ \nabla_w \nabla_w^T \log \pi(a|s; w) \right].    (8)

We note that there is an alternate, but equivalent, form of writing the Fisher information matrix, see e.g. [24], but we do not use it in this work.

In order to relate natural gradient ascent to the Newton method we first rewrite the matrix (7) into the following form

    H_2(w) = \mathbb{E}_{p_\gamma(z;w) Q(z;w)} \left[ \nabla_w \nabla_w^T \log \pi(a|s; w) \right].    (9)

For reasons of space the details of this reformulation of (7) are left to section(6.2) of the supplementary material. Comparing the Fisher information matrix (8) with the form of H_2(w) given in (9), it is clear that natural gradient ascent has a relationship with the approximate Newton method that uses H_2(w) in place of H(w). In terms of (3) this approximate Newton method corresponds to setting M(w) = -H_2^{-1}(w). In particular it can be seen that the difference between the two methods lies in the non-negative function w.r.t. which the expectation is taken in (8) and (9). (It also appears that there is a difference in sign, but observing the form of M(w) for each algorithm shows that this is not the case.) In the Fisher information matrix the expectation is taken w.r.t. the geometrically weighted summation of state-action occupancy marginals of the trajectory distribution, while in H_2(w) there is an additional weighting from the state-action value function. Hence, H_2(w) incorporates information about the reward structure of the objective function, whereas the Fisher information matrix does not, and so it will generally contain more information about the curvature of the objective function.

2.2 Expectation Maximisation

The Expectation Maximisation algorithm, or EM-algorithm, is a powerful optimisation technique from the statistics literature, see e.g. [11], that has recently been the centre of much research in the planning and reinforcement learning communities, see e.g. [10, 28, 18]. A quantity of central importance in the EM-algorithm for MDPs is the following lower bound on the log-objective

    \log U(w) \ge H_{\text{entropy}}(q(z_{1:t}, t)) + \mathbb{E}_{q(z_{1:t}, t)} \left[ \log \gamma^{t-1} R(z_t)\, p(z_{1:t}; w) \right],    (10)

where H_entropy is the entropy function and q(z_{1:t}, t) is known as the 'variational distribution'. Further details of the EM-algorithm for MDPs and a derivation of (10) are given in section(6.3) of the supplementary material or can be found in e.g. [18, 28]. The parameter update of the EM-algorithm is given by the maximum (w.r.t. w) of the 'energy' term,

    Q(w, w_k) = \mathbb{E}_{p_\gamma(z; w_k) Q(z; w_k)} \left[ \log \pi(a|s; w) \right].    (11)

Note that Q is a two-parameter function, where the first parameter occurs inside the expectation and the second parameter defines the non-negative function w.r.t. which the expectation is taken. This decoupling allows the maximisation over w to be performed explicitly in many cases of interest. For example, when the log-policy is quadratic in w the maximisation problem is equivalent to a least-squares problem and the optimum can be found through solving a linear system of equations.

It has previously been noted, again see e.g. [18], that the parameter update of steepest gradient ascent and the EM-algorithm can be related through this 'energy' term. In particular the gradient (4) evaluated at w_k can also be written as ∇_w|_{w=w_k} U(w) = ∇_w^{(1,0)} Q(w_k, w_k), where we use the notation ∇_w^{(1,0)} to denote the first derivative w.r.t. the first parameter, while the update of the EM-algorithm is given by w_{k+1} = argmax_{w ∈ W} Q(w, w_k).
In other words, steepest gradient ascent moves in the direction that most rapidly increases Q(w, w_k) w.r.t. the first variable, while the EM-algorithm maximises Q(w, w_k) w.r.t. the first variable. While this relationship is true, it is also quite a negative result. It states that in situations where it is not possible to explicitly perform the maximisation over w in (11), the alternative, in terms of the EM-algorithm, is a generalised EM-algorithm, which is equivalent to steepest gradient ascent. Considering that algorithms such as EM are typically considered because of the negative aspects related to steepest gradient ascent, this is an undesirable alternative. It is possible to find the optimum of (11) numerically, but this is also undesirable as it results in a double-loop algorithm that could be computationally expensive. Finally, this result provides no insight into the behaviour of the EM-algorithm, in terms of the direction of its parameter update, when the maximisation over w in (11) can be performed explicitly.

Instead we provide the following result, which shows that the step-direction of the EM-algorithm has an underlying relationship with the Newton method. In particular we show that, under suitable regularity conditions, the direction of the EM-update, i.e. w_{k+1} - w_k, is the same, up to first order, as the direction of an approximate Newton method that uses H_2(w) in place of H(w).

Theorem 1. Suppose we are given a Markov Decision Process with objective (1) and Markovian trajectory distribution (2). Consider the update of the parameter through Expectation Maximisation at the k-th iteration of the algorithm, i.e. w_{k+1} = argmax_{w ∈ W} Q(w, w_k). Provided that Q(w, w_k) is twice continuously differentiable in the first parameter we have that

    w_{k+1} - w_k = -H_2^{-1}(w_k) \nabla_w|_{w=w_k} U(w) + O(\|w_{k+1} - w_k\|^2).    (12)

Additionally, in the case where the log-policy is quadratic the relation to the approximate Newton method is exact, i.e. the second term on the r.h.s. of (12) is zero.

Proof. The idea of the proof is simple and only involves performing a Taylor expansion of ∇_w^{(1,0)} Q(w, w_k). As Q is assumed to be twice continuously differentiable in the first component, this Taylor expansion is possible and gives

    \nabla_w^{(1,0)} Q(w_{k+1}, w_k) = \nabla_w^{(1,0)} Q(w_k, w_k) + \nabla_w^{(2,0)} Q(w_k, w_k)\,(w_{k+1} - w_k) + O(\|w_{k+1} - w_k\|^2).    (13)

As w_{k+1} = argmax_{w ∈ W} Q(w, w_k) it follows that ∇_w^{(1,0)} Q(w_{k+1}, w_k) = 0. This means that, upon ignoring higher order terms in w_{k+1} - w_k, the Taylor expansion (13) can be rewritten into the form

    w_{k+1} - w_k = -\big[\nabla_w^{(2,0)} Q(w_k, w_k)\big]^{-1} \nabla_w^{(1,0)} Q(w_k, w_k).    (14)

The proof is completed by observing that ∇_w^{(1,0)} Q(w_k, w_k) = ∇_w|_{w=w_k} U(w) and ∇_w^{(2,0)} Q(w_k, w_k) = H_2(w_k). The second statement follows because in the case where the log-policy is quadratic the higher order terms in the Taylor expansion vanish.

2.3 Summary

In this section we have provided a novel analysis of both natural gradient ascent and Expectation Maximisation when applied to the MDP framework. Previously, while both of these algorithms have proved popular methods for MDP optimisation, there has been little understanding of them in terms of their search-direction in the parameter space or their relation to the Newton method. Firstly, our analysis shows that the Fisher information matrix, which is used in natural gradient ascent, is similar to H_2(w) in (5), with the exception that the information about the reward structure of the problem is not contained in the Fisher information matrix, while such information is contained in H_2(w). Additionally, we have shown that the step-direction of the EM-algorithm is, up to first order, that of an approximate Newton method that uses H_2(w) in place of H(w) and employs a constant step-size of one.
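The exact case of Theorem 1 is easy to check numerically. In the sketch below the log-policy is quadratic in w (a Gaussian with mean w^T φ(s) and fixed variance), and a set of arbitrary non-negative weights stands in for the expectation w.r.t. p_γ(z; w_k) Q(z; w_k); under these weights, (4) and (9) reduce to the gradient and Hessian of the same weighted energy term, which is all the check exercises. All numbers, the feature dimension and the policy form are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic weighted sample standing in for p_gamma(z; w_k) Q(z; w_k):
# phi[i] is a state feature vector, a[i] an action, rho[i] >= 0 a weight.
# Policy: pi(a|s; w) = N(a | w^T phi(s), sigma2), so log pi is quadratic in w.
n, d, sigma2 = 50, 3, 0.5
phi = rng.normal(size=(n, d))
a = rng.normal(size=n)
rho = rng.uniform(0.1, 1.0, size=n)
w_k = rng.normal(size=d)

# EM update: maximise Q(w, w_k) = sum_i rho_i log N(a_i | w^T phi_i, sigma2),
# a weighted least-squares problem solved via a linear system.
A = (phi * rho[:, None]).T @ phi
b = (phi * rho[:, None]).T @ a
w_em = np.linalg.solve(A, b)

# Approximate Newton step with a step-size of one:
# H2 = -A / sigma2,  grad U at w_k = phi^T diag(rho) (a - phi w_k) / sigma2.
grad_U = (phi * rho[:, None]).T @ (a - phi @ w_k) / sigma2
H2 = -A / sigma2
w_newton = w_k - np.linalg.solve(H2, grad_U)

print(np.allclose(w_em, w_newton))   # True: the two updates coincide exactly
```

The only difference the next section exploits is that the approximate Newton direction can be combined with an arbitrary step-size sequence, whereas EM implicitly fixes the step at one.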
3 An Approximate Newton Method

A natural follow-on from the analysis in section(2) is the consideration of using M(w) = -H_2^{-1}(w) in (3), a method we call the full approximate Newton method from this point onwards. In this section we show that this method has many desirable properties that make it an attractive alternative to other parametric policy search methods. Additionally, denoting the diagonal matrix formed from the diagonal elements of H_2(w) by D_2(w), we shall also consider the method that uses M(w) = -D_2^{-1}(w) in (3). We call this second method the diagonal approximate Newton method.

Recall that in (3) it is necessary that M(w) is positive-definite (in the Newton method this corresponds to requiring the Hessian to be negative-definite) to ensure an increase of the objective. In general the objective (1) is not concave, which means that the Hessian will not be negative-definite over the entire parameter space. In such cases the Newton method can actually lower the objective, and this is an undesirable aspect of the Newton method. An attractive property of the approximate Hessian, H_2(w), is that it is always negative-definite when the policy is log-concave in the policy parameters. This fact follows from the observation that in such cases H_2(w) is a non-negative mixture of negative-definite matrices, which again is negative-definite [9]. Additionally, the diagonal terms of a negative-definite matrix are negative and so D_2(w) is also negative-definite when the controller is log-concave.

To motivate this result we now briefly consider some widely used policies that are either log-concave or blockwise log-concave. Firstly, consider the Gibbs policy, π(a|s; w) ∝ exp(w^T φ(a, s)), where φ(a, s) ∈ R^{n_w} is a feature vector. This policy is widely used in discrete systems and is log-concave in w, which can be seen from the fact that log π(a|s; w) is the sum of a linear term and a negative log-sum-exp term, both of which are concave [9]. In systems with a continuous state-action space a common choice of controller is π(a|s; w_mean, Σ) = N(a | Kφ(s) + m, Σ(s)), where w_mean = {K, m} and φ(s) ∈ R^{n_w} is a feature vector. The notation Σ(s) is used because there are cases where it is beneficial to have state-dependent noise in the controller. This controller is not jointly log-concave in w_mean and Σ, but it is blockwise log-concave in w_mean and Σ^{-1}. In terms of w_mean the log-policy is quadratic and the coefficient matrix of the quadratic term is negative-definite. In terms of Σ^{-1} the log-policy consists of a linear term and a log-determinant term, both of which are concave.

In terms of evaluating the search direction it is clear from the forms of D_2(w) and H_2(w) that many of the pre-existing gradient evaluation techniques can be extended to the approximate Newton framework with little difficulty. In particular, gradient evaluation requires calculating the expectation of the derivative of the log-policy w.r.t. p_γ(z; w) Q(z; w). In terms of inference, the only additional calculation necessary to implement either the full or diagonal approximate Newton methods is the calculation of the expectation (w.r.t. the same function) of the Hessian of the log-policy, or its diagonal terms.
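For the Gibbs policy above, both quantities have simple closed forms: the gradient of the log-policy is the feature vector minus its expectation under the policy, and the Hessian is minus the covariance of the features at that state, hence negative semi-definite. The sketch below assumes a per-state feature matrix whose rows are φ(a, s) for each action; that representation is an assumption of the sketch, not notation from the paper.

```python
import numpy as np

def gibbs_log_policy_derivatives(w, features):
    """Gradient and Hessian of log pi(a|s; w) for the Gibbs policy
    pi(a|s; w) proportional to exp(w^T phi(a, s)).

    features : (A, d) array whose rows are phi(a, s) for each action a at a
               fixed state s.
    Returns one gradient row per action, phi(a, s) - E_pi[phi(., s)], and the
    shared Hessian -Cov_pi[phi(., s)], which is negative semi-definite.
    """
    logits = features @ w
    logits -= logits.max()                        # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    mean_phi = p @ features                       # E_pi[phi(., s)]
    centred = features - mean_phi
    grads = centred                               # grad_w log pi(a|s; w), one row per action
    hess = -(centred * p[:, None]).T @ centred    # -Cov_pi[phi(., s)]
    return grads, hess
```

Plugging these per-step derivatives into the sample estimators sketched earlier yields an estimate of H_2(w), and its diagonal gives D_2(w), at essentially the cost of a gradient evaluation.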
As an example, in section(6.5) of the supplementary material we detail the extension of the recurrent state formulation of gradient evaluation in the average reward framework, see e.g. [31], to the approximate Newton method. We use this extension in the Tetris experiment that we consider in section(4). Given n_s samples and n_w parameters, the complexity of these extensions scales as O(n_s n_w) for the diagonal approximate Newton method, while it scales as O(n_s n_w^2) for the full approximate Newton method.

An issue with the Newton method is the inversion of the Hessian matrix, which scales as O(n_w^3) in the worst case. In the standard application of the Newton method this inversion has to be performed at each iteration, and in large parameter systems this becomes prohibitively costly. In general H(w) will be dense and no computational savings will be possible when performing this matrix inversion. The same is not true, however, of the matrices D_2(w) and H_2(w). Firstly, as D_2(w) is a diagonal matrix it is trivial to invert. Secondly, there is an immediate source of sparsity that comes from taking the second derivative of the log-trajectory distribution in (7). This property ensures that any (product) sparsity over the control parameters in the log-trajectory distribution will correspond to sparsity in H_2(w). For example, in a partially observable Markov Decision Process where the policy is modelled through a finite state controller, see e.g. [22], there are three functions to be optimised, namely the initial belief distribution, the belief transition dynamics and the policy. When the parameters of these three functions are independent, H_2(w) will be block-diagonal (across the parameters of the three functions) and the matrix inversion can be performed more efficiently by inverting each of the block matrices individually. The reason that H(w) does not exhibit any such sparsity properties is due to the term H_1(w) in (5), which consists of a non-negative mixture of outer-product matrices. The vector in these outer-products is the derivative of the log-trajectory distribution, and this typically produces a dense matrix.

An undesirable aspect of steepest gradient ascent is that its performance is affected by the choice of basis used to represent the parameter space. An important and desirable property of the Newton method is that it is invariant to non-singular linear (affine) transformations of the parameter space, see e.g. [9]. This means that given a non-singular linear (affine) mapping T ∈ R^{n_w × n_w}, the Newton update of the objective Ũ(w) = U(Tw) is related to the Newton update of the original objective through the same linear (affine) mapping, i.e. v + Δv_nt = T(w + Δw_nt), where v = Tw and Δv_nt and Δw_nt denote the respective Newton steps. In other words, running the Newton method on U(w) and Ũ(T^{-1}w) will give identical results. An important point to note is that this desirable property is maintained when using H_2(w) in an approximate Newton method, while using D_2(w) results in a method that is invariant to rescaling of the parameters, i.e. where T is a diagonal matrix with non-zero elements along the diagonal. This can be seen by using the linearity of the expectation operator; a proof of this statement is provided in section(6.4) of the supplementary material.
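To make the cost comparison concrete, here is a sketch of the two parameter updates, given sample estimates of the gradient and of H_2(w); the full update is a dense linear solve while the diagonal update only touches the diagonal. This is an illustrative fragment under those assumptions, not the implementation used in the experiments.

```python
import numpy as np

def approx_newton_updates(w, grad_U, H2, alpha):
    """Full and diagonal approximate Newton updates of the form (3).

    Full:     w + alpha * (-H2)^{-1} grad_U         (linear solve, O(n_w^3) worst case)
    Diagonal: w + alpha * (-diag(H2))^{-1} grad_U   (elementwise, O(n_w))
    """
    full_step = np.linalg.solve(-H2, grad_U)
    diag_step = grad_U / (-np.diag(H2))
    return w + alpha * full_step, w + alpha * diag_step
```

When H_2(w) is block-diagonal, as in the finite-state-controller example above, the full update can instead be computed block by block.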
[Figure 1, two panels: (a) 'Policy Trace' (axes θ1, θ2); (b) 'Tetris Problem' (completed lines against training iterations).] Figure 1: (a) An empirical illustration of the affine invariance of the approximate Newton method, performed on the two state MDP of [16]. The plot shows the trace of the policy during training for the two different parameter spaces, where the results of the latter have been mapped back into the original parameter space for comparison. The plot shows the two steepest gradient ascent traces (blue cross and blue circle) and the two traces of the full approximate Newton method (red cross and red circle). (b) Results of the Tetris problem for steepest gradient ascent (black), natural gradient ascent (green), the diagonal approximate Newton method (blue) and the full approximate Newton method (red).

4 Experiments

The first experiment we performed was an empirical illustration that the full approximate Newton method is invariant to linear transformations of the parameter space. We considered the simple two state example of [16] as it allows us to plot the trace of the policy during training, since the policy has only two parameters. The policy was trained using both steepest gradient ascent and the full approximate Newton method, in both the original and a linearly transformed parameter space. The policy traces of the two algorithms are plotted in figure(1.a). As expected, steepest gradient ascent is affected by such mappings, whilst the full approximate Newton method is invariant to them.

The second experiment was aimed at demonstrating the scalability of the full and diagonal approximate Newton methods to problems with a large state space. We considered the Tetris domain, a popular computer game designed by Alexey Pajitnov in 1985; see [12] for more details. Firstly, we compared the performance of the full and diagonal approximate Newton methods to other parametric policy search methods. Tetris is typically played on a 20 × 10 grid, but due to computational costs we considered a 10 × 10 grid in the experiment. This results in a state space with roughly 7 × 2^100 states. We modelled the policy through a Gibbs distribution, where we considered a feature vector with the following features: the heights of each column, the difference in heights between adjacent columns, the maximum height and the number of 'holes'. Under this policy it is not possible to obtain the explicit maximum over w in (11), and so a straightforward application of EM is not possible in this problem. We therefore compared the diagonal and full approximate Newton methods with steepest and natural gradient ascent. Due to reasons of space the exact implementation of the experiment is detailed in section(6.6) of the supplementary material. We ran 100 repetitions of the experiment, each consisting of 100 training iterations, and the mean and standard error of the results are given in figure(1.b). It can be seen that the full approximate Newton method outperforms all of the other methods, while the performance of the diagonal approximate Newton method is comparable to natural gradient ascent. We also ran several training runs of the full approximate Newton method on the full-sized 20 × 10 board and were able to obtain a score in the region of 14,000 completed lines, which was obtained after roughly 40 training iterations.
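The Tetris feature vector described above is straightforward to compute from a binary board. The sketch below is one plausible reading of those features; the board encoding, the use of absolute height differences and the definition of a hole as an empty cell below a column's top are assumptions of this sketch, since the paper does not give these details.

```python
import numpy as np

def tetris_features(board):
    """Feature vector in the spirit of the Tetris experiment: column heights,
    absolute differences between adjacent column heights, the maximum height
    and the number of holes.

    board : (rows, cols) binary array, row 0 at the top, 1 = occupied cell.
    """
    rows, cols = board.shape
    heights = np.zeros(cols)
    holes = 0
    for c in range(cols):
        filled = np.flatnonzero(board[:, c])
        if filled.size:
            top = filled[0]
            heights[c] = rows - top
            holes += np.sum(board[top:, c] == 0)   # empty cells below the column top
    diffs = np.abs(np.diff(heights))
    return np.concatenate([heights, diffs, [heights.max()], [holes]])

# Example on a small 5 x 4 board.
board = np.array([[0, 0, 0, 0],
                  [0, 1, 0, 0],
                  [1, 1, 0, 1],
                  [1, 0, 1, 1],
                  [1, 1, 1, 1]])
print(tetris_features(board))   # heights, height differences, max height, holes
```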
An approximate dynamic programming based method has previously been applied to the Tetris domain in [7]. The same set of features was used and a score of roughly 4,500 completed lines was obtained after around 6 training iterations, after which the solution then deteriorated.

In the third experiment we considered a finite horizon (controlled) linear dynamical system. This allowed the search-directions of the various algorithms to be computed exactly using [13] and removed any issues of approximate inference from the comparison. In particular we considered a 3-link rigid manipulator, linearised through feedback linearisation, see e.g. [17]. This system has a 6-dimensional state space, 3-dimensional action space and a 22-dimensional parameter space. Further details of the system can be found in section(6.7) of the supplementary material.

[Figure 2, two panels of normalised total expected reward: (a) 'Model-Based Linear System', against training time; (b) 'Model-Free Non-Linear System', against training iterations.] Figure 2: (a) The normalised total expected reward plotted against training time, in seconds, for the 3-link rigid manipulator. The plot shows the results for steepest gradient ascent (black), EM (blue), natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the mean and standard error of the results. (b) The normalised total expected reward plotted against training iterations for the synthetic non-linear system of [29]. The plot shows the results for EM (blue), steepest gradient ascent (black), natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the mean and standard error of the results.

We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.a). In this experiment the approximate Newton method found substantially better solutions than either steepest gradient ascent, natural gradient ascent or Expectation Maximisation. The superiority of the results in comparison to either steepest or natural gradient ascent can be explained by the fact that H_2(w) gives a better estimate of the curvature of the objective function. Expectation Maximisation performed poorly in this experiment, exhibiting sub-linear convergence. Steepest gradient ascent performed 3684 ± 314 training iterations in this experiment which, in comparison to the 203 ± 34 and 310 ± 40 iterations of natural gradient ascent and the approximate Newton method respectively, illustrates the susceptibility of this method to poor scaling.

In the final experiment we considered the synthetic non-linear system of [29]. Full details of the system and the experiment can be found in section(6.8) of the supplementary material. We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.b). Again the approximate Newton method outperforms both steepest and natural gradient ascent. In this example only the mean parameters of the Gaussian controller are optimised, while the parameters of the noise are held fixed, which means that the log-policy is quadratic in the policy parameters. Hence, in this example the EM-algorithm is a particular (less general) version of the approximate Newton method, where a fixed step-size of one is used throughout. The marked difference in performance between the EM-algorithm and the approximate Newton method shows the benefit of being able to tune the step-size sequence. In this experiment we considered five different step-size sequences for the approximate Newton method and all of them obtained results superior to the EM-algorithm. In contrast, only one of the seven step-size sequences considered for steepest and natural gradient ascent outperformed the EM-algorithm.
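The paper does not specify how its step-size sequences were chosen. As a simple illustration of the extra freedom the approximate Newton direction offers over EM's implicit step of one, the following sketch evaluates a few candidate step sizes along a given direction and keeps the best; the candidate grid and the objective estimate U_hat are assumptions of this sketch.

```python
def tuned_step(w, direction, U_hat, step_sizes=(2.0, 1.0, 0.5, 0.25, 0.1)):
    """Pick a step size along the approximate Newton direction by evaluating
    an objective estimate U_hat at a few candidates; for quadratic log-policies
    EM corresponds to always taking a step of exactly 1 along this direction.
    """
    candidates = [w + a * direction for a in step_sizes]
    return max(candidates, key=U_hat)
```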
5 Conclusion

The contributions of this paper are twofold: firstly, we have given a novel analysis of Expectation Maximisation and natural gradient ascent when applied to the MDP framework, showing that both have close connections to an approximate Newton method; secondly, prompted by this analysis, we have considered the direct application of this approximate Newton method to the optimisation of MDPs, showing that it has numerous desirable properties that are not present in the naive application of the Newton method. In terms of empirical performance we have found the approximate Newton method to perform consistently well in comparison to EM and natural gradient ascent, highlighting its viability as an alternative to either of these methods. At present we have only considered actor type implementations of the approximate Newton method, and the extension to actor-critic methods is a point of future research.

References

[1] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10:251–276, 1998.
[2] M. Azar, V. Gómez, and H. Kappen. Dynamic Policy Programming with Function Approximation. Journal of Machine Learning Research - Proceedings Track, 15:119–127, 2011.
[3] J. Bagnell and J. Schneider. Covariant Policy Search. IJCAI, 18:1019–1024, 2003.
[4] J. Baxter and P. Bartlett. Infinite Horizon Policy Gradient Estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[5] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, second edition, 2000.
[6] D. P. Bertsekas. Approximate Policy Iteration: A Survey and Some New Methods. Research report, Massachusetts Institute of Technology, 2010.
[7] D. P. Bertsekas and S. Ioffe. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming. Research report, Massachusetts Institute of Technology, 1997.
[8] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural Actor-Critic Algorithms. Automatica, 45:2471–2482, 2009.
[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[10] P. Dayan and G. E. Hinton. Using Expectation-Maximization for Reinforcement Learning. Neural Computation, 9:271–278, 1997.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
[12] C. Fahey. Tetris AI, Computers Play Tetris. http://colinfahey.com/tetris/tetris_en.html, 2003.
[13] T. Furmston and D. Barber. Efficient Inference for Markov Control Problems. UAI, 29:221–229, 2011.
[14] P. W. Glynn. Likelihood Ratio Gradient Estimation for Stochastic Systems. Communications of the ACM, 33:75–84, 1990.
[15] E. Greensmith, P. Bartlett, and J. Baxter. Variance Reduction Techniques for Gradient Based Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5:1471–1530, 2004.
[16] S. Kakade. A Natural Policy Gradient. NIPS, 14:1531–1538, 2002.
[17] H. Khalil. Nonlinear Systems. Prentice Hall, 2001.
[18] J. Kober and J. Peters. Policy Search for Motor Primitives in Robotics. Machine Learning, 84(1-2):171–203, 2011.
[19] L. Kocsis and C. Szepesvári. Bandit Based Monte-Carlo Planning. European Conference on Machine Learning (ECML), 17:282–293, 2006.
[20] V. R. Konda and J. N. Tsitsiklis. On Actor-Critic Algorithms. SIAM J. Control Optim., 42(4):1143–1166, 2003.
[21] P. Marbach and J. Tsitsiklis. Simulation-Based Optimisation of Markov Reward Processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
[22] N. Meuleau, L. Peshkin, K. Kim, and L. Kaelbling. Learning Finite-State Controllers for Partially Observable Environments. UAI, 15:427–436, 1999.
[23] J. Nocedal and S. Wright. Numerical Optimisation. Springer, 2006.
[24] J. Peters and S. Schaal. Natural Actor-Critic. Neurocomputing, 71(7-9):1180–1190, 2008.
[25] K. Rawlik, M. Toussaint, and S. Vijayakumar. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference. International Conference on Robotics Science and Systems, 2012.
[26] S. Richter, D. Aberdeen, and J. Yu. Natural Actor-Critic for Road Traffic Optimisation. NIPS, 19:1169–1176, 2007.
[27] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 13:1057–1063, 2000.
[28] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic Inference for Solving (PO)MDPs. Research Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.
[29] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2):123–130, 2009.
[30] L. Weaver and N. Tao. The Optimal Reward Baseline for Gradient Based Reinforcement Learning. UAI, 17:538–545, 2001.
[31] R. Williams. Simple Statistical Gradient Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.

Writing the objective function and trajectory distribution directly in terms of the parameter vector then, for any w ∈ W, the objective function takes the form

U(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(a,s;w)}\left[ \gamma^{t-1} R(a,s) \right], \qquad (1)

where we have denoted the parameter space by W and have used the notation p_t(a, s; w) to represent the marginal p(s_t = s, a_t = a; w) of the joint state-action trajectory distribution

p(a_{1:H}, s_{1:H}; w) = \pi(a_H | s_H; w) \left[ \prod_{t=1}^{H-1} p(s_{t+1} | a_t, s_t)\, \pi(a_t | s_t; w) \right] p_1(s_1), \qquad H \in \mathbb{N}. \qquad (2)

Note that the policy is now written in terms of its parametric representation, π(a|s; w).

It is well known that the global optimum of (1) can be obtained through dynamic programming, see e.g. [5]. However, due to various issues, such as prohibitively large state-action spaces or highly non-linear transition dynamics, it is not possible to find the global optimum of (1) in most real-world problems of interest. Instead most research in this area focuses on obtaining approximate solutions, for which there exist numerous techniques, such as approximate dynamic programming methods [6], Monte-Carlo tree search methods [19] and policy search methods, both parametric [27, 21, 16, 18] and non-parametric [2, 25]. This work is focused solely on parametric policy search methods, by which we mean gradient-based methods, such as steepest and natural gradient ascent [23, 1], along with Expectation Maximisation [11], which is a bound optimisation technique from the statistics literature. Since their introduction [14, 31, 10, 16] these methods have been the centre of a large amount of research, with much of it focusing on gradient estimation [21, 4], variance reduction techniques [30, 15], function approximation techniques [27, 8, 20] and real-world applications [18, 26]. While steepest gradient ascent has enjoyed some success, it is known to suffer from substantial issues that often make it unattractive in practice, such as slow convergence and susceptibility to poor scaling of the objective function [23]. Various optimisation methods have been introduced as an alternative, most notably natural gradient ascent [16, 24, 3] and Expectation Maximisation [18, 28], which are currently the methods of choice among parametric policy search algorithms. In this paper our primary focus is on the search-direction (in the parameter space) of these two methods.

2 Search Direction Analysis

In this section we will perform a novel analysis of the search-direction of both natural gradient ascent and Expectation Maximisation. In gradient-based algorithms for Markov Decision Processes the update of the policy parameters takes the form

w_{new} = w + \alpha M(w) \nabla_w U(w), \qquad (3)

where α ∈ \mathbb{R}^{+} is the step-size parameter and M(w) is some positive-definite matrix that possibly depends on w. It is well-known that such an update will increase the total expected reward, provided that α is sufficiently small, and this process will converge to a local optimum of (1) provided the step-size sequence is appropriately selected. While EM doesn’t have an update of the form given in (3), we shall see that the algorithm is closely related to such an update. It is convenient for later reference to note that the gradient \nabla_w U(w) can be written in the following form

\nabla_w U(w) = \mathbb{E}_{p_\gamma(z;w) Q(z;w)}\left[ \nabla_w \log \pi(a|s; w) \right], \qquad (4)

where we use the expectation notation E[·] to denote the integral/summation w.r.t. a non-negative function.
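To make the quantities in (1)-(4) concrete, the following sketch estimates the discounted objective and its gradient by Monte Carlo for a small tabular MDP with a softmax (Gibb's) policy, using the standard likelihood-ratio (REINFORCE-style) form of the gradient, which is equivalent to (4). This is not the authors' code: the toy MDP (`P`, `R`, `p1`), the truncation length and all function names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (illustrative assumptions throughout): Monte Carlo estimates of
# the discounted objective U(w) in (1) and of its gradient, written in the
# likelihood-ratio form grad U = E[ sum_t gamma^{t-1} R(z_t) grad_w log p(z_{1:t}; w) ],
# for a tabular MDP with a softmax policy pi(a|s; w) proportional to exp(w[s, a]).

def softmax_policy(w, s):
    z = w[s] - w[s].max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(w, s, a):
    """d/dw log pi(a|s; w) for the tabular softmax policy (same shape as w)."""
    g = np.zeros_like(w)
    g[s] -= softmax_policy(w, s)
    g[s, a] += 1.0
    return g

def sample_episode(w, P, R, p1, rng, max_T=200):
    """One truncated trajectory; returns a list of (state, action, reward)."""
    S, A = w.shape
    s = rng.choice(S, p=p1)
    traj = []
    for _ in range(max_T):
        a = rng.choice(A, p=softmax_policy(w, s))
        traj.append((s, a, R[s, a]))
        s = rng.choice(S, p=P[s, a])
    return traj

def reinforce_estimates(w, P, R, p1, gamma, n_episodes=500, seed=0):
    """Sample-based estimates of U(w) and grad_w U(w)."""
    rng = np.random.default_rng(seed)
    U, g = 0.0, np.zeros_like(w)
    for _ in range(n_episodes):
        score = np.zeros_like(w)            # grad_w log p(z_{1:t}; w), grows with t
        for t, (s, a, r) in enumerate(sample_episode(w, P, R, p1, rng)):
            score += grad_log_pi(w, s, a)
            U += gamma**t * r               # gamma^{t-1} with 1-based time indexing
            g += gamma**t * r * score
    return U / n_episodes, g / n_episodes

# Tiny synthetic MDP (2 states, 2 actions), purely for illustration.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.3, 0.7]]])   # P[s, a] = next-state distribution
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]
p1 = np.array([1.0, 0.0])
w = np.zeros((2, 2))
U_hat, grad_hat = reinforce_estimates(w, P, R, p1, gamma=0.95, n_episodes=200)
```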
The term pγ (z; w) is a geometric weighted average of state-action occupancy marginals given by ∞ γ t−1 pt (z; w), pγ (z; w) = t=1 while the term Q(z; w) is referred to as the state-action value function and is equal to the total expected future reward from the current time-point onwards, given the current state-action pair, z, 2 and parameter vector, w, i.e. ∞ Ept (z ;w) γ t−1 R(z ) z1 = z . Q(z; w) = t=1 This is a standard result and due to reasons of space we have omitted the details, but see e.g. [27] or section(6.1) of the supplementary material for more details. An immediate issue concerning updates of the form (3) is in the selection of the ‘optimal’ choice of the matrix M(w), which clearly depends on the sense in which optimality is defined. There are numerous reasonable properties that are desirable of such an update, including the numerical stability and computational complexity of the parameter update, as well as the rate of convergence of the overall algorithm resulting from these updates. While all reasonable criteria the rate of convergence is of such importance in an optimisation algorithm that it is a logical starting point in our analysis. For this reason we concern ourselves with relating these two parametric policy search algorithms to the Newton method, which has the highly desirable property of having a quadratic rate of convergence in the vicinity of a local optimum. The Newton method is well-known to suffer from problems that make it either infeasible or unattractive in practice, but in terms of forming a basis for theoretical comparisons it is a logical starting point. We shall discuss some of the issues with the Newton method in more detail in section(3). In the Newton method the matrix M(w) is set to the negative inverse Hessian, i.e. M(w) = −H−1 (w), where H(w) = w T w U (w). where we have denoted the Hessian by H(w). Using methods similar to those used to calculate the gradient, it can be shown that the Hessian takes the form H(w) = H1 (w) + H2 (w), (5) where ∞ Ep(z1:t ;w) γ t−1 R(zt ) w Ep(z1:t ;w) γ t−1 R(zt ) H1 (w) = w log p(z1:t ; w) T w log p(z1:t ; w) , (6) t=1 ∞ H2 (w) = T w log p(z1:t ; w) . (7) t=1 We have omitted the details of the derivation, but these can be found in section(6.2) of the supplementary material, with a similar derivation of a sample-based estimate of the Hessian given in [4]. 2.1 Natural Gradient Ascent To overcome some of the issues that can hinder steepest gradient ascent an alternative, natural gradient, was introduced in [16]. Natural gradient ascent techniques originated in the neural network and blind source separation literature, see e.g. [1], and take the perspective that the parameter space has a Riemannian manifold structure, as opposed to a Euclidean structure. Deriving the steepest ascent direction of U (w) w.r.t. a local norm defined on this parameter manifold (as opposed to w.r.t. the Euclidean norm, which is the case in steepest gradient ascent) results in natural gradient ascent. We denote the quadratic form that induces this local norm on the parameter manifold by G(w), i.e. d(w)2 = wT G(w)w. The derivation for natural gradient ascent is well-known, see e.g. [1], and its application to the objective (1) results in a parameter update of the form wk+1 = wk + αk G−1 (wk ) w U (wk ). −1 In terms of (3) this corresponds to M(w) = G (w). In the case of MDPs the most commonly used local norm is given by the Fisher information matrix of the trajectory distribution, see e.g. 
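Continuing the sketch above (and reusing `softmax_policy`, `grad_log_pi` and `sample_episode` from it), the decomposition H(w) = H1(w) + H2(w) in (5)-(7) can be estimated from sampled trajectories by accumulating the score vector and the Hessian of the log-trajectory distribution along each rollout; a related sample-based estimator is discussed in [4]. The helper `hess_log_pi` and the estimator below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Sketch: sample-based estimate of H(w) = H1(w) + H2(w) in (5)-(7). H1 accumulates
# outer products of the score grad_w log p(z_{1:t}; w); H2 accumulates the Hessian of
# the log-trajectory distribution; both are weighted by gamma^{t-1} R(z_t).

def hess_log_pi(w, s, a):
    """Hessian of log pi(a|s; w) w.r.t. the flattened parameters (softmax policy)."""
    S, A = w.shape
    p = softmax_policy(w, s)
    H = np.zeros((S * A, S * A))
    idx = slice(s * A, (s + 1) * A)
    H[idx, idx] = -(np.diag(p) - np.outer(p, p))   # independent of the action a
    return H

def hessian_decomposition(w, P, R, p1, gamma, n_episodes=500, seed=0):
    n_w = w.size
    rng = np.random.default_rng(seed)
    H1 = np.zeros((n_w, n_w))
    H2 = np.zeros((n_w, n_w))
    for _ in range(n_episodes):
        score = np.zeros(n_w)            # grad_w log p(z_{1:t}; w)
        hess = np.zeros((n_w, n_w))      # hess_w log p(z_{1:t}; w)
        for t, (s, a, r) in enumerate(sample_episode(w, P, R, p1, rng)):
            score += grad_log_pi(w, s, a).ravel()
            hess += hess_log_pi(w, s, a)
            H1 += gamma**t * r * np.outer(score, score)   # estimate of (6)
            H2 += gamma**t * r * hess                     # estimate of (7)
    return H1 / n_episodes, H2 / n_episodes
```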
[3, 24], and due to the Markovian structure of the dynamics it is given by G(w) = −Epγ (z;w) w T w log π(a|s; w) . (8) We note that there is an alternate, but equivalent, form of writing the Fisher information matrix, see e.g. [24], but we do not use it in this work. 3 In order to relate natural gradient ascent to the Newton method we first rewrite the matrix (7) into the following form H2 (w) = Epγ (z;w)Q(z;w) w T w log π(a|s; w) . (9) For reasons of space the details of this reformulation of (7) are left to section(6.2) of the supplementary material. Comparing the Fisher information matrix (8) with the form of H2 (w) given in (9) it is clear that natural gradient ascent has a relationship with the approximate Newton method that uses H2 (w) in place of H(w). In terms of (3) this approximate Newton method corresponds to setting −1 M(w) = −H2 (w). In particular it can be seen that the difference between the two methods lies in the non-negative function w.r.t. which the expectation is taken in (8) and (9). (It also appears that there is a difference in sign, but observing the form of M(w) for each algorithm shows that this is not the case.) In the Fisher information matrix the expectation is taken w.r.t. to the geometrically weighted summation of state-action occupancy marginals of the trajectory distribution, while in H2 (w) there is an additional weighting from the state-action value function. Hence, H2 (w) incorporates information about the reward structure of the objective function, whereas the Fisher information matrix does not, and so it will generally contain more information about the curvature of the objective function. 2.2 Expectation Maximisation The Expectation Maximisation algorithm, or EM-algorithm, is a powerful optimisation technique from the statistics literature, see e.g. [11], that has recently been the centre of much research in the planning and reinforcement learning communities, see e.g. [10, 28, 18]. A quantity of central importance in the EM-algorithm for MDPs is the following lower-bound on the log-objective log U (w) ≥ Hentropy (q(z1:t , t)) + Eq(z1:t ,t) log γ t−1 R(zt )p(z1:t ; w) , (10) where Hentropy is the entropy function and q(z1:t , t) is known as the ‘variational distribution’. Further details of the EM-algorithm for MDPs and a derivation of (10) are given in section(6.3) of the supplementary material or can be found in e.g. [18, 28]. The parameter update of the EM-algorithm is given by the maximum (w.r.t. w) of the ‘energy’ term, Q(w, wk ) = Epγ (z;wk )Q(z;wk ) log π(a|s; w) . (11) Note that Q is a two-parameter function, where the first parameter occurs inside the expectation and the second parameter defines the non-negative function w.r.t. the expectation is taken. This decoupling allows the maximisation over w to be performed explicitly in many cases of interest. For example, when the log-policy is quadratic in w the maximisation problems is equivalent to a least-squares problem and the optimum can be found through solving a linear system of equations. It has previously been noted, again see e.g. [18], that the parameter update of steepest gradient ascent and the EM-algorithm can be related through this ‘energy’ term. In particular the gradient (4) evaluated at wk can also be written as follows w|w=wk U (w) = 10 w|w=wk Q(w, wk ), where 10 we use the notation w to denote the first derivative w.r.t. the first parameter, while the update of the EM-algorithm is given by wk+1 = argmaxw∈W Q(w, wk ). 
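The contrast between (8) and (9), the same expected log-policy Hessian taken w.r.t. p_γ(z; w) for the Fisher matrix and w.r.t. p_γ(z; w)Q(z; w) for H2(w), can be made explicit with a sample-based estimator that reuses the helpers from the sketches above. The discounted reward-to-go serves as a Monte Carlo stand-in for Q(z; w); everything here is an illustrative sketch, not the authors' estimator.

```python
import numpy as np

# Sketch (reuses sample_episode and hess_log_pi from the sketches above): joint
# sample-based estimates of the Fisher information matrix G(w) in (8) and of the
# approximate Hessian H2(w) in the reweighted form (9). The only difference is the
# extra Q(z; w) weighting in H2, estimated here by the discounted reward-to-go.

def fisher_and_H2(w, P, R, p1, gamma, n_episodes=500, seed=0):
    n_w = w.size
    rng = np.random.default_rng(seed)
    G = np.zeros((n_w, n_w))      # Fisher information, eq. (8)
    H2 = np.zeros((n_w, n_w))     # approximate Hessian, eq. (9)
    for _ in range(n_episodes):
        traj = sample_episode(w, P, R, p1, rng)
        rewards = np.array([r for (_, _, r) in traj])
        # discounted reward-to-go: a Monte Carlo estimate of Q(z_t; w)
        Q = np.array([np.sum(rewards[t:] * gamma ** np.arange(len(rewards) - t))
                      for t in range(len(rewards))])
        for t, (s, a, _) in enumerate(traj):
            Hsa = hess_log_pi(w, s, a)
            G += -gamma**t * Hsa           # expectation w.r.t. p_gamma(z; w)
            H2 += gamma**t * Q[t] * Hsa    # additional weighting by Q(z; w)
    return G / n_episodes, H2 / n_episodes

# With a gradient estimate grad_hat (as in the first sketch), the two resulting
# search directions would be (again only a sketch):
#   natural gradient:     np.linalg.solve(G_hat, grad_hat.ravel())
#   approximate Newton:  -np.linalg.solve(H2_hat, grad_hat.ravel())
```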
In other words, steepest gradient ascent moves in the direction that most rapidly increases Q(w, wk ) w.r.t. the first variable, while the EM-algorithm maximises Q(w, wk ) w.r.t. the first variable. While this relationship is true, it is also quite a negative result. It states that in situations where it is not possible to explicitly perform the maximisation over w in (11) then the alternative, in terms of the EM-algorithm, is this generalised EM-algorithm, which is equivalent to steepest gradient ascent. Considering that algorithms such as EM are typically considered because of the negative aspects related to steepest gradient ascent this is an undesirable alternative. It is possible to find the optimum of (11) numerically, but this is also undesirable as it results in a double-loop algorithm that could be computationally expensive. Finally, this result provides no insight into the behaviour of the EM-algorithm, in terms of the direction of its parameter update, when the maximisation over w in (11) can be performed explicitly. Instead we provide the following result, which shows that the step-direction of the EM-algorithm has an underlying relationship with the Newton method. In particular we show that, under suitable 4 regularity conditions, the direction of the EM-update, i.e. wk+1 − wk , is the same, up to first order, as the direction of an approximate Newton method that uses H2 (w) in place of H(w). Theorem 1. Suppose we are given a Markov Decision Process with objective (1) and Markovian trajectory distribution (2). Consider the update of the parameter through Expectation Maximisation at the k th iteration of the algorithm, i.e. wk+1 = argmaxw∈W Q(w, wk ). Provided that Q(w, wk ) is twice continuously differentiable in the first parameter we have that −1 wk+1 − wk = −H2 (wk ) w|w=wk U (w) + O( wk+1 − wk 2 ). (12) Additionally, in the case where the log-policy is quadratic the relation to the approximate Newton method is exact, i.e. the second term on the r.h.s. (12) is zero. Proof. The idea of the proof is simple and only involves performing a Taylor expansion of 10 w Q(w, wk ). As Q is assumed to be twice continuously differentiable in the first component this Taylor expansion is possible and gives 10 w Q(wk+1 , wk ) = 10 w Q(wk , wk ) + 20 w Q(wk , wk )(wk+1 − wk ) + O( wk+1 − wk 2 ). (13) As wk+1 = argmaxw∈W Q(w, wk ) it follows that 10 Q(wk+1 , wk ) = 0. This means that, upon w ignoring higher order terms in wk+1 − wk , the Taylor expansion (13) can be rewritten into the form wk+1 − wk = − 20 −1 w Q(wk , wk ) 10 w Q(wk , wk ). (14) 10 = The proof is completed by observing that w|w=wk U (w) and w Q(wk , wk ) 20 Q(wk , wk ) = H2 (wk ). The second statement follows because in the case where the log-policy w is quadratic the higher order terms in the Taylor expansion vanish. 2.3 Summary In this section we have provided a novel analysis of both natural gradient ascent and Expectation Maximisation when applied to the MDP framework. Previously, while both of these algorithms have proved popular methods for MDP optimisation, there has been little understanding of them in terms of their search-direction in the parameter space or their relation to the Newton method. Firstly, our analysis shows that the Fisher information matrix, which is used in natural gradient ascent, is similar to H2 (w) in (5) with the exception that the information about the reward structure of the problem is not contained in the Fisher information matrix, while such information is contained in H2 (w). 
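The quadratic-log-policy case of Theorem 1 above is straightforward to check numerically: for a Gaussian policy with a linearly parameterised mean, the EM update (a weighted least-squares maximisation of the 'energy' term (11)) coincides exactly with the approximate Newton step that uses H2(w) and a unit step-size. The weights c_i in the sketch below are synthetic stand-ins for samples drawn w.r.t. p_γ(z; w_k)Q(z; w_k); all names are illustrative, not the authors' code.

```python
import numpy as np

# Sketch of Theorem 1 in the exact (quadratic log-policy) case: for
# pi(a|s; w) = N(a | phi(s)^T w, sigma^2), the EM M-step equals the approximate
# Newton update w_k - H2(w_k)^{-1} grad U(w_k). Synthetic weights c_i stand in
# for the non-negative function p_gamma(z; w_k) Q(z; w_k) of (11).

rng = np.random.default_rng(1)
n, d, sigma2 = 200, 5, 0.3
Phi = rng.normal(size=(n, d))     # features phi(s_i)
a = rng.normal(size=n)            # sampled actions a_i
c = rng.random(n)                 # non-negative weights, stand-ins for p_gamma * Q
w_k = rng.normal(size=d)          # current parameter vector

# EM update: argmax_w sum_i c_i log pi(a_i | s_i; w), a weighted least-squares problem
A_mat = Phi.T @ (c[:, None] * Phi)
w_em = np.linalg.solve(A_mat, Phi.T @ (c * a))

# Approximate Newton update with H2 in place of the full Hessian
grad = Phi.T @ (c * (a - Phi @ w_k)) / sigma2   # sum_i c_i grad_w log pi, at w_k
H2 = -A_mat / sigma2                            # sum_i c_i hess_w log pi (constant in w)
w_newton = w_k - np.linalg.solve(H2, grad)

print(np.max(np.abs(w_em - w_newton)))          # agreement up to round-off (~1e-14)
```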
Additionally we have shown that the step-direction of the EM-algorithm is, up to first order, an approximate Newton method that uses H2 (w) in place of H(w) and employs a constant step-size of one. 3 An Approximate Newton Method −1 A natural follow on from the analysis in section(2) is the consideration of using M(w) = −H2 (w) in (3), a method we call the full approximate Newton method from this point onwards. In this section we show that this method has many desirable properties that make it an attractive alternative to other parametric policy search methods. Additionally, denoting the diagonal matrix formed from the diagonal elements of H2 (w) by D2 (w), we shall also consider the method that uses M(w) = −1 −D2 (w) in (3). We call this second method the diagonal approximate Newton method. Recall that in (3) it is necessary that M(w) is positive-definite (in the Newton method this corresponds to requiring the Hessian to be negative-definite) to ensure an increase of the objective. In general the objective (1) is not concave, which means that the Hessian will not be negative-definite over the entire parameter space. In such cases the Newton method can actually lower the objective and this is an undesirable aspect of the Newton method. An attractive property of the approximate Hessian, H2 (w), is that it is always negative-definite when the policy is log–concave in the policy parameters. This fact follows from the observation that in such cases H2 (w) is a non-negative mixture of negative-definite matrices, which again is negative-definite [9]. Additionally, the diagonal 5 terms of a negative-definite matrix are negative and so D2 (w) is also negative-definite when the controller is log-concave. To motivate this result we now briefly consider some widely used policies that are either log-concave or blockwise log-concave. Firstly, consider the Gibb’s policy, π(a|s; w) ∝ exp wT φ(a, s), where φ(a, s) ∈ Rnw is a feature vector. This policy is widely used in discrete systems and is logconcave in w, which can be seen from the fact that log π(a|s; w) is the sum of a linear term and a negative log-sum-exp term, both of which are concave [9]. In systems with a continuous stateaction space a common choice of controller is π(a|s; wmean , Σ) = N (a|Kφ(s) + m, Σ(s)), where wmean = {K, m} and φ(s) ∈ Rnw is a feature vector. The notation Σ(s) is used because there are cases where is it beneficial to have state dependent noise in the controller. This controller is not jointly log-concave in wmean and Σ, but it is blockwise log-concave in wmean and Σ−1 . In terms of wmean the log-policy is quadratic and the coefficient matrix of the quadratic term is negative-definite. In terms of Σ−1 the log-policy consists of a linear term and a log-determinant term, both of which are concave. In terms of evaluating the search direction it is clear from the forms of D2 (w) and H2 (w) that many of the pre-existing gradient evaluation techniques can be extended to the approximate Newton framework with little difficulty. In particular, gradient evaluation requires calculating the expectation of the derivative of the log-policy w.r.t. pγ (z; w)Q(z; w). In terms of inference the only additional calculation necessary to implement either the full or diagonal approximate Newton methods is the calculation of the expectation (w.r.t. to the same function) of the Hessian of the log-policy, or its diagonal terms. 
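As a concrete check of the log-concavity argument above, the sketch below builds the log-policy Hessian of a Gibb's policy for a single state, verifies that it has no positive eigenvalues, and forms the element-wise inverse used by the diagonal approximate Newton method. The feature matrix and gradient are synthetic placeholders, not quantities from the paper.

```python
import numpy as np

# Sketch: for the Gibb's policy log pi(a|s; w) = w^T phi(a, s) - log sum_b exp(w^T phi(b, s)),
# the Hessian w.r.t. w is -Phi^T (diag(p) - p p^T) Phi, which is negative semi-definite,
# so H2(w), a non-negative mixture of such terms, is negative (semi-)definite and the
# diagonal matrix D2(w) is trivially inverted. Phi and g below are synthetic.

rng = np.random.default_rng(0)
n_actions, n_params = 6, 4
Phi = rng.normal(size=(n_actions, n_params))    # rows are phi(b, s) for a fixed state s
w = rng.normal(size=n_params)

logits = Phi @ w
p = np.exp(logits - logits.max())
p /= p.sum()

H_logpi = -Phi.T @ (np.diag(p) - np.outer(p, p)) @ Phi   # same for every action a
print(np.linalg.eigvalsh(H_logpi).max() <= 1e-10)        # True: no positive eigenvalues

# Diagonal approximate Newton step: only the diagonal of H2(w) is needed and its
# inversion is element-wise, so the update costs O(n_w) rather than O(n_w^3).
d2 = np.diag(H_logpi)            # stand-in for the diagonal of H2(w)
g = rng.normal(size=n_params)    # stand-in for grad_w U(w)
step = -g / d2
```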
As an example in section(6.5) of the supplementary material we detail the extension of the recurrent state formulation of gradient evaluation in the average reward framework, see e.g. [31], to the approximate Newton method. We use this extension in the Tetris experiment that we consider in section(4). Given ns samples and nw parameters the complexity of these extensions scale as O(ns nw ) for the diagonal approximate Newton method, while it scales as O(ns n2 ) for the w full approximate Newton method. An issue with the Newton method is the inversion of the Hessian matrix, which scales with O(n3 ) in w the worst case. In the standard application of the Newton method this inversion has to be performed at each iteration and in large parameter systems this becomes prohibitively costly. In general H(w) will be dense and no computational savings will be possible when performing this matrix inversion. The same is not true, however, of the matrices D2 (w) and H2 (w). Firstly, as D2 (w) is a diagonal matrix it is trivial to invert. Secondly, there is an immediate source of sparsity that comes from taking the second derivative of the log-trajectory distribution in (7). This property ensures that any (product) sparsity over the control parameters in the log-trajectory distribution will correspond to sparsity in H2 (w). For example, in a partially observable Markov Decision Processes where the policy is modeled through a finite state controller, see e.g. [22], there are three functions to be optimised, namely the initial belief distribution, the belief transition dynamics and the policy. When the parameters of these three functions are independent H2 (w) will be block-diagonal (across the parameters of the three functions) and the matrix inversion can be performed more efficiently by inverting each of the block matrices individually. The reason that H(w) does not exhibit any such sparsity properties is due to the term H1 (w) in (5), which consists of the non-negative mixture of outer-product matrices. The vector in these outer-products is the derivative of the log-trajectory distribution and this typically produces a dense matrix. A undesirable aspect of steepest gradient ascent is that its performance is affected by the choice of basis used to represent the parameter space. An important and desirable property of the Newton method is that it is invariant to non-singular linear (affine) transformations of the parameter space, see e.g. [9]. This means that given a non-singular linear (affine) mapping T ∈ Rnw ×nw , the Newton ˜ update of the objective U (w) = U (T w) is related to the Newton update of the original objective through the same linear (affine) mapping, i.e. v + ∆vnt = T w + ∆wnt , where v = T w and ∆vnt and ∆wnt denote the respective Newton steps. In other words running the Newton method on U (w) ˜ and U (T −1 w) will give identical results. An important point to note is that this desirable property is maintained when using H2 (w) in an approximate Newton method, while using D2 (w) results in a method that is invariant to rescaling of the parameters, i.e. where T is a diagonal matrix with non-zero elements along the diagonal. This can be seen by using the linearity of the expectation operator and a proof of this statement is provided in section(6.4) of the supplementary material. 
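The affine-invariance property discussed above is easy to demonstrate numerically. In the sketch below a synthetic concave quadratic stands in for U(w): Newton-style iterates computed in an affinely transformed parameter space map back exactly onto the original iterates, while steepest-ascent iterates do not. The objective, the transform and the step-size are illustrative assumptions.

```python
import numpy as np

# Sketch: invariance of the Newton update to a non-singular linear reparameterisation
# w = T v, compared with steepest ascent. A concave quadratic f stands in for U(w).

rng = np.random.default_rng(2)
d = 3
B = rng.normal(size=(d, d))
S = -(B @ B.T + np.eye(d))                    # negative-definite stand-in "Hessian"
b = rng.normal(size=d)
f_grad = lambda w: S @ w + b
f_hess = lambda w: S
T = rng.normal(size=(d, d)) + 2 * np.eye(d)   # non-singular transform

# Transformed objective g(v) = f(T v): grad_v = T^T grad_w, hess_v = T^T hess_w T
g_grad = lambda v: T.T @ f_grad(T @ v)
g_hess = lambda v: T.T @ f_hess(T @ v) @ T

def run(step_fn, x0, n=10):
    x = x0.copy()
    for _ in range(n):
        x = step_fn(x)
    return x

alpha = 0.05
newton = lambda grad, hess: (lambda x: x - np.linalg.solve(hess(x), grad(x)))
steepest = lambda grad: (lambda x: x + alpha * grad(x))

w0 = rng.normal(size=d)
v0 = np.linalg.solve(T, w0)                   # same starting point: v0 = T^{-1} w0

print(np.allclose(run(newton(f_grad, f_hess), w0), T @ run(newton(g_grad, g_hess), v0)))  # True
print(np.allclose(run(steepest(f_grad), w0), T @ run(steepest(g_grad), v0)))              # False in general
```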
Figure 1: (a) An empirical illustration of the affine invariance of the approximate Newton method, performed on the two-state MDP of [16]. The plot shows the trace of the policy during training for the two different parameter spaces, where the results of the latter have been mapped back into the original parameter space for comparison. The plot shows the two steepest gradient ascent traces (blue cross and blue circle) and the two traces of the full approximate Newton method (red cross and red circle). (b) Results of the Tetris problem for steepest gradient ascent (black), natural gradient ascent (green), the diagonal approximate Newton method (blue) and the full approximate Newton method (red). [Plots omitted: panel (a) shows the policy trace over (θ1, θ2); panel (b) shows completed lines against training iterations.]

4 Experiments

The first experiment we performed was an empirical illustration that the full approximate Newton method is invariant to linear transformations of the parameter space. We considered the simple two-state example of [16] as it allows us to plot the trace of the policy during training, since the policy has only two parameters. The policy was trained using both steepest gradient ascent and the full approximate Newton method, in both the original and the linearly transformed parameter space. The policy traces of the two algorithms are plotted in figure(1.a). As expected, steepest gradient ascent is affected by such mappings, whilst the full approximate Newton method is invariant to them.

The second experiment was aimed at demonstrating the scalability of the full and diagonal approximate Newton methods to problems with a large state space. We considered the Tetris domain, a popular computer game designed by Alexey Pajitnov in 1985; see [12] for more details. Firstly, we compared the performance of the full and diagonal approximate Newton methods to other parametric policy search methods. Tetris is typically played on a 20 × 10 grid, but due to computational costs we considered a 10 × 10 grid in the experiment. This results in a state space with roughly 7 × 2^100 states. We modelled the policy through a Gibb’s distribution, where we considered a feature vector with the following features: the heights of each column, the difference in heights between adjacent columns, the maximum height and the number of ‘holes’. Under this policy it is not possible to obtain the explicit maximum over w in (11), and so a straightforward application of EM is not possible in this problem. We therefore compared the diagonal and full approximate Newton methods with steepest and natural gradient ascent. Due to reasons of space the exact implementation of the experiment is detailed in section(6.6) of the supplementary material. We ran 100 repetitions of the experiment, each consisting of 100 training iterations, and the mean and standard error of the results are given in figure(1.b). It can be seen that the full approximate Newton method outperforms all of the other methods, while the performance of the diagonal approximate Newton method is comparable to natural gradient ascent. We also ran several training runs of the full approximate Newton method on the full-sized 20 × 10 board and obtained a score in the region of 14,000 completed lines after roughly 40 training iterations.
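The Tetris feature vector described above (column heights, differences between adjacent column heights, maximum height and number of holes) can be computed from a binary board as in the following sketch; the exact scaling and ordering of the features used in the experiment may differ, and the board encoding here is an assumption.

```python
import numpy as np

# Sketch of the Tetris feature map used with the Gibb's policy: column heights,
# absolute differences between adjacent column heights, maximum height and the
# number of 'holes' (empty cells below the top filled cell of a column). The
# board is a binary occupancy grid with row 0 at the top; details are assumptions.

def tetris_features(board):
    rows, cols = board.shape
    heights = np.zeros(cols, dtype=int)
    holes = 0
    for c in range(cols):
        filled = np.flatnonzero(board[:, c])
        if filled.size:
            top = filled[0]
            heights[c] = rows - top
            holes += int(np.sum(board[top:, c] == 0))
    diffs = np.abs(np.diff(heights))
    return np.concatenate([heights, diffs, [heights.max(), holes]])

# Example on the 10 x 10 board used in the experiment. A Gibb's policy would score
# each candidate placement a by w^T phi(board_after(a)) and sample via a softmax.
board = np.zeros((10, 10), dtype=int)
board[9, :] = 1        # a full bottom row ...
board[9, 3] = 0        # ... except column 3 ...
board[8, 3] = 1        # ... which has a block one row up, covering one hole
phi = tetris_features(board)
print(phi.shape, phi[-1])   # (21,) and 1 hole
```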
An approximate dynamic programming based method has previously been applied to the Tetris domain in [7]. The same set of features were used and a score of roughly 4,500 completed lines was obtained after around 6 training iterations, after which the solution deteriorated.

In the third experiment we considered a finite horizon (controlled) linear dynamical system. This allowed the search-directions of the various algorithms to be computed exactly using [13] and removed any issues of approximate inference from the comparison. In particular we considered a 3-link rigid manipulator, linearised through feedback linearisation, see e.g. [17]. This system has a 6-dimensional state space, a 3-dimensional action space and a 22-dimensional parameter space. Further details of the system can be found in section(6.7) of the supplementary material. We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.a). In this experiment the approximate Newton method found substantially better solutions than either steepest gradient ascent, natural gradient ascent or Expectation Maximisation. The superiority of the results in comparison to either steepest or natural gradient ascent can be explained by the fact that H2(w) gives a better estimate of the curvature of the objective function. Expectation Maximisation performed poorly in this experiment, exhibiting sub-linear convergence. Steepest gradient ascent performed 3684 ± 314 training iterations in this experiment which, in comparison to the 203 ± 34 and 310 ± 40 iterations of natural gradient ascent and the approximate Newton method respectively, illustrates the susceptibility of this method to poor scaling.

In the final experiment we considered the synthetic non-linear system of [29]. Full details of the system and the experiment can be found in section(6.8) of the supplementary material. We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.b). Again the approximate Newton method outperforms both steepest and natural gradient ascent. In this example only the mean parameters of the Gaussian controller are optimised, while the parameters of the noise are held fixed, which means that the log-policy is quadratic in the policy parameters. Hence, in this example the EM-algorithm is a particular (less general) version of the approximate Newton method, where a fixed step-size of one is used throughout. The marked difference in performance between the EM-algorithm and the approximate Newton method shows the benefit of being able to tune the step-size sequence.

Figure 2: (a) The normalised total expected reward plotted against training time, in seconds, for the 3-link rigid manipulator. The plot shows the results for steepest gradient ascent (black), EM (blue), natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the mean and standard error of the results. (b) The normalised total expected reward plotted against training iterations for the synthetic non-linear system of [29]. The plot shows the results for EM (blue), steepest gradient ascent (black), natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the mean and standard error of the results. [Plots omitted: both panels show the normalised total expected reward.]
In this experiment we considered five different step-size sequences for the approximate Newton method, and all of them obtained better results than the EM-algorithm. In contrast, only one of the seven step-size sequences considered for steepest and natural gradient ascent outperformed the EM-algorithm.

5 Conclusion

The contributions of this paper are twofold. Firstly, we have given a novel analysis of Expectation Maximisation and natural gradient ascent when applied to the MDP framework, showing that both have close connections to an approximate Newton method. Secondly, prompted by this analysis, we have considered the direct application of this approximate Newton method to the optimisation of MDPs, showing that it has numerous desirable properties that are not present in the naive application of the Newton method. In terms of empirical performance we have found the approximate Newton method to perform consistently well in comparison to EM and natural gradient ascent, highlighting its viability as an alternative to either of these methods. At present we have only considered actor-type implementations of the approximate Newton method; the extension to actor-critic methods is a point of future research.

References

[1] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10:251–276, 1998.
[2] M. Azar, V. Gómez, and H. Kappen. Dynamic Policy Programming with Function Approximation. Journal of Machine Learning Research - Proceedings Track, 15:119–127, 2011.
[3] J. Bagnell and J. Schneider. Covariant Policy Search. IJCAI, 18:1019–1024, 2003.
[4] J. Baxter and P. Bartlett. Infinite-Horizon Policy-Gradient Estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[5] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, second edition, 2000.
[6] D. P. Bertsekas. Approximate Policy Iteration: A Survey and Some New Methods. Research report, Massachusetts Institute of Technology, 2010.
[7] D. P. Bertsekas and S. Ioffe. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming. Research report, Massachusetts Institute of Technology, 1997.
[8] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural Actor-Critic Algorithms. Automatica, 45:2471–2482, 2009.
[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[10] P. Dayan and G. E. Hinton. Using Expectation-Maximization for Reinforcement Learning. Neural Computation, 9:271–278, 1997.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
[12] C. Fahey. Tetris AI, Computers Play Tetris. http://colinfahey.com/tetris/tetris_en.html, 2003.
[13] T. Furmston and D. Barber. Efficient Inference for Markov Control Problems. UAI, 29:221–229, 2011.
[14] P. W. Glynn. Likelihood Ratio Gradient Estimation for Stochastic Systems. Communications of the ACM, 33:75–84, 1990.
[15] E. Greensmith, P. Bartlett, and J. Baxter. Variance Reduction Techniques for Gradient Based Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5:1471–1530, 2004.
[16] S. Kakade. A Natural Policy Gradient. NIPS, 14:1531–1538, 2002.
[17] H. Khalil. Nonlinear Systems. Prentice Hall, 2001.
[18] J. Kober and J. Peters. Policy Search for Motor Primitives in Robotics. Machine Learning, 84(1-2):171–203, 2011.
[19] L. Kocsis and C. Szepesvári. Bandit Based Monte-Carlo Planning. European Conference on Machine Learning (ECML), 17:282–293, 2006.
[20] V. R. Konda and J. N. Tsitsiklis. On Actor-Critic Algorithms. SIAM J. Control Optim., 42(4):1143–1166, 2003.
[21] P. Marbach and J. Tsitsiklis. Simulation-Based Optimisation of Markov Reward Processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
[22] N. Meuleau, L. Peshkin, K. Kim, and L. Kaelbling. Learning Finite-State Controllers for Partially Observable Environments. UAI, 15:427–436, 1999.
[23] J. Nocedal and S. Wright. Numerical Optimisation. Springer, 2006.
[24] J. Peters and S. Schaal. Natural Actor-Critic. Neurocomputing, 71(7-9):1180–1190, 2008.
[25] K. Rawlik, M. Toussaint, and S. Vijayakumar. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference. International Conference on Robotics Science and Systems, 2012.
[26] S. Richter, D. Aberdeen, and J. Yu. Natural Actor-Critic for Road Traffic Optimisation. NIPS, 19:1169–1176, 2007.
[27] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 13:1057–1063, 2000.
[28] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic Inference for Solving (PO)MDPs. Research Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.
[29] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2):123–130, 2009.
[30] L. Weaver and N. Tao. The Optimal Reward Baseline for Gradient Based Reinforcement Learning. UAI, 17(29):538–545, 2001.
[31] R. Williams. Simple Statistical Gradient Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.

2 0.17221382 60 nips-2012-Bayesian nonparametric models for ranked data

Author: Francois Caron, Yee W. Teh

Abstract: We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books. 1

3 0.16828285 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward functfion. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

4 0.16316524 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

5 0.15443741 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

6 0.1475049 27 nips-2012-A quasi-Newton proximal splitting method

7 0.14446773 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

8 0.14424184 348 nips-2012-Tractable Objectives for Robust Policy Optimization

9 0.13989511 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

10 0.12751178 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

11 0.12306009 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

12 0.1185627 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

13 0.11584995 160 nips-2012-Imitation Learning by Coaching

14 0.11198998 142 nips-2012-Generalization Bounds for Domain Adaptation

15 0.11072367 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

16 0.10743451 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

17 0.10548591 344 nips-2012-Timely Object Recognition

18 0.099445477 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

19 0.099195883 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

20 0.087714024 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.179), (1, -0.267), (2, 0.024), (3, -0.059), (4, -0.04), (5, 0.071), (6, 0.041), (7, -0.057), (8, 0.125), (9, -0.017), (10, -0.035), (11, 0.027), (12, -0.006), (13, 0.024), (14, -0.009), (15, -0.005), (16, -0.008), (17, 0.008), (18, 0.052), (19, -0.003), (20, -0.0), (21, 0.034), (22, 0.09), (23, 0.053), (24, -0.037), (25, -0.003), (26, 0.008), (27, -0.034), (28, -0.119), (29, 0.03), (30, 0.005), (31, 0.056), (32, 0.001), (33, -0.169), (34, 0.011), (35, -0.054), (36, 0.057), (37, -0.043), (38, -0.039), (39, 0.041), (40, 0.008), (41, 0.001), (42, 0.041), (43, 0.047), (44, 0.051), (45, 0.007), (46, -0.0), (47, -0.032), (48, -0.03), (49, -0.067)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94831216 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

Author: Thomas Furmston, David Barber

Abstract: Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being popular methods in this field. In this article we provide a unifying perspective of these two algorithms by showing that their searchdirections in the parameter space are closely related to the search-direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative optimisation method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton’s method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent. 1 Markov Decision Processes Markov Decision Processes (MDPs) are the most commonly used model for the description of sequential decision making processes in a fully observable environment, see e.g. [5]. A MDP is described by the tuple {S, A, H, p1 , p, π, R}, where S and A are sets known respectively as the state and action space, H ∈ N is the planning horizon, which can be either finite or infinite, and {p1 , p, π, R} are functions that are referred as the initial state distribution, transition dynamics, policy (or controller) and the reward function. In general the state and action spaces can be arbitrary sets, but we restrict our attention to either discrete sets or subsets of Rn , where n ∈ N. We use boldface notation to represent a vector and also use the notation z = (s, a) to denote a state-action pair. Given a MDP the trajectory of the agent is determined by the following recursive procedure: Given the agent’s state, st , at a given time-point, t ∈ NH , an action is selected according to the policy, at ∼ π(·|st ); The agent will then transition to a new state according to the transition dynamics, st+1 ∼ p(·|at , st ); this process is iterated sequentially through all of the time-points in the planning horizon, where the state of the initial time-point is determined by the initial state distribution s1 ∼ p1 (·). At each time-point the agent receives a (scalar) reward that is determined by the reward function, where this function depends on the current action and state of the environment. Typically the reward function is assumed to be bounded, but as the objective is linear in the reward function we assume w.l.o.g that it is non-negative. The most widely used objective in the MDP framework is to maximise the total expected reward of the agent over the course of the planning horizon. This objective can take various forms, including an infinite planning horizon, with either discounted or average rewards, or a finite planning horizon. The theoretical contributions of this paper are applicable to all three frameworks, but for notational ease and for reasons of space we concern ourselves with the infinite horizon framework with discounted rewards. In this framework the boundedness of the objective function is ensured by the 1 introduction of a discount factor, γ ∈ [0, 1), which scales the rewards of the various time-points in a geometric manner. 
Writing the objective function and trajectory distribution directly in terms of the parameter vector then, for any w ∈ W, the objective function takes the form ∞ Ept (a,s;w) γ t−1 R(a, s) , U (w) = (1) t=1 where we have denoted the parameter space by W and have used the notation pt (a, s; w) to represent the marginal p(st = s, at = a; w) of the joint state-action trajectory distribution H−1 p(a1:H , s1:H ; w) = π(aH |sH ; w) p(st+1 |at , st )π(at |st ; w) p1 (s1 ), H ∈ N. (2) t=1 Note that the policy is now written in terms of its parametric representation, π(a|s; w). It is well known that the global optimum of (1) can be obtained through dynamic programming, see e.g. [5]. However, due to various issues, such as prohibitively large state-action spaces or highly non-linear transition dynamics, it is not possible to find the global optimum of (1) in most real-world problems of interest. Instead most research in this area focuses on obtaining approximate solutions, for which there exist numerous techniques, such as approximate dynamic programming methods [6], Monte-Carlo tree search methods [19] and policy search methods, both parametric [27, 21, 16, 18] and non-parametric [2, 25]. This work is focused solely on parametric policy search methods, by which we mean gradient-based methods, such as steepest and natural gradient ascent [23, 1], along with Expectation Maximisation [11], which is a bound optimisation technique from the statistics literature. Since their introduction [14, 31, 10, 16] these methods have been the centre of a large amount of research, with much of it focusing on gradient estimation [21, 4], variance reduction techniques [30, 15], function approximation techniques [27, 8, 20] and real-world applications [18, 26]. While steepest gradient ascent has enjoyed some success it is known to suffer from some substantial issues that often make it unattractive in practice, such as slow convergence and susceptibility to poor scaling of the objective function [23]. Various optimisation methods have been introduced as an alternative, most notably natural gradient ascent [16, 24, 3] and Expectation Maximisation [18, 28], which are currently the methods of choice among parametric policy search algorithms. In this paper our primary focus is on the search-direction (in the parameter space) of these two methods. 2 Search Direction Analysis In this section we will perform a novel analysis of the search-direction of both natural gradient ascent and Expectation Maximisation. In gradient-based algorithms of Markov Decision Processes the update of the policy parameters take the form wnew = w + αM(w) w U (w), (3) + where α ∈ R is the step-size parameter and M(w) is some positive-definite matrix that possibly depends on w. It is well-known that such an update will increase the total expected reward, provided that α is sufficiently small, and this process will converge to a local optimum of (1) provided the step-size sequence is appropriately selected. While EM doesn’t have an update of the form given in (3) we shall see that the algorithm is closely related to such an update. It is convenient for later reference to note that the gradient w U (w) can be written in the following form w U (w) = Epγ (z;w)Q(z;w) w log π(a|s; w) , (4) where we use the expectation notation E[·] to denote the integral/summation w.r.t. a non-negative function. 
The term pγ (z; w) is a geometric weighted average of state-action occupancy marginals given by ∞ γ t−1 pt (z; w), pγ (z; w) = t=1 while the term Q(z; w) is referred to as the state-action value function and is equal to the total expected future reward from the current time-point onwards, given the current state-action pair, z, 2 and parameter vector, w, i.e. ∞ Ept (z ;w) γ t−1 R(z ) z1 = z . Q(z; w) = t=1 This is a standard result and due to reasons of space we have omitted the details, but see e.g. [27] or section(6.1) of the supplementary material for more details. An immediate issue concerning updates of the form (3) is in the selection of the ‘optimal’ choice of the matrix M(w), which clearly depends on the sense in which optimality is defined. There are numerous reasonable properties that are desirable of such an update, including the numerical stability and computational complexity of the parameter update, as well as the rate of convergence of the overall algorithm resulting from these updates. While all reasonable criteria the rate of convergence is of such importance in an optimisation algorithm that it is a logical starting point in our analysis. For this reason we concern ourselves with relating these two parametric policy search algorithms to the Newton method, which has the highly desirable property of having a quadratic rate of convergence in the vicinity of a local optimum. The Newton method is well-known to suffer from problems that make it either infeasible or unattractive in practice, but in terms of forming a basis for theoretical comparisons it is a logical starting point. We shall discuss some of the issues with the Newton method in more detail in section(3). In the Newton method the matrix M(w) is set to the negative inverse Hessian, i.e. M(w) = −H−1 (w), where H(w) = w T w U (w). where we have denoted the Hessian by H(w). Using methods similar to those used to calculate the gradient, it can be shown that the Hessian takes the form H(w) = H1 (w) + H2 (w), (5) where ∞ Ep(z1:t ;w) γ t−1 R(zt ) w Ep(z1:t ;w) γ t−1 R(zt ) H1 (w) = w log p(z1:t ; w) T w log p(z1:t ; w) , (6) t=1 ∞ H2 (w) = T w log p(z1:t ; w) . (7) t=1 We have omitted the details of the derivation, but these can be found in section(6.2) of the supplementary material, with a similar derivation of a sample-based estimate of the Hessian given in [4]. 2.1 Natural Gradient Ascent To overcome some of the issues that can hinder steepest gradient ascent an alternative, natural gradient, was introduced in [16]. Natural gradient ascent techniques originated in the neural network and blind source separation literature, see e.g. [1], and take the perspective that the parameter space has a Riemannian manifold structure, as opposed to a Euclidean structure. Deriving the steepest ascent direction of U (w) w.r.t. a local norm defined on this parameter manifold (as opposed to w.r.t. the Euclidean norm, which is the case in steepest gradient ascent) results in natural gradient ascent. We denote the quadratic form that induces this local norm on the parameter manifold by G(w), i.e. d(w)2 = wT G(w)w. The derivation for natural gradient ascent is well-known, see e.g. [1], and its application to the objective (1) results in a parameter update of the form wk+1 = wk + αk G−1 (wk ) w U (wk ). −1 In terms of (3) this corresponds to M(w) = G (w). In the case of MDPs the most commonly used local norm is given by the Fisher information matrix of the trajectory distribution, see e.g. 
[3, 24], and due to the Markovian structure of the dynamics it is given by G(w) = −Epγ (z;w) w T w log π(a|s; w) . (8) We note that there is an alternate, but equivalent, form of writing the Fisher information matrix, see e.g. [24], but we do not use it in this work. 3 In order to relate natural gradient ascent to the Newton method we first rewrite the matrix (7) into the following form H2 (w) = Epγ (z;w)Q(z;w) w T w log π(a|s; w) . (9) For reasons of space the details of this reformulation of (7) are left to section(6.2) of the supplementary material. Comparing the Fisher information matrix (8) with the form of H2 (w) given in (9) it is clear that natural gradient ascent has a relationship with the approximate Newton method that uses H2 (w) in place of H(w). In terms of (3) this approximate Newton method corresponds to setting −1 M(w) = −H2 (w). In particular it can be seen that the difference between the two methods lies in the non-negative function w.r.t. which the expectation is taken in (8) and (9). (It also appears that there is a difference in sign, but observing the form of M(w) for each algorithm shows that this is not the case.) In the Fisher information matrix the expectation is taken w.r.t. to the geometrically weighted summation of state-action occupancy marginals of the trajectory distribution, while in H2 (w) there is an additional weighting from the state-action value function. Hence, H2 (w) incorporates information about the reward structure of the objective function, whereas the Fisher information matrix does not, and so it will generally contain more information about the curvature of the objective function. 2.2 Expectation Maximisation The Expectation Maximisation algorithm, or EM-algorithm, is a powerful optimisation technique from the statistics literature, see e.g. [11], that has recently been the centre of much research in the planning and reinforcement learning communities, see e.g. [10, 28, 18]. A quantity of central importance in the EM-algorithm for MDPs is the following lower-bound on the log-objective log U (w) ≥ Hentropy (q(z1:t , t)) + Eq(z1:t ,t) log γ t−1 R(zt )p(z1:t ; w) , (10) where Hentropy is the entropy function and q(z1:t , t) is known as the ‘variational distribution’. Further details of the EM-algorithm for MDPs and a derivation of (10) are given in section(6.3) of the supplementary material or can be found in e.g. [18, 28]. The parameter update of the EM-algorithm is given by the maximum (w.r.t. w) of the ‘energy’ term, Q(w, wk ) = Epγ (z;wk )Q(z;wk ) log π(a|s; w) . (11) Note that Q is a two-parameter function, where the first parameter occurs inside the expectation and the second parameter defines the non-negative function w.r.t. the expectation is taken. This decoupling allows the maximisation over w to be performed explicitly in many cases of interest. For example, when the log-policy is quadratic in w the maximisation problems is equivalent to a least-squares problem and the optimum can be found through solving a linear system of equations. It has previously been noted, again see e.g. [18], that the parameter update of steepest gradient ascent and the EM-algorithm can be related through this ‘energy’ term. In particular the gradient (4) evaluated at wk can also be written as follows w|w=wk U (w) = 10 w|w=wk Q(w, wk ), where 10 we use the notation w to denote the first derivative w.r.t. the first parameter, while the update of the EM-algorithm is given by wk+1 = argmaxw∈W Q(w, wk ). 
In other words, steepest gradient ascent moves in the direction that most rapidly increases Q(w, wk ) w.r.t. the first variable, while the EM-algorithm maximises Q(w, wk ) w.r.t. the first variable. While this relationship is true, it is also quite a negative result. It states that in situations where it is not possible to explicitly perform the maximisation over w in (11) then the alternative, in terms of the EM-algorithm, is this generalised EM-algorithm, which is equivalent to steepest gradient ascent. Considering that algorithms such as EM are typically considered because of the negative aspects related to steepest gradient ascent this is an undesirable alternative. It is possible to find the optimum of (11) numerically, but this is also undesirable as it results in a double-loop algorithm that could be computationally expensive. Finally, this result provides no insight into the behaviour of the EM-algorithm, in terms of the direction of its parameter update, when the maximisation over w in (11) can be performed explicitly. Instead we provide the following result, which shows that the step-direction of the EM-algorithm has an underlying relationship with the Newton method. In particular we show that, under suitable 4 regularity conditions, the direction of the EM-update, i.e. wk+1 − wk , is the same, up to first order, as the direction of an approximate Newton method that uses H2 (w) in place of H(w). Theorem 1. Suppose we are given a Markov Decision Process with objective (1) and Markovian trajectory distribution (2). Consider the update of the parameter through Expectation Maximisation at the k th iteration of the algorithm, i.e. wk+1 = argmaxw∈W Q(w, wk ). Provided that Q(w, wk ) is twice continuously differentiable in the first parameter we have that −1 wk+1 − wk = −H2 (wk ) w|w=wk U (w) + O( wk+1 − wk 2 ). (12) Additionally, in the case where the log-policy is quadratic the relation to the approximate Newton method is exact, i.e. the second term on the r.h.s. (12) is zero. Proof. The idea of the proof is simple and only involves performing a Taylor expansion of 10 w Q(w, wk ). As Q is assumed to be twice continuously differentiable in the first component this Taylor expansion is possible and gives 10 w Q(wk+1 , wk ) = 10 w Q(wk , wk ) + 20 w Q(wk , wk )(wk+1 − wk ) + O( wk+1 − wk 2 ). (13) As wk+1 = argmaxw∈W Q(w, wk ) it follows that 10 Q(wk+1 , wk ) = 0. This means that, upon w ignoring higher order terms in wk+1 − wk , the Taylor expansion (13) can be rewritten into the form wk+1 − wk = − 20 −1 w Q(wk , wk ) 10 w Q(wk , wk ). (14) 10 = The proof is completed by observing that w|w=wk U (w) and w Q(wk , wk ) 20 Q(wk , wk ) = H2 (wk ). The second statement follows because in the case where the log-policy w is quadratic the higher order terms in the Taylor expansion vanish. 2.3 Summary In this section we have provided a novel analysis of both natural gradient ascent and Expectation Maximisation when applied to the MDP framework. Previously, while both of these algorithms have proved popular methods for MDP optimisation, there has been little understanding of them in terms of their search-direction in the parameter space or their relation to the Newton method. Firstly, our analysis shows that the Fisher information matrix, which is used in natural gradient ascent, is similar to H2 (w) in (5) with the exception that the information about the reward structure of the problem is not contained in the Fisher information matrix, while such information is contained in H2 (w). 
Additionally we have shown that the step-direction of the EM-algorithm is, up to first order, that of an approximate Newton method that uses H_2(w) in place of H(w) and employs a constant step-size of one.

3 An Approximate Newton Method

A natural follow-on from the analysis in section(2) is the consideration of using M(w) = −H_2^{-1}(w) in (3), a method we call the full approximate Newton method from this point onwards. In this section we show that this method has many desirable properties that make it an attractive alternative to other parametric policy search methods. Additionally, denoting the diagonal matrix formed from the diagonal elements of H_2(w) by D_2(w), we shall also consider the method that uses M(w) = −D_2^{-1}(w) in (3). We call this second method the diagonal approximate Newton method.

Recall that in (3) it is necessary that M(w) is positive-definite (in the Newton method this corresponds to requiring the Hessian to be negative-definite) to ensure an increase of the objective. In general the objective (1) is not concave, which means that the Hessian will not be negative-definite over the entire parameter space. In such cases the Newton method can actually lower the objective, and this is an undesirable aspect of the Newton method. An attractive property of the approximate Hessian, H_2(w), is that it is always negative-definite when the policy is log-concave in the policy parameters. This fact follows from the observation that in such cases H_2(w) is a non-negative mixture of negative-definite matrices, which again is negative-definite [9]. Additionally, the diagonal terms of a negative-definite matrix are negative and so D_2(w) is also negative-definite when the controller is log-concave.

To motivate this result we now briefly consider some widely used policies that are either log-concave or blockwise log-concave. Firstly, consider the Gibbs policy, π(a|s; w) ∝ exp(w^T φ(a, s)), where φ(a, s) ∈ R^{n_w} is a feature vector. This policy is widely used in discrete systems and is log-concave in w, which can be seen from the fact that log π(a|s; w) is the sum of a linear term and a negative log-sum-exp term, both of which are concave [9]. In systems with a continuous state-action space a common choice of controller is π(a|s; w_mean, Σ) = N(a | Kφ(s) + m, Σ(s)), where w_mean = {K, m} and φ(s) ∈ R^{n_w} is a feature vector. The notation Σ(s) is used because there are cases where it is beneficial to have state-dependent noise in the controller. This controller is not jointly log-concave in w_mean and Σ, but it is blockwise log-concave in w_mean and Σ^{-1}. In terms of w_mean the log-policy is quadratic and the coefficient matrix of the quadratic term is negative-definite. In terms of Σ^{-1} the log-policy consists of a linear term and a log-determinant term, both of which are concave.

In terms of evaluating the search direction it is clear from the forms of D_2(w) and H_2(w) that many of the pre-existing gradient evaluation techniques can be extended to the approximate Newton framework with little difficulty. In particular, gradient evaluation requires calculating the expectation of the derivative of the log-policy w.r.t. p_γ(z; w)Q(z; w). In terms of inference the only additional calculation necessary to implement either the full or the diagonal approximate Newton method is the calculation of the expectation (w.r.t. the same function) of the Hessian of the log-policy, or of its diagonal terms.
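As a rough sketch of what this involves in practice (illustrative only: the sampling interface, the use of γ^{t−1}Q̂ sample weights and the function names below are our own assumptions, not the implementation used in the paper), the full and diagonal search directions can be estimated from weighted state-action samples as follows.

import numpy as np

def approx_newton_direction(samples, grad_log_pi, hess_log_pi, diagonal=False):
    # samples: iterable of (s, a, weight) triples, where weight is a non-negative
    # estimate of gamma^(t-1) * Q(z; w), so that the weighted sums below
    # approximate expectations w.r.t. p_gamma(z; w)Q(z; w).
    # grad_log_pi(s, a) and hess_log_pi(s, a): gradient and Hessian of
    # log pi(a|s; w) at the current parameter vector (assumed interfaces).
    g, H2 = None, None
    for s, a, weight in samples:
        dlp = grad_log_pi(s, a)
        d2lp = hess_log_pi(s, a)
        if g is None:
            g, H2 = np.zeros_like(dlp), np.zeros_like(d2lp)
        g += weight * dlp     # Monte Carlo estimate of the gradient
        H2 += weight * d2lp   # Monte Carlo estimate of H2(w) in (9)
    if diagonal:
        # diagonal approximate Newton: M(w) = -D2(w)^{-1}, inversion is trivial
        return -g / np.diag(H2)
    # full approximate Newton: M(w) = -H2(w)^{-1}
    return -np.linalg.solve(H2, g)

The parameter update is then w_{k+1} = w_k + α × direction for a chosen step-size α; relative to a purely gradient-based method, only the per-sample Hessians of the log-policy (or their diagonals) are additional, which is consistent with the complexity figures given below.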
As an example, in section(6.5) of the supplementary material we detail the extension of the recurrent state formulation of gradient evaluation in the average reward framework, see e.g. [31], to the approximate Newton method. We use this extension in the Tetris experiment that we consider in section(4). Given n_s samples and n_w parameters the complexity of these extensions scales as O(n_s n_w) for the diagonal approximate Newton method, while it scales as O(n_s n_w^2) for the full approximate Newton method.

An issue with the Newton method is the inversion of the Hessian matrix, which scales as O(n_w^3) in the worst case. In the standard application of the Newton method this inversion has to be performed at each iteration, and in large parameter systems this becomes prohibitively costly. In general H(w) will be dense and no computational savings will be possible when performing this matrix inversion. The same is not true, however, of the matrices D_2(w) and H_2(w). Firstly, as D_2(w) is a diagonal matrix it is trivial to invert. Secondly, there is an immediate source of sparsity that comes from taking the second derivative of the log-trajectory distribution in (7). This property ensures that any (product) sparsity over the control parameters in the log-trajectory distribution will correspond to sparsity in H_2(w). For example, in a partially observable Markov Decision Process where the policy is modelled through a finite state controller, see e.g. [22], there are three functions to be optimised, namely the initial belief distribution, the belief transition dynamics and the policy. When the parameters of these three functions are independent, H_2(w) will be block-diagonal (across the parameters of the three functions) and the matrix inversion can be performed more efficiently by inverting each of the block matrices individually. The reason that H(w) does not exhibit any such sparsity is the term H_1(w) in (5), which consists of a non-negative mixture of outer-product matrices. The vector in these outer-products is the derivative of the log-trajectory distribution, and this typically produces a dense matrix.

An undesirable aspect of steepest gradient ascent is that its performance is affected by the choice of basis used to represent the parameter space. An important and desirable property of the Newton method is that it is invariant to non-singular linear (affine) transformations of the parameter space, see e.g. [9]. This means that given a non-singular linear (affine) mapping T ∈ R^{n_w×n_w}, the Newton update of the objective Ũ(w) = U(Tw) is related to the Newton update of the original objective through the same linear (affine) mapping, i.e. v + Δv_nt = T(w + Δw_nt), where v = Tw and Δv_nt and Δw_nt denote the respective Newton steps. In other words, running the Newton method on U(w) and Ũ(T^{-1}w) will give identical results. An important point to note is that this desirable property is maintained when using H_2(w) in an approximate Newton method, while using D_2(w) results in a method that is invariant to rescaling of the parameters, i.e. where T is a diagonal matrix with non-zero elements along the diagonal. This can be seen by using the linearity of the expectation operator, and a proof of this statement is provided in section(6.4) of the supplementary material.
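As an informal sketch of why the invariance carries over to H_2(w) (our own calculation under the definitions above, not the proof from the supplementary material): writing Ũ(w) = U(Tw), the chain rule and the linearity of the expectation in (9) give

\[
\nabla \tilde{U}(w) = T^{\top} \nabla U(Tw),
\qquad
\tilde{H}_2(w) = T^{\top} H_2(Tw)\, T,
\]

so that the approximate Newton step in the transformed space satisfies

\[
-\tilde{H}_2(w)^{-1} \nabla \tilde{U}(w)
= -T^{-1} H_2(Tw)^{-1} T^{-\top} T^{\top} \nabla U(Tw)
= T^{-1}\!\left( -H_2(Tw)^{-1} \nabla U(Tw) \right),
\]

i.e. the step in the transformed space is the original step mapped through T^{-1}, matching the correspondence v + Δv_nt = T(w + Δw_nt). When T is diagonal, the same calculation restricted to the diagonal entries of the transformed H_2 gives the rescaling invariance of the D_2-based method.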
[Figure 1: panel (a) ‘Policy Trace’ plots θ2 against θ1; panel (b) ‘Tetris Problem’ plots Completed Lines against Training Iterations.]

Figure 1: (a) An empirical illustration of the affine invariance of the approximate Newton method, performed on the two-state MDP of [16]. The plot shows the trace of the policy during training for the two different parameter spaces, where the results of the latter have been mapped back into the original parameter space for comparison. The plot shows the two steepest gradient ascent traces (blue cross and blue circle) and the two traces of the full approximate Newton method (red cross and red circle). (b) Results of the Tetris problem for steepest gradient ascent (black), natural gradient ascent (green), the diagonal approximate Newton method (blue) and the full approximate Newton method (red).

4 Experiments

The first experiment we performed was an empirical illustration that the full approximate Newton method is invariant to linear transformations of the parameter space. We considered the simple two-state example of [16] as it allows us to plot the trace of the policy during training, since the policy has only two parameters. The policy was trained using both steepest gradient ascent and the full approximate Newton method, in both the original and the linearly transformed parameter space. The policy traces of the two algorithms are plotted in figure(1.a). As expected, steepest gradient ascent is affected by such mappings, whilst the full approximate Newton method is invariant to them.

The second experiment was aimed at demonstrating the scalability of the full and diagonal approximate Newton methods to problems with a large state space. We considered the Tetris domain, a popular computer game designed by Alexey Pajitnov in 1985; see [12] for more details. Firstly, we compared the performance of the full and diagonal approximate Newton methods to other parametric policy search methods. Tetris is typically played on a 20 × 10 grid, but due to computational costs we considered a 10 × 10 grid in the experiment. This results in a state space with roughly 7 × 2^100 states. We modelled the policy through a Gibbs distribution, where we considered a feature vector with the following features: the heights of each column, the difference in heights between adjacent columns, the maximum height and the number of ‘holes’. Under this policy it is not possible to obtain the explicit maximum over w in (11), and so a straightforward application of EM is not possible in this problem. We therefore compared the diagonal and full approximate Newton methods with steepest and natural gradient ascent. For reasons of space the exact implementation of the experiment is detailed in section(6.6) of the supplementary material. We ran 100 repetitions of the experiment, each consisting of 100 training iterations, and the mean and standard error of the results are given in figure(1.b). It can be seen that the full approximate Newton method outperforms all of the other methods, while the performance of the diagonal approximate Newton method is comparable to natural gradient ascent. We also ran several training runs of the full approximate Newton method on the full-sized 20 × 10 board and were able to obtain a score in the region of 14,000 completed lines, obtained after roughly 40 training iterations. An approximate dynamic programming based method has previously been applied to the Tetris domain in [7].
The same set of features was used and a score of roughly 4,500 completed lines was obtained after around 6 training iterations, after which the solution deteriorated.

In the third experiment we considered a finite horizon (controlled) linear dynamical system. This allowed the search-directions of the various algorithms to be computed exactly using [13] and removed any issues of approximate inference from the comparison. In particular we considered a 3-link rigid manipulator, linearised through feedback linearisation, see e.g. [17]. This system has a 6-dimensional state space, a 3-dimensional action space and a 22-dimensional parameter space. Further details of the system can be found in section(6.7) of the supplementary material.

[Figure 2: panel (a) ‘Model-Based Linear System’ plots Normalised Total Expected Reward against Training Time; panel (b) ‘Model-Free Non-Linear System’ plots Normalised Total Expected Reward against Training Iterations.]

Figure 2: (a) The normalised total expected reward plotted against training time, in seconds, for the 3-link rigid manipulator. The plot shows the results for steepest gradient ascent (black), EM (blue), natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the mean and standard error of the results. (b) The normalised total expected reward plotted against training iterations for the synthetic non-linear system of [29]. The plot shows the results for EM (blue), steepest gradient ascent (black), natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the mean and standard error of the results.

We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.a). In this experiment the approximate Newton method found substantially better solutions than steepest gradient ascent, natural gradient ascent or Expectation Maximisation. The superiority of the results in comparison to either steepest or natural gradient ascent can be explained by the fact that H_2(w) gives a better estimate of the curvature of the objective function. Expectation Maximisation performed poorly in this experiment, exhibiting sub-linear convergence. Steepest gradient ascent performed 3684 ± 314 training iterations in this experiment which, in comparison to the 203 ± 34 and 310 ± 40 iterations of natural gradient ascent and the approximate Newton method respectively, illustrates the susceptibility of this method to poor scaling.

In the final experiment we considered the synthetic non-linear system of [29]. Full details of the system and the experiment can be found in section(6.8) of the supplementary material. We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.b). Again the approximate Newton method outperforms both steepest and natural gradient ascent. In this example only the mean parameters of the Gaussian controller are optimised, while the parameters of the noise are held fixed, which means that the log-policy is quadratic in the policy parameters. Hence, in this example the EM-algorithm is a particular (less general) version of the approximate Newton method, where a fixed step-size of one is used throughout. The marked difference in performance between the EM-algorithm and the approximate Newton method shows the benefit of being able to tune the step-size sequence.
In this experiment we considered five different step-size sequences for the approximate Newton method and all of them obtained superior results to the EM-algorithm. In contrast, only one of the seven step-size sequences considered for steepest and natural gradient ascent outperformed the EM-algorithm.

5 Conclusion

The contributions of this paper are twofold: firstly, we have given a novel analysis of Expectation Maximisation and natural gradient ascent when applied to the MDP framework, showing that both have close connections to an approximate Newton method; secondly, prompted by this analysis we have considered the direct application of this approximate Newton method to the optimisation of MDPs, showing that it has numerous desirable properties that are not present in the naive application of the Newton method. In terms of empirical performance we have found the approximate Newton method to perform consistently well in comparison to EM and natural gradient ascent, highlighting its viability as an alternative to either of these methods. At present we have only considered actor-type implementations of the approximate Newton method, and the extension to actor-critic methods is a point of future research.

References

[1] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10:251–276, 1998.
[2] M. Azar, V. Gómez, and H. Kappen. Dynamic Policy Programming with Function Approximation. Journal of Machine Learning Research - Proceedings Track, 15:119–127, 2011.
[3] J. Bagnell and J. Schneider. Covariant Policy Search. IJCAI, 18:1019–1024, 2003.
[4] J. Baxter and P. Bartlett. Infinite Horizon Policy Gradient Estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[5] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, second edition, 2000.
[6] D. P. Bertsekas. Approximate Policy Iteration: A Survey and Some New Methods. Research report, Massachusetts Institute of Technology, 2010.
[7] D. P. Bertsekas and S. Ioffe. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming. Research report, Massachusetts Institute of Technology, 1997.
[8] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural Actor-Critic Algorithms. Automatica, 45:2471–2482, 2009.
[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[10] P. Dayan and G. E. Hinton. Using Expectation-Maximization for Reinforcement Learning. Neural Computation, 9:271–278, 1997.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
[12] C. Fahey. Tetris AI, Computers Play Tetris. http://colinfahey.com/tetris/tetris_en.html, 2003.
[13] T. Furmston and D. Barber. Efficient Inference for Markov Control Problems. UAI, 29:221–229, 2011.
[14] P. W. Glynn. Likelihood Ratio Gradient Estimation for Stochastic Systems. Communications of the ACM, 33:75–84, 1990.
[15] E. Greensmith, P. Bartlett, and J. Baxter. Variance Reduction Techniques for Gradient Based Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5:1471–1530, 2004.
[16] S. Kakade. A Natural Policy Gradient. NIPS, 14:1531–1538, 2002.
[17] H. Khalil. Nonlinear Systems. Prentice Hall, 2001.
[18] J. Kober and J. Peters. Policy Search for Motor Primitives in Robotics. Machine Learning, 84(1-2):171–203, 2011.
[19] L. Kocsis and C. Szepesvári. Bandit Based Monte-Carlo Planning. European Conference on Machine Learning (ECML), 17:282–293, 2006.
[20] V. R. Konda and J. N. Tsitsiklis. On Actor-Critic Algorithms. SIAM J. Control Optim., 42(4):1143–1166, 2003.
[21] P. Marbach and J. Tsitsiklis. Simulation-Based Optimisation of Markov Reward Processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
[22] N. Meuleau, L. Peshkin, K. Kim, and L. Kaelbling. Learning Finite-State Controllers for Partially Observable Environments. UAI, 15:427–436, 1999.
[23] J. Nocedal and S. Wright. Numerical Optimisation. Springer, 2006.
[24] J. Peters and S. Schaal. Natural Actor-Critic. Neurocomputing, 71(7-9):1180–1190, 2008.
[25] K. Rawlik, M. Toussaint, and S. Vijayakumar. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference. International Conference on Robotics Science and Systems, 2012.
[26] S. Richter, D. Aberdeen, and J. Yu. Natural Actor-Critic for Road Traffic Optimisation. NIPS, 19:1169–1176, 2007.
[27] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 13:1057–1063, 2000.
[28] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic Inference for Solving (PO)MDPs. Research Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.
[29] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2):123–130, 2009.
[30] L. Weaver and N. Tao. The Optimal Reward Baseline for Gradient Based Reinforcement Learning. UAI, 17(29):538–545, 2001.
[31] R. Williams. Simple Statistical Gradient Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.
