
102 nips-2004-Learning first-order Markov models for control


Source: pdf

Author: Pieter Abbeel, Andrew Y. Ng

Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the first-order conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data.
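As a sketch of the alternative criterion the abstract refers to (reconstructed from the notation of the appendix below, not quoted verbatim from the paper's Eqn. (7)), the first-order parameters are scored on all discounted multi-step predictions rather than on single-step transitions alone:

\[
  \max_{\hat\theta}\; \sum_{t=0}^{T-1} \sum_{k=1}^{T-t} \gamma^k \log P_{\hat\theta}\bigl(s_{t+k} \mid s_t, a_{t:t+k-1}\bigr)
  \qquad \text{compared with standard maximum likelihood} \qquad
  \max_{\hat\theta}\; \sum_{t=0}^{T-1} \log P_{\hat\theta}\bigl(s_{t+1} \mid s_t, a_t\bigr),
\]

where each multi-step probability is obtained by marginalizing out the intermediate states under the same first-order model.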


reference text

[1] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.

[2] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. JAIR, 2000.

[3] P. Gahinet, A. Nemirovski, A. Laub, and M. Chilali. LMI Control Toolbox. Natick, MA, 1995.

[4] Z. Ghahramani. Learning dynamic Bayesian networks. In Adaptive Processing of Sequences and Data Structures, pages 168–197. Springer-Verlag, 1998.

Footnote 9 (experimental details): The true process has two different (hidden) modes for arrivals. The first mode has a very low arrival rate, and the second mode has a very high arrival rate. We denote the steady-state distribution over the two modes by (φ1, φ2). (I.e., the system spends a fraction φ1 of the time in the low arrival rate mode, and a fraction φ2 = 1 − φ1 of the time in the high arrival rate mode.) Given the steady-state distribution, the mode transition matrix [a, 1 − a; 1 − b, b] has only one remaining degree of freedom, which (essentially) controls how often the system switches between the two modes. (Here, a [resp. b] is the probability, if we are in the slow [resp. fast] mode, of staying in the same mode at the next time step.) More specifically, assuming φ1 > φ2, we have b ∈ [0, 1] and a = 1 − (1 − b)φ2/φ1. The larger b is, the more slowly the system switches between modes.

Our experiments used φ1 = 0.8, φ2 = 0.2, P(arrival | mode 1) = 0.01, P(arrival | mode 2) = 0.99. This means b = 0.2 gives independent arrival modes for consecutive time steps. In our experiments, q0 = 0, and q1 was equal to the average arrival rate p = φ1 P(arrival | mode 1) + φ2 P(arrival | mode 2). Note that the highest service rate q2 (= 0.75) is lower than the fast mode’s arrival rate. Training data was generated using 8000 simulations of 25 time steps each, in which the queue length is initialized randomly and the same (randomly chosen) action is taken on all 25 time steps. To reduce computational requirements, we only included the terms of the objective (Eqn. 7) for which k = 20. We used a discount factor γ = 0.95 and approximated utilities by truncating at a finite horizon of 100. Note that although we explain the queuing process by arrival/departure rates, the algorithm learns full transition matrices for each action, and not only the arrival/departure rates.
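To make the construction in the footnote concrete, here is a minimal simulation sketch of the two-mode arrival process, assuming the parameter values quoted above. The function names and the use of NumPy are illustrative choices, not code from the paper, and the queue/service side of the problem is omitted.

import numpy as np

def mode_transition_matrix(phi1, phi2, b):
    """Build the 2x2 mode transition matrix [a, 1-a; 1-b, b] whose
    stationary distribution is (phi1, phi2), given the 'slowness' b."""
    a = 1.0 - (1.0 - b) * phi2 / phi1   # as in the footnote above
    return np.array([[a, 1.0 - a],
                     [1.0 - b, b]])

def simulate_arrivals(T, phi1=0.8, phi2=0.2, b=0.2,
                      p_arrival=(0.01, 0.99), rng=None):
    """Simulate T time steps of the hidden two-mode arrival process.
    Returns the (hidden) mode sequence and the observed 0/1 arrivals."""
    rng = np.random.default_rng() if rng is None else rng
    P = mode_transition_matrix(phi1, phi2, b)
    mode = rng.choice(2, p=[phi1, phi2])          # start in steady state
    modes, arrivals = [], []
    for _ in range(T):
        arrivals.append(rng.random() < p_arrival[mode])
        modes.append(mode)
        mode = rng.choice(2, p=P[mode])           # hidden mode transition
    return np.array(modes), np.array(arrivals, dtype=int)

# One training trajectory of 25 time steps, as in the experiments above.
modes, arrivals = simulate_arrivals(25)

With these defaults, mode_transition_matrix(0.8, 0.2, 0.2) has two identical rows [0.8, 0.2], so consecutive modes are drawn independently from the steady state, matching the remark above that b = 0.2 gives independent arrival modes for consecutive time steps.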

[5] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 1998.

[6] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In NIPS 12, 1999.

[7] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. MIT Press, 1999.

[8] H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependencies in stochastic language modeling. Computer Speech and Language, 8, 1994.

[9] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In NIPS 14, 2002.

[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[11] D. Precup, R. S. Sutton, and S. Singh. Theoretical results on reinforcement learning with temporally abstract options. In Proc. ECML, 1998.

[12] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 1989.

[13] J. K. Satia and R. L. Lave. Markov decision processes with uncertain transition probabilities. Operations Research, 1973.

[14] L. K. Saul and M. I. Jordan. Mixed memory Markov models: decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, 37, 1999.

[15] R. S. Sutton. TD models: Modeling the world at a mixture of time scales. In Proc. ICML, 1995.

[16] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.

[17] C. C. White and H. K. Eldeib. Markov decision processes with imprecise transition probabilities. Operations Research, 1994.

Appendix: Derivation of EM algorithm

This appendix derives the EM algorithm that optimizes Eqn. (7). The derivation is based on [7]’s method. Note that because of discounting, the objective is slightly different from the standard setting of learning the parameters of a Markov chain with unobserved variables in the training data. Since we are using a first-order model, we have
\[
P_{\hat\theta}(s_{t+k} \mid s_t, a_{t:t+k-1})
 = \sum_{S_{t+1:t+k-1}} P_{\hat\theta}(s_{t+k} \mid S_{t+k-1}, a_{t+k-1})\, P_{\hat\theta}(S_{t+k-1} \mid S_{t+k-2}, a_{t+k-2}) \cdots P_{\hat\theta}(S_{t+1} \mid s_t, a_t).
\]
Here, the summation is over all possible state sequences $S_{t+1:t+k-1}$. So we have
\begin{align}
\sum_{t=0}^{T-1} \sum_{k=1}^{T-t} \gamma^k \log P_{\hat\theta}(s_{t+k} \mid s_t, a_{t:t+k-1})
 &= \sum_{t=0}^{T-1} \gamma \log P_{\hat\theta}(s_{t+1} \mid s_t, a_t)
  + \sum_{t=0}^{T-1} \sum_{k=2}^{T-t} \gamma^k \log
    \sum_{S_{t+1:t+k-1}} Q_{t,k}(S_{t+1:t+k-1})\,
    \frac{P_{\hat\theta}(s_{t+k} \mid S_{t+k-1}, a_{t+k-1})\, P_{\hat\theta}(S_{t+k-1} \mid S_{t+k-2}, a_{t+k-2}) \cdots P_{\hat\theta}(S_{t+1} \mid s_t, a_t)}{Q_{t,k}(S_{t+1:t+k-1})} \nonumber \\
 &\ge \sum_{t=0}^{T-1} \gamma \log P_{\hat\theta}(s_{t+1} \mid s_t, a_t)
  + \sum_{t=0}^{T-1} \sum_{k=2}^{T-t} \gamma^k
    \sum_{S_{t+1:t+k-1}} Q_{t,k}(S_{t+1:t+k-1})
    \log \frac{P_{\hat\theta}(s_{t+k} \mid S_{t+k-1}, a_{t+k-1})\, P_{\hat\theta}(S_{t+k-1} \mid S_{t+k-2}, a_{t+k-2}) \cdots P_{\hat\theta}(S_{t+1} \mid s_t, a_t)}{Q_{t,k}(S_{t+1:t+k-1})}.
\tag{8}
\end{align}
Here, $Q_{t,k}$ is a probability distribution, and the inequality follows from Jensen’s inequality and the concavity of $\log(\cdot)$. As in [7], the EM algorithm optimizes Eqn. (8) by alternately optimizing with respect to the distributions $Q_{t,k}$ (E-step) and with respect to the transition probabilities $P_{\hat\theta}(\cdot \mid \cdot, \cdot)$ (M-step). Optimizing with respect to the $Q_{t,k}$ variables (E-step) is achieved by setting
\begin{equation}
Q_{t,k}(S_{t+1:t+k-1}) = P_{\hat\theta}\bigl(S_{t+1}, \ldots, S_{t+k-1} \mid S_t = s_t,\; S_{t+k} = s_{t+k},\; A_{t:t+k-1} = a_{t:t+k-1}\bigr).
\tag{9}
\end{equation}
Optimizing with respect to the transition probabilities $P_{\hat\theta}(\cdot \mid \cdot, \cdot)$ (M-step), for $Q_{t,k}$ fixed as in Eqn. (9), is done by updating $\hat\theta$ to $\hat\theta_{\mathrm{new}}$ such that for all $i, j \in S$ and all $a \in A$ we have $P_{\hat\theta_{\mathrm{new}}}(j \mid i, a) = \mathrm{stats}(j, i, a) / \sum_{k \in S} \mathrm{stats}(k, i, a)$, where
\[
\mathrm{stats}(j, i, a) = \sum_{t=0}^{T-1} \sum_{k=1}^{T-t} \sum_{l=0}^{k-1} \gamma^k\,
 P_{\hat\theta}\bigl(S_{t+l+1} = j,\, S_{t+l} = i \mid S_t = s_t,\, S_{t+k} = s_{t+k},\, A_{t:t+k-1} = a_{t:t+k-1}\bigr)\, \mathbf{1}\{a_{t+l} = a\}.
\]
Note that only the pairwise marginals $P_{\hat\theta}(S_{t+l+1}, S_{t+l} \mid S_t, S_{t+k}, A_{t:t+k-1})$ are needed in the M-step, and so it is sufficient to compute only these when optimizing with respect to the $Q_{t,k}$ variables in the E-step.
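As a companion to the derivation above, the following is a minimal sketch of how the M-step's stats accumulation and normalization could be implemented, assuming the E-step's pairwise posterior marginals are supplied by a user-provided routine (the placeholder `pairwise_marginals` below). The names, array layout, and use of NumPy are assumptions for illustration, not the authors' implementation.

import numpy as np

# Minimal sketch of the M-step derived above (assumed layout, not the
# authors' code). `pairwise_marginals(t, k)` stands in for the E-step:
# it must return an array M of shape (k, n_states, n_states) with
#   M[l, i, j] = P_theta(S_{t+l+1}=j, S_{t+l}=i |
#                        S_t=s_t, S_{t+k}=s_{t+k}, A_{t:t+k-1}=a_{t:t+k-1}),
# i.e. exactly the pairwise marginals the appendix notes are sufficient.

def m_step(actions, T, pairwise_marginals, n_states, n_actions, gamma=0.95):
    """Return P_new with P_new[a, i, j] = stats(j,i,a) / sum_{j'} stats(j',i,a)."""
    stats = np.zeros((n_actions, n_states, n_states))   # indexed [a, i, j]
    for t in range(T):
        for k in range(1, T - t + 1):
            M = pairwise_marginals(t, k)
            for l in range(k):
                # The indicator 1{a_{t+l} = a}: only the action actually
                # taken at time t+l accumulates these discounted counts.
                stats[actions[t + l]] += gamma ** k * M[l]
    totals = stats.sum(axis=2, keepdims=True)
    # Normalize per (state i, action a); unvisited pairs fall back to uniform.
    return np.divide(stats, totals,
                     out=np.full_like(stats, 1.0 / n_states),
                     where=totals > 0)

In practice one would alternate this update with an E-step that recomputes the pairwise marginals under the new parameters (e.g. by a forward-backward pass over the first-order chain clamped at s_t and s_{t+k}), and, as in the experiments above, one could truncate the sum over k to keep the computation tractable.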