
Online learning and planning of dynamical systems using Gaussian processes


DOCUMENT INFORMATION

Basic information

Format
Number of pages: 100
File size: 2.1 MB

Content

ONLINE LEARNING AND PLANNING OF DYNAMICAL SYSTEMS USING GAUSSIAN PROCESSES: MODEL-BASED BAYESIAN REINFORCEMENT LEARNING

ANKIT GOYAL
B.Tech., Indian Institute of Technology, Roorkee, India, 2012

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2015

Online learning and planning of dynamical systems using Gaussian processes
Ankit Goyal
April 26, 2015

Declaration

I hereby declare that this is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in this thesis. This thesis has also not been submitted for any degree in any university previously.

Name: Ankit Goyal
Signed:
Date: 30 April 2015

Acknowledgement

First and foremost, I would like to thank my supervisor, Prof. Lee Wee Sun, and my co-supervisor, Prof. David Hsu, for all their help and support. Their keen insight and sound knowledge of fundamentals are a constant source of inspiration to me. I appreciate their long-standing, generous and patient support during my work on the thesis, and I am deeply thankful to them for being available for questions and feedback at all hours. I would also like to thank Prof. Damien Ernst (University of Liege, Belgium) for pointing me towards the relevant medical application, and Dr. Marc Deisenroth (Imperial College London, UK) for his kind support in clearing my doubts regarding control systems and Gaussian processes. I would also like to mention that Marc's PILCO work (his PhD thesis) provided the seed for my initial thought process and played an important role in shaping this thesis into its present form.

My gratitude goes also to my family, for helping me through all of my time at the university. I also thank my lab-mates for the active discussions we have had about various topics; their stimulating conversation helped brighten the day. I would especially like to thank Zhan Wei Lim for all his help and support throughout my candidature. Last but not least, I thank my roommates and friends, who have made it possible for me to feel at home in a new place.

Contents

Acknowledgement (iii)
Contents (iv)
Summary (vii)
List of Tables (ix)
List of Figures (xi)
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Organization
2 Background and related work
  2.1 Background
    2.1.1 Gaussian Process [1]
    2.1.2 Sequential decision making under uncertainty (19)
  2.2 Related work (23)
3 Conceptual framework and proposed algorithm (29)
  3.1 Conceptual framework (34)
    3.1.1 Learning the (auto-regressive) transition model (35)
  3.2 Proposed algorithm (37)
    3.2.1 Computational complexity (38)
    3.2.2 Nearest neighbor search (39)
    3.2.3 Revised algorithm (40)
4 Problem definition and experimental results (45)
  4.1 Learning swing-up control of an under-actuated pendulum (45)
    4.1.1 Experimental results (49)
    4.1.2 Comparison with the Q-learning method (57)
  4.2 Learning STI drug strategies for HIV infected patients (59)
    4.2.1 Experimental results (64)
  4.3 Discussion (66)
5 Conclusion and future work (69)
  5.1 Conclusion (69)
  5.2 Future work (70)
Appendices (73)
A Equations of the dynamical systems (75)
  A.1 Simple pendulum (75)
  A.2 HIV infected patient (77)
B Parameter settings (79)
  B.1 Simple pendulum (79)
  B.2 HIV infected patient (79)
Bibliography (80)

[1] This section has been largely shaped from [Snelson, 2007].

Summary

Decision-making problems with complicated and/or partially unknown underlying generative processes and limited data are pervasive in several research areas, including robotics, automatic control, operations research, artificial intelligence, economics and medicine. In such areas we can benefit greatly from algorithms that learn from data and aid decision making. Over the years, reinforcement learning (RL) has emerged as a general computational framework for goal-directed, experience-based learning for sequential decision making under uncertainty. However, with no task-specific knowledge, it often lacks efficiency in terms of the number of required samples, and this lack of sample efficiency makes RL inapplicable to many real-world problems. A central challenge in RL is therefore how to extract more information from the available experience, so as to facilitate fast learning with little data. The contributions of this dissertation are:

• Proposal of an (online) sequential (or non-episodic) reinforcement learning framework for modeling a variety of single-agent problems and algorithms.

• Systematic treatment of model bias for sample efficiency, by using Gaussian processes for model learning and using the uncertainty information for long-term prediction in the planning algorithms.

• Empirical evaluation of the results for the swing-up control of a simple pendulum and for designing suitable (interrupted) drug strategies for HIV infected patients.

[...] ...Related types of policies have also been proposed in the literature for stochastic and/or partially observable settings, many of them belonging to the class of Monte-Carlo tree search techniques. A key issue for these techniques to work well is to have good tree-exploration strategies.
One of the most popular and effective heuristics is the Upper Confidence Bound (UCB), which has worked well for the game of Go, a game with a very large branching factor. Investigating whether the systematic approach proposed here for designing such strategies could be used in such settings would be very relevant.

• Online optimization of hyper-parameters: We have assumed that the hyper-parameters of the GP model are trained offline, in batch mode, from historical data-sets or by random excitation of the system. Instead of using such a pre-processing step, it would be more relevant for the algorithm to adapt/tune the hyper-parameters of the GP model while interacting with the environment, in some online (non-convex) optimization setting.

• Extension to POMDPs by learning the observation model: Unlike the transition model, the observation model would be much more challenging to learn and would employ techniques from unsupervised learning. We can learn the underlying state-space model directly using latent-variable methods or manifold learning combined with dynamics; [Wang et al., 2008; Lawrence et al., 2011] could provide a starting point to explore and extend further.

• Extension from discrete actions to continuous actions: We can use probabilistic inference techniques to find actions directly in the continuous domain, instead of discretizing the action space and performing tree search.

Appendices

Appendix A: Equations of the dynamical systems

A.1 Simple pendulum

The pendulum shown in Figure A.1 has mass m and length l, and the pendulum angle θ is measured anti-clockwise from hanging down. We assume that the pendulum is thin and that the only source of actuation is the torque τ, which can be applied at the pivot point.

Figure A.1: Pivoted pendulum (pictorial representation)

We derive the equations of motion via the system Lagrangian L, which is the difference between the kinetic energy K and the potential energy V:

\[ L = K - V = \tfrac{1}{2} m v^2 + \tfrac{1}{2} I \dot{\theta}^2 + m l g \cos\theta , \tag{A.1} \]

where g is the acceleration due to gravity and I = ml^2/12 is the moment of inertia of the pendulum about its midpoint. The x and y coordinates of the midpoint of the pendulum are

\[ x = l \sin\theta , \qquad y = l \cos\theta , \tag{A.2} \]

and the squared velocity of the midpoint is

\[ v^2 = \dot{x}^2 + \dot{y}^2 = l^2 \dot{\theta}^2 . \tag{A.3} \]

Therefore,

\[ L = \tfrac{1}{2} m l^2 \dot{\theta}^2 + m g l \cos\theta + \tfrac{1}{2} I \dot{\theta}^2 . \tag{A.4} \]

Given the system Lagrangian, the equations of motion can generally be derived from the set of equations defined through

\[ \frac{d}{dt}\frac{\partial L}{\partial \dot{q}_i} - \frac{\partial L}{\partial q_i} = F_i , \tag{A.5} \]

where the F_i are the non-conservative forces and q_i, \dot{q}_i are the state variables of the system. In our case,

\[ \frac{\partial L}{\partial \dot{\theta}} = m l^2 \dot{\theta} + I \dot{\theta} , \qquad \frac{\partial L}{\partial \theta} = - m l g \sin\theta , \tag{A.6} \]

which gives

\[ \ddot{\theta}\,(m l^2 + I) + m l g \sin\theta = \tau - b \dot{\theta} , \tag{A.7} \]

with b the friction coefficient. Collecting the state variables as x = [x_1, x_2] = [\theta, \dot{\theta}], the equations of motion can be conveniently expressed as two coupled ordinary differential equations,

\[ \dot{x}_1 = x_2 , \qquad \dot{x}_2 = \frac{\tau - b x_2 - m g l \sin x_1}{m l^2 + I} . \tag{A.8} \]

The parameter values selected for the experiments are: m = … kg, l = … m, b = 0.01 and g = 9.81 m/s².
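As a quick illustration of the dynamics in (A.8), the sketch below rolls the pendulum forward with a plain forward-Euler integrator. It is not code from the thesis: the mass, length, time step and constant torque used here are assumed illustrative values (the preview only gives b = 0.01 and g = 9.81 m/s²), and in the experiments the applied torque is restricted to τ ∈ (−1.4, 1.4) N·m and discretized, as described in Appendix B.

```python
import numpy as np

# Assumed illustrative parameters; only b and g are taken from the text.
m, l, b, g = 1.0, 1.0, 0.01, 9.81
I = m * l**2 / 12.0                      # moment of inertia about the midpoint, as in (A.1)

def pendulum_step(x, tau, dt=0.05):
    """One forward-Euler step of (A.8); the state is x = [theta, theta_dot]."""
    theta, theta_dot = x
    theta_ddot = (tau - b * theta_dot - m * g * l * np.sin(theta)) / (m * l**2 + I)
    return np.array([theta + dt * theta_dot,
                     theta_dot + dt * theta_ddot])

# Roll out 100 steps from the hanging-down position under a small constant torque.
x = np.array([0.0, 0.0])
for _ in range(100):
    x = pendulum_step(x, tau=1.0)
print(x)   # final [theta, theta_dot]
```

During planning, a rollout of this kind would be repeated for every candidate sequence of discretized torques; in the thesis the rollouts are produced by the learned GP model rather than by the true equations above.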
A.2 HIV infected patient

In this section, we introduce the mathematical model we have used to artificially generate the data needed by the learning algorithm. The model has been taken from the appendix of [Ernst et al., 2006] and from [Adams et al., 2004], to which we refer the reader for further information. It is described by the following set of ordinary differential equations:

\[
\begin{aligned}
\dot{T}_1 &= \lambda_1 - d_1 T_1 - (1-\epsilon_1) k_1 V T_1 \\
\dot{T}_2 &= \lambda_2 - d_2 T_2 - (1-f\epsilon_1) k_2 V T_2 \\
\dot{T}_1^* &= (1-\epsilon_1) k_1 V T_1 - \delta T_1^* - m_1 E T_1^* \\
\dot{T}_2^* &= (1-f\epsilon_1) k_2 V T_2 - \delta T_2^* - m_2 E T_2^* \\
\dot{V} &= (1-\epsilon_2) N_T \delta (T_1^* + T_2^*) - c V - \big[ (1-\epsilon_1)\rho_1 k_1 T_1 + (1-f\epsilon_1)\rho_2 k_2 T_2 \big] V \\
\dot{E} &= \lambda_E + \frac{b_E (T_1^* + T_2^*)}{(T_1^* + T_2^*) + K_b}\, E - \frac{d_E (T_1^* + T_2^*)}{(T_1^* + T_2^*) + K_d}\, E - \delta_E E
\end{aligned}
\tag{A.9}
\]

where T_1 and T_1^* denote the numbers of non-infected and infected CD4+ T-lymphocytes (in cells/ml), T_2 and T_2^* the numbers of non-infected and infected macrophages (in cells/ml), V the number of free viruses (in copies/ml), and E the number of cytotoxic T-lymphocytes (in cells/ml). The drug efficacies ε1 and ε2 represent the values of the control actions corresponding to the reverse transcriptase inhibitor (RTI) and the protease inhibitor (PI), respectively. In each period during which the RTI and the PI are administered to the patient, ε1 and ε2 are set equal to 0.7 and 0.3 respectively; when not administered, they are set to 0.0. The mean values of the model parameters are taken from [Adams et al., 2004]: λ1 = 10,000, d1 = 0.01, k1 = 8.0 × 10^-7, λ2 = 31.98, d2 = 0.01, f = 0.34, k2 = 1.0 × 10^-4, δ = 0.7, m1 = 1.0 × 10^-5, m2 = 1.0 × 10^-5, N_T = 100, c = 13, ρ1 = 1, ρ2 = 1, λ_E = 1, b_E = 0.3, K_b = 100, d_E = 0.25, K_d = 500, δ_E = 0.1. We have sampled the new patients around these mean parameter values.

Appendix B: Parameter settings

B.1 Simple pendulum

The action τ ∈ (−1.4, 1.4) N·m has been discretized (uniformly) into a fixed number of values. The angular velocity is bounded to ±12 rad/s while generating the data. A small jitter was added to the diagonal elements of the covariance matrix to keep it positive definite at every step. The hyper-parameters were trained in batch mode using the GPy library [2] in Python.

B.2 HIV infected patient

The initial data collected in the pre-processing step for tuning the hyper-parameters was positively skewed. We therefore introduced a transformation step and carried out training and planning in the log(1 + x) domain, where x denotes a state variable; we found reasonably good results with this transformation. The new-patient (patient-dependent) parameters p_i are sampled from p_i ∼ N(µ, (0.05µ)^2), where µ is the mean value of the corresponding parameter given in Appendix A.

[2] GPy: A Gaussian process framework in Python, http://github.com/SheffieldML/GPy

Bibliography

Abbeel, P., Coates, A., Quigley, M., and Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. Advances in Neural Information Processing Systems, 19:1. Abbeel, P., Quigley, M., and Ng, A. Y. (2006). Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 1–8. ACM. Adams, B., Banks, H., Kwon, H.-D., and Tran, H. T. (2004). Dynamic multidrug therapies for HIV: Optimal and STI control approaches. Mathematical Biosciences and Engineering, 1(2):223–241. Astrom, K. and Wittenmark, B. (2008). Adaptive Control. Dover, Mineola, NY. Atkeson, C. G., Moore, A. W., and Schaal, S. (1997). Locally weighted learning for control. In Lazy Learning, pages 75–113. Springer. Atkeson, C. G. and Santamaria, J. C. (1997). A comparison of direct and model-based reinforcement learning. In International Conference on Robotics and Automation. Citeseer. Bajaria, S. H., Webb, G., and Kirschner, D. E. (2004). Predicting differential responses to structured treatment interruptions during HAART.
Bulletin of Mathematical Biology, 66(5):1093–1118. Barto, A. G. (1998). Reinforcement learning: An introduction. MIT press. 81 Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81–138. Bertsekas, D. P., Bertsekas, D. P., Bertsekas, D. P., and Bertsekas, D. P. (1995). Dynamic programming and optimal control, volume 1. Athena Scientific Belmont, MA. Bishop, C. M. et al. (2006). Pattern recognition and machine learning, volume 1. springer New York. Bonhoeffer, S., Rembiszewski, M., Ortiz, G. M., and Nixon, D. F. (2000). Risks and benefits of structured antiretroviral drug therapy interruptions in hiv-1 infection. Aids, 14(15):2313–2322. Caselton, W. F. and Zidek, J. V. (1984). Optimal monitoring network designs. Statistics & Probability Letters, 2(4):223–227. Deisenroth, M., Fox, D., and Rasmussen, C. (2013). Gaussian processes for dataefficient learning in robotics and control. Deisenroth, M. P. (2010). Efficient reinforcement learning using gaussian processes, volume 9. KIT Scientific Publishing. Deisenroth, M. P., Rasmussen, C. E., and Peters, J. (2009). Gaussian process dynamic programming. Neurocomputing, 72(7):1508–1524. Duff, M. (2003). Design for an optimal probe. In ICML, pages 131–138. Engel, Y., Mannor, S., and Meir, R. (2003). Bayes meets bellman: The gaussian process approach to temporal difference learning. In ICML, volume 20, page 154. Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. In Journal of Machine Learning Research, pages 503–556. 82 Ernst, D., Stan, G.-B., Goncalves, J., and Wehenkel, L. (2006). Clinical data based optimal sti strategies for hiv: a reinforcement learning approach. In Decision and Control, 2006 45th IEEE Conference on, pages 667–672. IEEE. Fabri, S. and Kadirkamanathan, V. (1998). Dual adaptive control of nonlinear stochastic systems using neural networks. Automatica, 34(2):245–253. Feldbaum, A. (1960). Dual control theory. Automation and Remote Control, 21(9):874–1039. Girard, A., Rasmussen, C. E., Quinonero-Candela, J., and Murray-Smith, R. (2003). Gaussian process priors with uncertain inputs? application to multiplestep ahead time series forecasting. Guestrin, C., Krause, A., and Singh, A. P. (2005). Near-optimal sensor placements in gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pages 265–272. ACM. Hitsuda, M. et al. (1968). Representation of gaussian processes equivalent to wiener process. Osaka Journal of Mathematics, 5(2):299–312. Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. arXiv preprint cs/9605103. Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Fluids Engineering, 82(1):35–45. Ko, J., Klein, D. J., Fox, D., and Haehnel, D. (2007). Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In ICRA, pages 742–747. Kolter, J. Z., Plagemann, C., Jackson, D. T., Ng, A. Y., and Thrun, S. (2010). A probabilistic approach to mixed open-loop and closed-loop control, with application to extreme autonomous driving. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 839–845. IEEE. 83 Krause, A. and Guestrin, C. (2007). Nonmyopic active learning of gaussian processes: an exploration-exploitation approach. In Proceedings of the 24th international conference on Machine learning, pages 449–456. ACM. Lawrence, N. 
D., Titsias, M. K., and Damianou, A. (2011). Variational gaussian process dynamical systems. In Advances in Neural Information Processing Systems, pages 2510–2518. Lisziewicz, J., Rosenberg, E., Lieberman, J., Jessen, H., Lopalco, L., Siliciano, R., Walker, B., and Lori, F. (1999). Control of hiv despite the discontinuation of antiretroviral therapy. New England Journal of Medicine, 340(21):1683–1683. Lori, F., Maserati, R., Foli, A., Seminari, E., Timpone, J., and Lisziewicz, J. (2000). Structured treatment interruptions to control hiv-1 infection. The Lancet, 355(9200):287–288. Lozano, F., Lozano, J., and Garcia, M. (2007). An artificial economy based on reinforcement learning and agent based modeling. Documentos de Trabajo, Facultad de Economia, Universidad del Rosario, (18). Maciejowski, J. M. (2002). Predictive control: with constraints. Pearson education. MacKay, D. J. (1998). Introduction to gaussian processes. NATO ASI Series F Computer and Systems Sciences, 168:133–166. Mayne, D. Q. and Michalska, H. (1990). Receding horizon control of nonlinear systems. Automatic Control, IEEE Transactions on, 35(7):814–824. McFarlane, D. C. and Glover, K. (1990). Robust controller design using normalized coprime factor plant descriptions. Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of Toronto. 84 Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287. Powell, W. B. (2012). Ai, or and control theory: a rosetta stone for stochastic optimization. Princeton University. Puterman, M. L. (2009). Markov decision processes: discrete stochastic dynamic programming, volume 414. John Wiley & Sons. Rasmussen, C. E. (2006). Gaussian processes for machine learning. Rasmussen, C. E., Kuss, M., et al. (2003). Gaussian processes in reinforcement learning. In NIPS, volume 4, page 1. Riedmiller, M. (2005). Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer. Schaal, S. et al. (1997). Learning from demonstration. Advances in neural information processing systems, pages 1040–1046. Schneider, J. G. (1997). Exploiting model uncertainty estimates for safe dynamic control learning. Advances in Neural Information Processing Systems, pages 1047–1053. Simao, H. P., Day, J., George, A. P., Gifford, T., Nienow, J., and Powell, W. B. (2009). An approximate dynamic programming algorithm for large-scale fleet management: A case application. Transportation Science, 43(2):178–197. Smallwood, R. D. and Sondik, E. J. (1973). The optimal control of partially observable markov processes over a finite horizon. Operations Research, 21(5):1071– 1088. Snelson, E. and Ghahramani, Z. (2006). Sparse gaussian processes using pseudoinputs. 85 Snelson, E. L. (2007). Flexible and efficient Gaussian process models for machine learning. PhD thesis, University of London. Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference on machine learning, pages 216–224. Sutton, R. S. and Barto, A. G. (1998). Introduction to reinforcement learning. MIT Press. Sutton, R. S., Barto, A. G., and Williams, R. J. (1992). Reinforcement learning is direct adaptive optimal control. Control Systems, IEEE, 12(2):19–22. Tesauro, G. (1994). 
Td-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219. Wang, J. M., Fleet, D. J., and Hertzmann, A. (2008). Gaussian process dynamical models for human motion. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(2):283–298. Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292. Wilson, A., Fern, A., and Tadepalli, P. (2010). Incorporating domain models into Bayesian optimization for RL. In Machine Learning and Knowledge Discovery in Databases, pages 467–482. Springer. Wittenmark, B. (1995). Adaptive dual control methods: An overview.

[...]

1.1 Motivation

As a joint field of computational statistics and artificial intelligence, machine learning is concerned with the design and development of methods, algorithms and techniques that allow computers to learn structure from data and extract relevant information in an automated fashion. As a branch of machine learning, reinforcement learning (RL) is a computational approach to learning from interactions...

...term planning and decision making, and hence reduce the model bias in a principled manner. Our framework assumes a fully observable world and is applicable to sequential tasks with dynamic (non-stationary) environments. Hence, our approach combines ideas from optimal control with the generality of reinforcement learning, and narrows the gap between planning, control and learning. A logical extension of the...

...research gaps and relevant directions for further work and extensions.

Chapter 2: Background and related work

2.1 Background

We provide a brief overview and background on Gaussian processes and sequential decision making under uncertainty, the two central elements of this thesis. For more details on Gaussian processes in the context of machine learning, we refer the reader to [Rasmussen, 2006; Bishop...

...List of Tables: 4.1 Deterministic pendulum: Average time steps ± 1.96 × standard error for different planning horizons and nearest neighbors (52). 4.2 Stochastic pendulum: Average time steps ± 1.96 × standard error for different planning horizons and nearest neighbors (55). 4.3 Partially-observable pendulum: Average time steps ± 1.96 × standard error for different planning horizons and nearest...

...can also be thought of as a collection of random variables, any finite number of which have (consistent) Gaussian distributions. Suppose we choose a particular finite subset of these random function variables, f = f_1, f_2, …, f_N, with corresponding inputs X = x_1, x_2, …, x_N, where f_1 = f(x_1), f_2 = f(x_2), …, f_N = f(x_N). In a GP, any such set of random function variables is multivariate Gaussian distributed...

...denotes a multivariate Gaussian distribution with mean vector µ and covariance matrix K. These Gaussian distributions are consistent, and the usual rules of probability apply to the collection of random variables, e.g. marginalization,

\[ p(f_1) = \int p(f_1, f_2)\, df_2 , \tag{2.5} \]

and conditioning,

\[ p(f_1 \mid f_2) = \frac{p(f_1, f_2)}{p(f_2)} . \tag{2.6} \]

A Gaussian process is fully specified by a mean function m(x) and covariance function...

...decision making, we refer to [Sutton and Barto, 1998; Puterman, 2009; Bertsekas et al., 1995].

2.1.1 Gaussian Process [1]

The Gaussian process (GP) is a simple, tractable and general class of probability distributions on functions. The concept of the GP is quite old and has been studied over centuries under different names, for instance the famous Wiener process, a particular type of Gaussian process [Hitsuda et al.,...
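To make the marginalization and conditioning identities (2.5)–(2.6) concrete, the sketch below implements the standard GP regression posterior with a squared-exponential covariance: it conditions the joint Gaussian over training and test function values on the noisy training observations. This is a generic textbook construction in the spirit of [Rasmussen, 2006], not code from the thesis (which trains its models with the GPy library, see Appendix B); the kernel hyper-parameters and toy data are arbitrary illustrative choices.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance k(x, x') = s^2 exp(-|x - x'|^2 / (2 l^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, noise_var=1e-2):
    """Posterior mean and covariance of f(X_star) given noisy observations y = f(X) + e."""
    K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))   # noise term keeps K positive definite
    K_s = sq_exp_kernel(X, X_star)
    K_ss = sq_exp_kernel(X_star, X_star)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K^{-1} y
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov

# Toy example: regress a noisy sine.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-3, 3, 5)[:, None]
mu, Sigma = gp_posterior(X, y, X_star)
print(mu, np.sqrt(np.diag(Sigma)))   # predictive mean and one standard deviation
```

The `noise_var` term added to the diagonal plays the same role as the jitter mentioned in Appendix B.1: it keeps the covariance matrix positive definite.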
...environment using Gaussian processes, and for each time step the best action is computed by tree search in a receding-horizon manner. Currently, our proposed algorithm can handle learning problems with continuous state spaces and discretized action spaces.

• Showed the success of the algorithm in learning the swing-up control of a simple pendulum. The swing-up task is generally considered hard in the control literature and...

...List of Figures: 4.9 Empirical evaluation of the Q-learning method for learning swing-up control of the simple pendulum (58). 4.10 Policy comparison of our method and the Q-learning agent (59). 4.11 E_1(q): unhealthy locally asymptotically stable equilibrium point with its domain of attraction N_1(q); E_2(q): healthy locally asymptotically stable equilibrium point with its domain of attraction N_2(q);...

...black, along with two standard deviations in gray. Figure 2.2: Gaussian process posterior and uncertain test input.

Generally, if a Gaussian input x_* ∼ N(µ, Σ) is mapped through a nonlinear function, the exact predictive distribution is non-Gaussian and non-unimodal,

\[ p(f_* \mid \mu, \Sigma) = \int p(f_* \mid x_*)\, p(x_* \mid \mu, \Sigma)\, dx_* , \tag{2.22} \]

as shown in [Figure 2.3a], and cannot be computed analytically, and one may have to resort to...
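One common way around the intractable integral in (2.22) is exact moment matching of the GP predictive distribution, as in Girard et al. (2003) and the PILCO line of work. A simpler alternative, sketched below purely for illustration, is a Monte-Carlo approximation: sample the uncertain input, query the GP posterior at each sample, and collapse the resulting mixture of Gaussians to its first two moments. The predictor it wraps is assumed to come from the `gp_posterior` sketch above or from a library such as GPy; nothing here is taken from the thesis' own implementation.

```python
import numpy as np

def predict_uncertain_input(predict, mu, Sigma, n_samples=500, rng=None):
    """Monte-Carlo approximation of the integral in (2.22).

    `predict(x)` must return the GP predictive mean and variance at a single
    test input x (e.g. a wrapper around gp_posterior, or GPy's model.predict).
    The uncertain input is x_* ~ N(mu, Sigma).
    """
    rng = np.random.default_rng() if rng is None else rng
    xs = rng.multivariate_normal(np.atleast_1d(mu), np.atleast_2d(Sigma), size=n_samples)
    means = np.empty(n_samples)
    variances = np.empty(n_samples)
    for i, x_star in enumerate(xs):
        means[i], variances[i] = predict(x_star)
    # Collapse the Gaussian mixture to its first two moments
    # (law of total expectation / law of total variance).
    return means.mean(), variances.mean() + means.var()

# Example, reusing X, y and gp_posterior from the previous listing:
# def predict(x):
#     m, C = gp_posterior(X, y, x[None, :])
#     return m[0], C[0, 0]
# mean, var = predict_uncertain_input(predict, mu=np.array([0.5]), Sigma=np.array([[0.04]]))
```

The sampled approximation is easy to implement but scales poorly with horizon length, which is one reason moment-matching approximations are preferred for multi-step prediction.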

Date posted: 22/09/2015, 15:18
