Continuous POMDPs for robotic tasks


CONTINUOUS POMDPS FOR ROBOTIC TASKS

Bai, Haoyu
B.Sc., Fudan University, 2009

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2014

Declaration

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has also not been submitted for any degree in any university previously.

Bai, Haoyu
September 23, 2014

Acknowledgements

I would like to thank my advisor, Professor David Hsu, for all of his support and insight. David is a guide, not only for the journey towards this thesis, but also for the journey towards a meaningful life. His insightful suggestions and our heated discussions still echo in my ears, and will continue to reshape my perspective on the world. Professor Lee Wee Sun has also closely advised my research. With his deep knowledge and sharp mind, Wee Sun has generated so many sparks in our discussions. I appreciate the suggestions by Professor Leong Tze Yun and Professor Bryan Low, which were very helpful in improving the thesis.

A big portion of my time on this tropical island was spent in our small but cozy lab, which I shared with many labmates: Wu Dan, Amit Jain, Koh Li Ling, Won Kok Sung, Lim Zhan Wei, Ye Nan, Wu Kegui, Cao Nannan, Le Trong Dao, and others. I have learnt so much from them. I am fortunate to have met many good friends, including my fellow alumnus Luo Hang and my roommates Cheng Yuan, Lu Xuesong, Wei Xueliang, and Gao Junhong. Because of them, life in Singapore has been more colorful. It is wonderful that wherever I go there are always old friends awaiting me: Qianqian in Munich, Huichao in Pittsburgh, Jianwen and Jing in Washington D.C., Siyu, Wenshi and Shi in Beijing, Zhiqing in Tianjin, and Ning, Xueliang and many other friends in the Bay Area. Far away from home, we support each other.

I will always remember the heartwarming scene of my father, Bai Jianhua, sharpening pencils for me with a small knife in my early school years. My mother, Wang Guoli, introduced me to computer science and always encouraged me to pursue my dream. I am so grateful that they gave me my first PC, an 80286, which accompanied me for many joyful days and nights. The most important wisdom I have ever heard is from my grandma: “Don’t do evil with your technology.” Finally, to my wife Yumei, who keeps feeding me the energy to complete this thesis.

Contents

List of Tables  vii
List of Figures  ix
1 Introduction
  1.1 Overview
  1.2 Contribution
  1.3 Outline
2 Background
  2.1 POMDP Preliminary
  2.2 Related Work  12
3 Continuous-state POMDPs  25
  3.1 Modelling in Continuous Space  25
  3.2 Value Iteration and Policy Graph  28
  3.3 Monte Carlo Value Iteration  31
  3.4 Analysis  37
  3.5 Experiments  42
  3.6 Application to Unmanned Aircraft Collision Avoidance  48
  3.7 Summary  58
4 Planning How to Learn  61
  4.1 Overview  61
  4.2 Problem Formulation  64
  4.3 Algorithm  66
  4.4 Experiments  69
  4.5 Summary  76
5 Continuous-observation POMDPs  79
  5.1 Generalized Policy Graph  79
  5.2 Algorithm  81
  5.3 Analysis  85
  5.4 Experiments  90
  5.5 Proofs  103
  5.6 Summary  110
6 Conclusion  113
Bibliography  117

Abstract

Planning in uncertain and dynamic environments is an essential capability for autonomous robots. Partially observable Markov decision processes (POMDPs) provide a general framework for solving such problems and have been applied to different robotic tasks such as manipulation with robot hands, self-driving car navigation, and unmanned aircraft collision avoidance. While there has been dramatic progress in solving discrete POMDPs, progress on continuous POMDPs has been limited. However, it is often much more natural to model robotic tasks in a continuous space.

We developed several algorithms that enable POMDP planning with continuous states, continuous observations, and continuous unknown model parameters. These algorithms have been applied to different robotic tasks such as unmanned aircraft collision avoidance and autonomous vehicle navigation. Experimental results for these robotic tasks demonstrated the benefits of probabilistic planning with continuous models: continuous models are simpler to construct and provide a more accurate description of the robot system, and our continuous planning algorithms are general for a broad class of tasks, scale to more difficult problems, and often result in improved performance compared with discrete planning. Therefore, these algorithmic and modeling techniques are powerful tools for robotic planning under uncertainty. They are necessary for building more intelligent and reliable robots and would eventually lead to wider application of robotic technology.

[...]

Chapter 5. Continuous-observation POMDPs

Let $\epsilon_t = \max_{b \in B} |V^*(b) - V_t(b)|$ be the maximum error of $V_t(b)$ over the sampled beliefs in $B$. We first bound the maximum error of $V_t(b)$ at an arbitrary belief $b \in \mathcal{B}$ in terms of $\epsilon_t$. For any point $b \in \mathcal{B}$, let $b'$ be the closest point in $B$ to $b$. Then
\[
|V^*(b) - V_t(b)| \le |V^*(b) - V^*(b')| + |V^*(b') - V_t(b')| + |V_t(b') - V_t(b)| . \tag{5.15}
\]
Applying Lemma 5.3 twice, to $V^*$ and $V_t$ respectively, and observing that $|V^*(b') - V_t(b')| \le \epsilon_t$, we get
\[
|V^*(b) - V_t(b)| \le \frac{2R_{\max}}{1-\gamma}\,\delta_B + \epsilon_t . \tag{5.16}
\]
Next, we bound the error $|V^*(b') - V_t(b')|$. For any $b' \in B$,
\[
|V^*(b') - V_t(b')| \le |HV^*(b') - HV_{t-1}(b')| + |HV_{t-1}(b') - \hat{H}_{b'}V_{t-1}(b')| , \tag{5.17}
\]
where $\hat{H}_{b'}$ denotes invoking MC-Backup at $b'$. The inequality in (5.17) holds because, by definition, $V^*(b') = HV^*(b')$ and $V_t(b') = \hat{H}_{b'}V_{t-1}(b')$, so the left-hand side equals $|HV^*(b') - \hat{H}_{b'}V_{t-1}(b')|$ and the triangle inequality applies. It is well known that the backup operator $H$ is a contraction. The contraction property and (5.16) together imply
\[
|HV^*(b') - HV_{t-1}(b')| \le \gamma \max_{b \in \mathcal{B}} |V^*(b) - V_{t-1}(b)| \le \gamma \left( \frac{2R_{\max}}{1-\gamma}\,\delta_B + \epsilon_{t-1} \right) . \tag{5.18}
\]
Theorem 5.2 guarantees that a single invocation of MC-Backup at a belief $b'$ has a small approximation error with high probability, if $N$ and $M$ are sufficiently large.
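For intuition about where sample-complexity functions such as $\rho_N$ typically come from, the following Hoeffding-style bound is a simplified illustration: it covers only the error from averaging $N$ simulated returns per action and ignores the observation-sampling error controlled by $M$, so it is an assumed sketch rather than the precise statement of Theorem 5.2.

```latex
% Illustrative only: each simulated discounted return is assumed to lie in
% [-R_max/(1-gamma), R_max/(1-gamma)], and the maximization over |A| actions
% is handled by a union bound.
\[
  \Pr\Bigl(\bigl|\hat{H}_{b'}V(b') - HV(b')\bigr| \ge \epsilon\Bigr)
  \;\le\; 2\,|A|\exp\!\left(-\frac{N\epsilon^{2}(1-\gamma)^{2}}{2R_{\max}^{2}}\right),
  \qquad\text{so}\qquad
  N \;\ge\; \frac{2R_{\max}^{2}}{\epsilon^{2}(1-\gamma)^{2}}\,\ln\frac{2|A|}{\tau}
  \;\Longrightarrow\; \text{failure probability} \le \tau .
\]
```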
To obtain $V_t$, we perform $t$ iterations of backup over the set $B$ synchronously, so there are $|B|\,t$ invocations of MC-Backup in total. Applying the union bound together with Theorem 5.2, every MC-Backup invocation achieves
\[
|HV_{t-1}(b') - \hat{H}_{b'}V_{t-1}(b')| < \epsilon \tag{5.19}
\]
with probability at least $1-\tau$, if we choose $N \ge \rho_N\!\left(\epsilon/5,\ \tau/(2|A||B|t)\right)$ and $M \ge \rho_M\!\left(\epsilon/5,\ \tau/(2|A||B|t)\right)$. We then combine (5.17)-(5.19) together with the definition of $\epsilon_t$ and get
\[
\epsilon_t \le \gamma \left( \frac{2R_{\max}}{1-\gamma}\,\delta_B + \epsilon_{t-1} \right) + \epsilon .
\]
For any initial policy graph, the error is bounded by $\epsilon_0 \le 2R_{\max}/(1-\gamma)$. Solving the recurrence relation, we have
\[
\epsilon_t \le (1-\gamma^t)\left( \frac{2\gamma R_{\max}\,\delta_B}{(1-\gamma)^2} + \frac{\epsilon}{1-\gamma} \right) + \frac{2\gamma^t R_{\max}}{1-\gamma}
\le \frac{2\gamma R_{\max}\,\delta_B}{(1-\gamma)^2} + \frac{\epsilon}{1-\gamma} + \frac{2\gamma^t R_{\max}}{1-\gamma} .
\]
Substituting this into (5.16) gives us the final result.

5.6 Summary

This chapter presents a new algorithm for solving POMDPs with continuous states and observations. These continuous models are natural for robotic tasks that require integrated perception and planning. We provide experimental results demonstrating the potential of this new algorithm for robot planning and learning under uncertainty, and we provide a theoretical analysis of the convergence of the algorithm.

The algorithm can be applied to robotic tasks with very large or continuous state and observation spaces. It can handle problems with complex dynamics and sensor models. The approach is especially suitable for problems that require real-time control, since it computes a policy offline for fast online execution. However, the algorithm has a high computational cost due to the large number of Monte Carlo simulations, which limits its scope to problems with short planning horizons. The limitation could be addressed with parallel computing [Lee and Kim, 2013] or macro-actions [Lim et al., 2011].

Our algorithm uses sampling instead of fixed discretization to handle continuous state and observation spaces. The foundational ideas of probabilistic sampling and Monte Carlo simulation open up a range of new opportunities to design algorithms for complex robotic planning and learning tasks. The improved capabilities of planning under uncertainty algorithms could also potentially enable new applications of robotic technologies.

Chapter 6. Conclusion

POMDPs are a very general framework for planning under uncertainty. However, modeling robotic tasks with POMDPs is often difficult. Existing POMDP algorithms mostly handle discrete POMDPs, while the natural model for a robotic task usually involves continuous states, continuous observations, and also continuous model parameters. This thesis developed three algorithms. Although each algorithm tackles the continuity in a different component of the POMDP model, all of them are built upon the foundational ideas of probabilistic sampling and Monte Carlo simulation, which are proven techniques for effectively attacking high-dimensional and continuous spaces.

Our first algorithm, MCVI, solves POMDPs with continuous state spaces. It samples the state space and applies Monte Carlo simulations over value iterations to construct a policy graph. The second algorithm specializes in POMDPs that model robotic tasks with uncertain model parameters, where the possible parameter values are continuous and the dynamics of the robotic system form a continuous state space. The algorithm samples the model parameter space and uses motion planning algorithms as a heuristic to guide the policy tree construction.
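The policy graph that MCVI constructs can be made concrete with a minimal, hypothetical sketch (illustrative code, not the thesis implementation; the simulator interface `env.step` is an assumption): each node stores an action and, with discrete observations, one outgoing edge per observation, and executing the policy simply alternates between taking the node's action and following the edge labelled by the received observation.

```python
class PolicyGraphNode:
    """One node of a policy graph: an action plus one outgoing edge per observation."""

    def __init__(self, action):
        self.action = action
        self.edges = {}  # maps a (discrete) observation to a successor node


def execute_policy(start_node, env, num_steps):
    """Run a policy graph against a simulator.

    `env` is a hypothetical simulator exposing step(action) -> (observation, reward);
    it stands in for whatever robot or simulation interface the task provides.
    """
    node, total_reward = start_node, 0.0
    for _ in range(num_steps):
        observation, reward = env.step(node.action)  # act, then sense
        total_reward += reward
        # Follow the edge labelled by the observation; stay at the current node
        # if this observation has no outgoing edge.
        node = node.edges.get(observation, node)
    return total_reward
```

The continuous-observation extension described next replaces the per-observation `edges` dictionary with a classifier that maps a raw, continuous observation to a successor node.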
Finally, the third algorithm extends MCVI to handle continuous state spaces and continuous observation spaces simultaneously. The algorithm generalizes the policy graph so that each graph node becomes a classifier. The classifier exploits the structure of the value iteration equation, and it is constructed by sampling the continuous spaces and performing Monte Carlo simulations.

Although probabilistic sampling and Monte Carlo simulations introduce approximation errors, the theoretical analysis of our algorithms guarantees that the approximation errors can be reduced at a rate that is polynomial in the number of samples. This enables an adjustable trade-off between computation time and policy optimality: when a better policy is desired, we can simply increase the number of samples and accept a longer computation time.

Our algorithms handle continuous spaces. They compute a policy offline for fast online execution. These features widen the scope of POMDP planning from traditional task-level planning to real-time control of robotic systems. Since the computed policy is adaptive to the uncertainty in control and sensing, once computed, the policy can be executed many times for the robotic task without additional computation.

All of our algorithms assume a discrete action space. However, it is often relatively easy to manually specify a set of discrete actions for robotic tasks. The Monte Carlo simulations in these algorithms incur a high computational cost. Parallelizing the Monte Carlo simulations can improve the practical performance of these algorithms [Lee and Kim, 2013]. We could also consider reusing or caching the simulation results. Pruning the policy could also potentially improve the performance of our algorithms [Kurniawati et al., 2008; Pineau et al., 2003; Smith and Simmons, 2005].

Our algorithms have been applied to several different robotic tasks, including robotic exploration, aircraft collision avoidance, acrobot control with parameter learning, and autonomous vehicle navigation. Experimental results indicated several benefits of our algorithms over discrete POMDPs and previous work on continuous POMDPs. Compared with discrete POMDP algorithms, our algorithms simplify model construction and achieve better performance. For example, in the aircraft collision avoidance task, we directly construct continuous models and solve them using MCVI, whereas previous work required a thorough analysis of the problem domain to build a discrete model. Since the continuous models are more accurate, our 2D continuous-state model achieves a 2-3 times improvement in performance over the 2D discrete models. Our algorithms easily scale to high-dimensional spaces, which enables us to build models that capture the full flight dynamics in 3D space. This further improves the performance to more than 70 times over the 2D discrete models.

Previous work on continuous POMDPs often compromises between model expressiveness and solution optimality. Our algorithms retain the full expressiveness of continuous POMDPs while still achieving solution optimality comparable to algorithms for discrete or parametric models. The experimental results indicate that, on parametric models, our algorithms achieve the same level of performance as algorithms specialized for these models, such as continuous Perseus [Spaan and Vlassis, 2005].
On the other hand, previous algorithms such as MC-POMDP [Thrun, 2000a] and GENC-POMDP [Brechtel et al., 2013] allow the same expressiveness as our algorithms, but they produce policies of worse quality. Through theoretical analysis and empirical experiments, we have demonstrated that continuous POMDPs are powerful tools for handling uncertainty in robotic tasks. The algorithms developed in this thesis fully unleash the power of continuous POMDPs as a natural language for modeling robotic tasks under uncertainty.

In the past two decades, robotics has gradually become a useful technology in certain fields. With advances in mechanical, electronic, and computing technologies, it is inevitable that robots will become ubiquitous in our society. However, before we can see cars safely driving themselves on the road, robot arms cooperating with humans on assembly lines, and smart personal robots accompanying the elderly, technology has to be developed for robots to sense and adapt to the ever-changing world. As a promising step towards the realization of this technology, this thesis provides a systematic approach to using continuous POMDPs to handle uncertainty in robotic tasks.

Bibliography

[Asmuth et al., 2009] J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artificial Intelligence, pages 19–26, 2009.
[Bagnell et al., 2003] J. Bagnell, S. Kakade, A. Ng, and J. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems (NIPS), volume 16, Vancouver, B.C., 2003.
[Bai et al., 2011] H. Bai, D. Hsu, W. Lee, and V. Ngo. Monte Carlo value iteration for continuous-state POMDPs. In Algorithmic Foundations of Robotics IX — Proc. Int. Workshop on the Algorithmic Foundations of Robotics (WAFR), pages 175–191. Springer, 2011.
[Bandyopadhyay et al., 2012] T. Bandyopadhyay, K. Won, and E. Frazzoli. Intention-aware motion planning. In Workshop on Algorithmic Foundations of Robotics, 2012.
[Bandyopadhyay et al., 2013] T. Bandyopadhyay, C. Jie, and D. Hsu. Intention-aware pedestrian avoidance. Experimental Robotics, 88:963–977, 2013.
[Bellman and Kalaba, 1959] R. Bellman and R. Kalaba. On adaptive control processes. IRE Transactions on Automatic Control, 4(2):1–9, November 1959.
[Bellman, 1957] R. Bellman. Dynamic Programming. Princeton University Press, June 1957.
[Bertsekas, 2000] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, November 2000.
[Billingsley, 2006] T. B. Billingsley. Safety analysis of TCAS on Global Hawk using airspace encounter models. Master's thesis, Massachusetts Institute of Technology, 2006.
[Brechtel et al., 2013] S. Brechtel, T. Gindele, and R. Dillmann. Solving continuous POMDPs: Value iteration with incremental learning of an efficient space representation. In Proceedings of the 30th International Conference on Machine Learning, volume 28, 2013.
[Brooks et al., 2006] A. Brooks, A. Makarenko, S. Williams, and H. Durrant-Whyte. Parametric POMDPs for planning in continuous state spaces. Robotics and Autonomous Systems, 54(11):887–897, November 2006.
[Brunskill et al., 2008] E. Brunskill, L. Kaelbling, T. Lozano-Perez, and N. Roy. Continuous-state POMDPs with hybrid dynamics. In Proceedings of the Tenth International Symposium on Artificial Intelligence and Mathematics (ISAIM), Fort Lauderdale, FL, January 2008.
[Chaudhari et al., 2013] P. Chaudhari, S. Karaman, D. Hsu, and E. Frazzoli. Sampling-based algorithms for continuous-time POMDPs. In American Control Conference, pages 4604–4610. IEEE, 2013.
[Cheng, 1988] H.-T. Cheng. Algorithms for partially observable Markov decision processes. PhD thesis, University of British Columbia, 1988.
[Choset et al., 2005] H. Choset, K. M. Lynch, S. Hutchinson, G. Kantor, W. Burgard, L. E. Kavraki, and S. Thrun. Principles of Robot Motion: Theory, Algorithms, and Implementations, chapter 7. The MIT Press, 2005.
[Crisan and Doucet, 2002] D. Crisan and A. Doucet. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50, 2002.
[Duff, 2002] M. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts Amherst, 2002.
[Dutta Roy et al., 2004] S. Dutta Roy, S. Chaudhury, and S. Banerjee. Active recognition through next view planning: a survey. Pattern Recognition, 37:429–446, 2004.
[Erdmann and Mason, 1986] M. A. Erdmann and M. T. Mason. An exploration of sensorless manipulation. IEEE Journal on Robotics and Automation, 3:369–379, 1986.
[Feldbaum, 1965] A. A. Feldbaum. Optimal Control Systems. Academic Press, 1965.
[Filatov and Unbehauen, 2008] N. M. Filatov and H. Unbehauen. Adaptive Dual Control: Theory and Applications. Springer, 2008.
[Hansen, 1998] E. A. Hansen. Solving POMDPs by searching in policy space. In Proceedings of the Fourteenth International Conference on Uncertainty in Artificial Intelligence, pages 211–219, 1998.
[Hauskrecht, 2000] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13(1):33–94, 2000.
[Haussler, 1992] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, September 1992.
[Hoeffding, 1963] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
[Hoey and Poupart, 2005a] J. Hoey and P. Poupart. Solving POMDPs with continuous or large discrete observation spaces. In Proc. Int. Joint Conference on Artificial Intelligence, pages 1332–1338, 2005.
[Hoey and Poupart, 2005b] J. Hoey and P. Poupart. Solving POMDPs with continuous or large discrete observation spaces. In Proc. Int. Joint Conference on Artificial Intelligence, pages 1332–1338, 2005.
[Hsiao et al., 2007] K. Hsiao, L. P. Kaelbling, and T. Lozano-Perez. Grasping POMDPs. In Proceedings of the IEEE International Conference on Robotics & Automation, pages 4485–4692, 2007.
[Hsu et al., 2007] D. Hsu, W. S. Lee, and N. Rong. What makes some POMDP problems easy to approximate? In Advances in Neural Information Processing Systems (NIPS), 2007.
[Ji et al., 2007] S. Ji, R. Parr, H. Li, X. Liao, and L. Carin. Point-based policy iteration. In Proc. of the 22nd National Conference on Artificial Intelligence, 2007.
[Kaelbling and Lozano-Perez, 2011a] L. P. Kaelbling and T. Lozano-Perez. Hierarchical task and motion planning in the now. In 2011 IEEE International Conference on Robotics and Automation, pages 1470–1477, May 2011.
[Kaelbling and Lozano-Pérez, 2011b] L. Kaelbling and T. Lozano-Pérez. Pre-image backchaining in belief space for mobile manipulation. In International Symposium on ..., pages 1–16, 2011.
[Kaelbling et al., 1996] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[Kaelbling et al., 1998] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, May 1998.
[Kaelbling, 1993] L. P. Kaelbling. Learning in Embedded Systems. The MIT Press, May 1993.
[Kaplow, 2010] R. Kaplow. Point-based POMDP solvers: Survey and comparative analysis. PhD thesis, McGill University, 2010.
[Karaman and Frazzoli, 2010] S. Karaman and E. Frazzoli. Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 30:20, 2010.
[Kavraki et al., 1996] L. Kavraki, P. Svestka, J. Latombe, and M. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580, 1996.
[Kearns et al., 1999] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. 1999.
[Kochenderfer et al., 2008] M. J. Kochenderfer, L. P. Espindle, J. K. Kuchar, and J. D. Griffith. Correlated encounter model for cooperative aircraft in the national airspace system. Project Report ATC-344, MIT Lincoln Laboratory, Lexington, MA, 2008.
[Kolter and Ng, 2009] J. Z. Kolter and A. Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1–8, 2009.
[Kuchar and Drumm, 2007] J. K. Kuchar and A. C. Drumm. The traffic alert and collision avoidance system. Lincoln Laboratory Journal, 16(2):277, 2007.
[Kuchar, 2005] J. Kuchar. Safety analysis methodology for unmanned aerial vehicle (UAV) collision avoidance systems. In USA/Europe Air Traffic Management R&D Seminars, Baltimore, MD, 2005.
[Kurniawati et al., 2008] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proc. Robotics: Science and Systems, 2008.
[Latombe, 1991] J.-C. Latombe. Robot Motion Planning. Springer, 1991.
[Le Menec and Pop-Stefanov, 2013] S. Le Menec and B. Pop-Stefanov. Particle guidance: applying POMDPs to the optimization of mid-course guidance laws for long-range missiles. In Automatic Control in Aerospace, volume 19, pages 295–300, 2013.
[Lee and Kim, 2013] T. Lee and Y. J. Kim. GPU-based motion planning under uncertainties using POMDP. In Proc. IEEE Int. Conf. on Robotics & Automation, pages 4576–4581. IEEE, May 2013.
[Lim et al., 2011] Z. Lim, D. Hsu, and W. Lee. Monte Carlo value iteration with macro-actions. In Advances in Neural Information Processing Systems (NIPS), 2011.
[Littman et al., 1995] M. Littman, A. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Proceedings of the 12th International Conference on Machine Learning, 1995.
[Littman, 1996] M. L. Littman. Algorithms for sequential decision making. PhD thesis, Brown University, 1996.
[Lozano-Perez et al., 1984] T. Lozano-Perez, M. T. Mason, and R. H. Taylor. Automatic synthesis of fine-motion strategies for robots. The International Journal of Robotics Research, 3:3–24, 1984.
[Makarenko et al., 2002] A. A. Makarenko, S. B. Williams, F. Bourgault, and H. F. Durrant-Whyte. An experiment in integrated exploration. IEEE/RSJ International Conference on Intelligent Robots and Systems, 1:534–539, 2002.
[Ng and Jordan, 2000] A. Y. Ng and M. Jordan. PEGASUS: a policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 406–415, 2000.
[Papadimitriou and Tsitsiklis, 1987a] C. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, August 1987.
[Papadimitriou and Tsitsiklis, 1987b] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, August 1987.
[Pineau et al., 2003] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, 2003.
[Platt Jr et al., 2010a] R. Platt Jr, R. Tedrake, L. Kaelbling, and T. Lozano-Perez. Belief space planning assuming maximum likelihood observations. 2010.
[Platt Jr et al., 2010b] R. Platt Jr, R. Tedrake, L. Kaelbling, and T. Lozano-Perez. Belief space planning assuming maximum likelihood observations. Robotics: Science and Systems (RSS), 2010.
[Porta et al., 2006] J. M. Porta, N. Vlassis, M. T. Spaan, and P. Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329–2367, 2006.
[Poupart et al., 2006] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006.
[Ross and Chaib-Draa, 2007] S. Ross and B. Chaib-Draa. AEMS: an anytime online search algorithm for approximate policy refinement in large POMDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2592–2598, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[Ross et al., 2008] S. Ross, J. Pineau, S. Paquet, and B. Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32(1):663–704, 2008.
[Roy and Thrun, 1999] N. Roy and S. Thrun. Coastal navigation with mobile robots. Advances in Neural Information Processing Systems, 12:1043–1049, 1999.
[Shani et al., 2007] G. Shani, R. I. Brafman, and S. E. Shimony. Forward search value iteration for POMDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2619–2624, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[Silver and Veness, 2010] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems (NIPS), 2010.
[Smart and Kaelbling, 2000] W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In International Conference on Machine Learning, pages 903–910, 2000.
[Smith and Simmons, 2004] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proc. Int. Conf. on Uncertainty in Artificial Intelligence (UAI), pages 520–527, Banff, Canada, 2004. AUAI Press.
[Smith and Simmons, 2005] T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proc. Uncertainty in Artificial Intelligence, 2005.
[Somani et al., 2013] A. Somani, N. Ye, D. Hsu, and W. S. Lee. Online POMDP planning with regularization. In Advances in Neural Information Processing Systems, 2013.
[Sondik, 1971] E. J. Sondik. The optimal control of partially observable Markov processes. PhD thesis, Stanford University, 1971.
[Spaan and Vlassis, 2004] M. T. J. Spaan and N. Vlassis. A point-based POMDP algorithm for robot planning. In Proc. IEEE Int. Conf. on Robotics & Automation, 2004.
[Spaan and Vlassis, 2005] M. T. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.
[Spong, 1995] M. Spong. The swing up control problem for the Acrobot. IEEE Control Systems Magazine, 15(1):49–55, 1995.
[Sutton and Barto, 1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[Temizer et al., 2009] S. Temizer, M. J. Kochenderfer, L. P. Kaelbling, T. Lozano-Pérez, and J. K. Kuchar. Unmanned aircraft collision avoidance using partially observable Markov decision processes. Project Report ATC-356, MIT Lincoln Laboratory, Advanced Concepts Program, Lexington, Massachusetts, USA, September 2009.
[Temizer et al., 2010] S. Temizer, M. J. Kochenderfer, L. P. Kaelbling, T. Lozano-Pérez, and J. K. Kuchar. Collision avoidance for unmanned aircraft using Markov decision processes. In Proceedings of the American Institute of Aeronautics and Astronautics (AIAA) Guidance, Navigation, and Control Conference, Sheraton Centre Toronto, Toronto, Ontario, Canada, August 2010.
[Thrun et al., 2001] S. Thrun, D. Fox, W. Burgard, and F. Dellaert. Robust Monte Carlo localization for mobile robots. Artificial Intelligence, 128:99–141, 2001.
[Thrun et al., 2005] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. The MIT Press, September 2005.
[Thrun, 2000a] S. Thrun. Monte Carlo POMDPs. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1064–1070. MIT Press, 2000.
[Thrun, 2000b] S. Thrun. Monte Carlo POMDPs. Advances in Neural Information Processing Systems 12, 12:1064–1070, 2000.
[Thrun, 2002] S. Thrun. Particle filters in robotics. In Proceedings of the 17th Annual Conference on Uncertainty in AI (UAI), volume 1, pages 511–518, 2002.
[van den Berg et al., 2011] J. van den Berg, P. Abbeel, and K. Goldberg. LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information. The International Journal of Robotics Research, 30(7):895–913, June 2011.
[Wang et al., 2012] Y. Wang, K. Won, D. Hsu, and W. Lee. Monte Carlo Bayesian reinforcement learning. In Int. Conf. on Machine Learning, 2012.
[Wang et al., 2013] H. Wang, H. Kurniawati, S. Singh, and M. Srinivasan. Animal locomotion in silico: a POMDP-based tool to study mid-air collision avoidance strategies in flying animals. In Australasian Conference on Robotics and Automation (ACRA), pages 1–8. Australian Robotics and Automation Association, 2013.

[...] approaches for continuous POMDPs, other related uncertainty planning approaches, and robotic tasks modelled as POMDPs. Our algorithms are designed upon the foundation of point-based value iteration algorithms, and also share many ideas with other continuous POMDP algorithms. Chapter 3 presents Monte Carlo value iteration (MCVI), our algorithm for solving continuous-state POMDPs. The [...]

[...] avoids the information loss. [Figure 2.2: Model expressiveness and solution optimality of POMDP planning algorithms.]

2.2.3 Continuous POMDPs
Continuous POMDP algorithms address the problems of handling continuous states, continuous observations, and continuous actions. Our work focuses on continuous states and observations. Continuous action space is a less significant issue in robotics, [...]
These sensor readings form a high-dimensional continuous observation space, which is often discretized using feature extraction and quantization. Discretizing the observation space suffers from the same information-loss issue as discretizing states. Discrete POMDP algorithms are the foundation for our work on continuous POMDPs. Compared with discrete POMDPs, directly solving continuous POMDPs simplifies model [...]

[...] expressing complex non-linear dynamics, such as the dynamics of cars, aircraft, and robot arms. Therefore, POMDPs are suitable for a broad range of robotic tasks. The challenge of applying POMDPs to robotic tasks has two aspects.
The first is model design [...]

[...] collision avoidance using continuous-state POMDPs. In Proc. Robotics: Science and Systems, 2011. [5] Haoyu Bai, David Hsu, Wee Sun Lee, and Vien Ngo. Monte Carlo value iteration for continuous-state POMDPs. In [...]
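As a minimal illustration of modeling a robotic task directly in continuous spaces, a continuous-state POMDP can be specified generatively: a sampler for the initial state, a stochastic transition function, a stochastic observation function, and a reward function. The sketch below is hypothetical (the task, dynamics, noise levels, and all names are made up for illustration, not taken from the thesis); sampling-based planners such as MCVI interact with a model only through simulation, so an interface of this kind is all they require.

```python
import random
from dataclasses import dataclass


@dataclass
class State:
    position: float   # 1-D robot position (continuous)
    velocity: float   # 1-D robot velocity (continuous)


class Continuous1DNavigationPOMDP:
    """Hypothetical 1-D navigation task with noisy motion and a noisy range sensor."""

    actions = (-1.0, 0.0, 1.0)           # a small discrete set of accelerations
    dt, goal, discount = 0.1, 5.0, 0.95

    def initial_state(self):
        return State(position=random.gauss(0.0, 1.0), velocity=0.0)

    def transition(self, s, a):
        # Simple stochastic dynamics: integrate acceleration with Gaussian noise.
        v = s.velocity + a * self.dt + random.gauss(0.0, 0.05)
        x = s.position + v * self.dt
        return State(position=x, velocity=v)

    def observe(self, s):
        # Continuous observation: noisy distance to the goal.
        return abs(self.goal - s.position) + random.gauss(0.0, 0.2)

    def reward(self, s, a):
        return 1.0 if abs(self.goal - s.position) < 0.5 else -0.01

    def simulate(self, s, a):
        """One generative step: (state, action) -> (next state, observation, reward)."""
        s_next = self.transition(s, a)
        return s_next, self.observe(s_next), self.reward(s_next, a)
```

A Monte Carlo planner estimates the value of a candidate policy by repeatedly calling `simulate` from sampled start states, without ever enumerating the state or observation space.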
