
Advanced Machine Perception Model for Activity-Aware Application


National Chung Cheng University
Department of Electrical Engineering

ADVANCED MACHINE PERCEPTION MODEL FOR ACTIVITY-AWARE APPLICATION

Student: Manh-Hung Ha
Advisor: Professor Oscal Tzu-Chiang Chen
June 2021

Acknowledgments

It would have been impossible to work on this thesis without the support of many people, all of whom have my deep and sincere gratitude.

First of all, I would like to thank Professor Oscal T.-C. Chen, my academic supervisor, for having taken me on as a PhD student and for having trusted me for years. I could not have found a better adviser than Professor Chen, who gave me ideas, advice, and motivation and, above all, let me pursue my thoughts freely. He was an excellent theoretical point of reference and was critical in testing my theories. He continues to impress me with his systematic style, compassion, and humility, and, above all, he pushed me to become a better researcher as well as a better person. Thank you for introducing me to the world of computer vision and research, and for taking me on as a young researcher in my formative years.

Thank you to my thesis committee, Prof. Wei-Yang Lin, Prof. Oscal T.-C. Chen, Prof. Wen-Nung Lie, Prof. Sung-Nien Yu, and Prof. Rachel Chiang, for their time, kind advice, and feedback on my thesis. You have taught me a lot and guided me in improving the quality of my research.

I was incredibly happy to have amazing mentors at CCU over the years. I am grateful to Professor Wen-Nung Lie and Professor Gerald Rau, who helped me learn the ropes in the first few years. I am thankful to Professor Alan Liu, Professor Sung-Nien Yu, Professor Norbert Michael Mayer, and Professor Rachel Chiang for taking me through my academic courses, helping me expand my field of study, and teaching me to be a systematic experimentalist.

I would like to express my gratitude to all of my friends and colleagues in the Department of Electrical Engineering. There were great staff members who taught me a great deal during my PhD. I did not have as many chances as I wanted to engage with them, but I was always motivated by their ideas and work. Thanks also to other great researchers in my field, including Hoang Tran, Hung Nguyen, Luan Tran, and many others; I thank them for working with me and hope there will be even more opportunities to learn from them in the future.

I would like to thank my many laboratory partners in the VLSI DSP group, including Wei-Chih Lai, Ching-Han Tsai, Yi-Lun Lee, Yen-Cheng Lu, Han-Wen Liu, Yu-Ting Liu, and others. I learned many things from each of them, and their support and guidance helped me overcome some stressful times. Thank you for sharing your thoughts, documentation, datasets, and code. None of this work would have been possible without my many colleagues working in computer vision and machine learning, and I want to thank my many amazing co-authors as well. I was also incredibly lucky to have made many friends during my time at CCU. Thanks to the badminton team in the EE department and the VSACCU badminton team for showing me all the awesome activities around the PhD journey!
Finally, none of this would have been possible without the valuable support, encouragement, love, and patience of my family, especially my parents Ha Sinh Hai and Tran Thi Theu. From my first years as a student, they prepared me for this by showing me the importance of hard work, critical thinking, and patience. I thank them for their support, trust, and innumerable sacrifices; I have worked on this study thousands of kilometers from home and missed many significant life events. And speaking of love and encouragement, I am grateful to my wife Nguyen Hong Quyen and my daughter Ha Hai Ngan for all our wonderful years together and many well-spent weekends and vacations, and above all for staying by my side even when a long distance was between us. Thank you for making me who I am today and for always being there and trusting me.

Abstract

People are one of the most important entities that computer vision systems need to understand in order to be useful and omnipresent in various applications. Much of this awareness rests on recognizing human activities, for instance in homecare systems that observe and support elderly people. Human beings do this well on their own: we look at others and describe each action in detail. Moreover, we can reason about those actions over time and even predict possible future actions. Computer vision algorithms, on the other hand, have lagged well behind this challenge. In this study, my research aim is to create learning models that can automatically induce representations of human activities, especially their structure and feature meaning, in order to solve several higher-level action tasks and approach a context-aware engine for various kinds of action recognition.

In this dissertation, we explore techniques to improve human action understanding from video inputs, which are common in daily settings such as surveillance, traffic, education, movies, and sports, on challenging large-scale benchmark datasets and our own panoramic video dataset. This dissertation targets the action recognition and action detection of humans in videos. The most important insight is that actions depend on global features parameterized by the scene, objects, and other subjects, apart from their own local features parameterized by body-pose characteristics. Additionally, modeling temporal features with optical flow from the motions of people and objects in the scene can further help in recognizing human actions. These dependencies are exploited in five key folds: (1) detecting moving subjects using a background subtraction scheme, tracking the extracted subjects using the Kalman filter, and using handcrafted features to perform classification via traditional machine learning (GMM, SVM); (2) developing a computation-affordable recognition system with a lightweight model capable of learning on a portable device; (3) using capsule networks and skeleton-based map generation to attend to the subjects and to build their correlation and attention context; (4) exploring an integrated action recognition model based on the correlations and attention of subjects and the scene; (5) developing systems based on the refined highway aggregating model.
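As a concrete illustration of fold (1), the sketch below, assuming OpenCV and NumPy, shows how a background subtractor, a Kalman filter over the blob centroid, and handcrafted blob features fit together. It is a minimal sketch of the general technique, not the dissertation's implementation; the input file name, noise covariances, and feature choices are illustrative assumptions.

```python
# Minimal sketch of fold (1): background subtraction to find moving subjects,
# a Kalman filter to track the extracted blob, and handcrafted features that
# would feed a traditional classifier (e.g., GMM or SVM). Illustrative only.
import cv2
import numpy as np

# MOG2 background subtractor separates moving foreground from the static scene.
bg_sub = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

# Constant-velocity Kalman filter tracking the blob centroid state (x, y, vx, vy).
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
kf.errorCovPost = np.eye(4, dtype=np.float32)

def handcrafted_features(contour):
    """Simple shape features of a foreground blob: area, aspect ratio, extent."""
    x, y, w, h = cv2.boundingRect(contour)
    area = cv2.contourArea(contour)
    return [area, w / max(h, 1), area / max(w * h, 1)]

cap = cv2.VideoCapture("room_camera.mp4")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg_sub.apply(frame)
    # Morphological opening removes small foreground speckles.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        c = max(contours, key=cv2.contourArea)  # largest moving blob
        x, y, w, h = cv2.boundingRect(c)
        kf.correct(np.array([[x + w / 2], [y + h / 2]], np.float32))
        feats = handcrafted_features(c)
        # feats would be passed to a pre-trained GMM or SVM to label the activity.
    predicted = kf.predict()  # smoothed estimate of the next centroid position
cap.release()
```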
In summary, this dissertation presents several novel and significant solutions for efficient DNN architecture analysis, acquisition, and distribution on large-scale video data. We show that DNNs using multiple streams, a combined model, a hybrid structure on conditional context, feature input representations, global features, local features, spatiotemporal attention, and the modified-belief CapsNet efficiently achieve high-quality results. The consistent improvements contributed by these components of our DNNs yield state-of-the-art results on popularly used datasets. Furthermore, we observe that the largest improvements are indeed achieved in action classes involving human-to-human and human-to-object interactions, and visualizations of our network show that it focuses on scene context that is intuitively relevant to action recognition.

Keywords: attention mechanism, activity recognition, action detection, deep neural network, convolutional neural network, recurrent neural network, capsule network, spatiotemporal attention, skeleton
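To make the spatiotemporal attention component named above concrete, here is a minimal sketch, assuming PyTorch: spatial attention reweights locations within each frame, and temporal attention reweights frames within a clip. The layer sizes, class name, and pooling scheme are illustrative assumptions, not the dissertation's actual architecture.

```python
# Generic spatiotemporal attention over clip features (illustrative sketch).
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv scores each spatial location of a frame's feature map.
        self.spatial_score = nn.Conv2d(channels, 1, kernel_size=1)
        # Linear layer scores each frame's pooled descriptor over time.
        self.temporal_score = nn.Linear(channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)

        # Spatial attention: softmax over the H*W positions of each frame.
        s = self.spatial_score(frames).reshape(b * t, h * w)
        s = torch.softmax(s, dim=-1).reshape(b * t, 1, h, w)
        pooled = (frames * s).sum(dim=(2, 3)).reshape(b, t, c)

        # Temporal attention: softmax over the t frames of the clip.
        a = torch.softmax(self.temporal_score(pooled).squeeze(-1), dim=-1)
        return (pooled * a.unsqueeze(-1)).sum(dim=1)  # (batch, channels)

clip_features = torch.randn(2, 8, 256, 7, 7)             # toy clip features
attended = SpatioTemporalAttention(256)(clip_features)   # -> shape (2, 256)
```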
TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
I INTRODUCTION
II MULTI-MODAL MACHINE LEARNING APPROACHES LOCALLY ON SINGLE OR MULTIPLE SUBJECTS FOR HOMECARE
  2.1 Introduction
  2.2 Technical Approach
    2.2.1 Handcrafted feature extraction by local body-subject estimation
    2.2.2 Proposed action recognition on a single subject
    2.2.3 Proposed action recognition on multiple subjects
  2.3 Experimental Results and Discussion
    2.3.1 Effectiveness of our proposal for single-subject action recognition
    2.3.2 Effectiveness of our proposal for multiple-subject action recognition
  2.4 Summary and Discussion
III ACTION RECOGNITION USING A LIGHTWEIGHT MODEL
  3.1 Introduction
  3.2 Related Work
  3.3 Action Recognition by a Lightweight Model
  3.4 Experiments and Results
  3.5 Summary and Discussion
IV ATTENTIVE RECOGNITION LOCALLY, GLOBALLY, TEMPORALLY, USING DNN AND CAPSNET
  4.1 Introduction
  4.2 Related Previous Work
    4.2.1 Diverse Spatio-Temporal Feature Generation
    4.2.2 Capsule Neural Network
  4.3 Proposed DNNs for Action Recognition
    4.3.1 Proposed Generic DNN with Spatiotemporal Attentions
    4.3.2 Proposed CapsNet-Based DNNs
  4.4 Experiments and Comparisons of the Proposed DNNs
    4.4.1 Datasets and Parameter Setup for Simulations
    4.4.2 Analyses and Comparisons of Experimental Results
    4.4.3 Analyses of Computational Time and Cost
    4.4.4 Visualization
  4.5 Summary and Discussion
V ACTION RECOGNITION ENHANCED BY CORRELATIONS AND ATTENTION OF SUBJECTS AND SCENE
  5.1 Introduction
  5.2 Related Work
  5.3 Proposed DNN
    5.3.1 Projection of SBB to ERB in the Feature Domain
    5.3.2 Map Convolutional Fused-Depth Layer
    5.3.3 Attention Mechanisms in SA and TE Layers
    5.3.4 Generation of Subject Feature Maps
  5.4 Experiments and Discussion
    5.4.1 Datasets, Parameter Setup, and Implementation Details
    5.4.2 Analyses and Comparisons of Experimental Results
  5.5 Summary and Discussion
VI SPATIO-TEMPORALLY WITH AND WITHOUT LOCALIZATION ON MULTIPLE LABELS FOR ACTION PERCEPTION, USING VIDEO CONTEXT
  6.1 Introduction
  6.2 Related Work
    6.2.1 Action Recognition with DNNs
    6.2.2 Attention Mechanisms
    6.2.3 Bounding-Box Detectors for Action Detection
  6.3 Proposed Methodology
    6.3.1 Action Refined-Highway Network
    6.3.2 Action Detection
    6.3.3 End-to-End Network Architecture for Action Detection
  6.4 Experimental Results and Discussion
    6.4.1 Datasets
    6.4.2 Implementation Details
    6.4.3 Ablation Studies
  6.5 Summary and Discussion
VII CONCLUSION AND FUTURE WORK
REFERENCES
APPENDIX A

LIST OF FIGURES

2.1 Schematic diagram of height estimation
2.2 Distance estimation in situation (I)
2.3 Distance estimation in situation (II)
2.4 Distance curve pattern of the measurement of the standing subject
2.5 Flow chart of the proposed action recognition system
2.6 Flowchart of TV on/off detection
2.7 Proposed activity recognition system
2.8 Example of shape-BLOB generation from the foreground
2.9 Illustration of tracking by the Kalman filter method
2.10 Proposed FSM
2.11 Estimates of activity states in the overlapping interval
2.12 Proposed incremental majority voting
2.13 Room layout and experiment scenario
2.14 Examples of five activities recorded from the panoramic camera
2.15 Total accuracy rate (A) versus p and r
2.16 Total accuracy (A) versus p at r = 0.001, 0.01, 0.05, 0.1, and 0.2
3.1 Proposed recognition system
3.2 Functional blocks of the proposed MoBiG
3.3 Proposed finite state machine
3.4 Proposed incremental majority voting
3.5 Confusion matrix of MoBiG identifying four activities
4.1 CapsNets integrated in a generic DNN
4.2 Block diagram of the proposed generic DNN
4.3 Three skeleton channels
4.4 One example of the transformed skeleton maps from an input segment
4.5 Block diagrams of the proposed AJA and AJM
4.6 Block diagram of the proposed A_RNN
4.7 Proposed capsule network for TC_DNN and MC_DNN
4.8 Block diagrams of the proposed CapsNet-based DNNs
4.9 Examples of the panoramic videos of 12 actions, where subjects are marked by red rectangular dash-line boxes for observation only
4.10 Visualization of the outputs from the intermediate layers of the proposed TC_DNN
4.11 Visualization of the outputs from the intermediate layers of two A_RNNs
5.1 Block diagram of the proposed deep neural network
5.2 Block diagram of the SA generation layer
5.3 Comparison of the performance of the AFS and ROS streams for each action class
5.4 JHMDB21 confusion matrix
6.1 Refined highway block for 3D attention
6.2 Overview of the proposed architecture for action recognition and detection
6.3 Temporal bilinear inception module
6.4 RH block in RNN structures, such as the standard RNN, LSTM, GRU, and variant RNNs
6.5 Schematic of recurrent 3D Refined-Highway depth with three RH blocks
6.6 3DConvDM layer correlating the feature map X
6.7 Details of the GAA module
6.8 Per-category mAP on AVA
6.9 Possible locations associated with proposal regions of a subject
6.10 Visualization of the R_FSRH module on the UCF101-24 validation set
