National Chung Cheng University
Department of Electrical Engineering

ADVANCED MACHINE PERCEPTION MODEL FOR ACTIVITY-AWARE APPLICATION

Student: Manh-Hung Ha
Advisor: Professor Oscal Tzu-Chiang Chen

June 2021

Acknowledgments

It would have been impossible to work on a thesis without the support of many people, who all have my deep and sincere gratitude.

First of all, I would like to thank Professor Oscal T.-C. Chen, my academic supervisor, for having taken me on as a PhD student and for having trusted me for years. I could not find any better adviser than Professor Chen, who gave me ideas, advice, and motivation and, above all, let me pursue my thoughts freely. He was an excellent theoretical point of reference and critical to testing my theories. He continues to impress me with his systematic style, compassion, and humility, and he has pushed me, above all, to become a better researcher as well as a better person. Thank you for introducing me to the world of computer vision and research and for taking me on as a young researcher in my formative years.

Thank you to my thesis committee, Prof. Wei-Yang Lin, Prof. Oscal T.-C. Chen, Prof. Wen-Nung Lie, Prof. Sung-Nien Yu, and Prof. Rachel Chiang, for their time, kind advice, and feedback on my thesis. You have taught me a lot and guided me in improving the quality of my research.

I was incredibly happy to have amazing mentors at CCU over the years. I am grateful to Professor Wen-Nung Lie and Professor Gerald Rau, who helped me learn the ropes in the first few years. I am thankful to Professor Alan Liu, Professor Sung-Nien Yu, Professor Norbert Michael Mayer, and Professor Rachel Chiang for taking me through my academic courses, helping me to expand my field of study, and teaching me to be a systematic experimentalist.

I would like to express my gratitude to all of my friends and colleagues in the Department of Electrical Engineering. The staff were great and taught me a great deal during my Ph.D. I did not have as many chances as I wanted to engage with them, but I was always motivated by their ideas and work. Also, thanks to other great researchers in my field: Hoang Tran, Hung Nguyen, Luan Tran, and many others; I thank them for working with me, and I hope that there are even more opportunities to learn from them in the future.

I would like to thank my many laboratory partners in the VLSI DSP group, including Wei-Chih Lai, Ching-Han Tsai, Yi-Lun Lee, Yen-Cheng Lu, Han-Wen Liu, Yu-Ting Liu, and others. I learned many things from each of them, and their support and guidance helped me overcome some stressful times. Finally, thank you for sharing your thoughts, documentation, datasets, and code. None of this work would have been possible without my many colleagues working in computer vision and machine learning, and I also want to thank my many amazing co-authors. I was also incredibly lucky to have made many friends during my time at CCU. Thanks as well to the badminton team in the EE department and the VSACCU badminton team, who showed me all the awesome activities around the PhD journey!
Finally, this work would not have been possible without the valuable support, encouragement, love, and patience of my family, especially my parents Ha Sinh Hai and Tran Thi Theu. From my first years as a student, they prepared me for this by showing me the importance of hard work, critical thinking, and patience. I thank them for their support, trust, and innumerable sacrifices; I have worked on this study thousands of kilometers from home and missed many significant life events. And speaking of love and encouragement, I am grateful to my wife Nguyen Hong Quyen and my daughter Ha Hai Ngan for all our many wonderful years and many well-spent weekends and vacations, and above all for staying by my side even when a long distance was between us. Thank you for making me who I am today and for having always been there and trusted me.

Abstract

People are one of the most important entities that computer vision systems need to understand in order to be useful and omnipresent in various applications. Much of this awareness is based on the recognition of human activities, for example in homecare systems which observe and support elderly people. Human beings do this well on their own: we look at others and describe each action in detail. Moreover, we can reason about those actions over time and even predict possible actions in the future. Computer vision algorithms, on the other hand, have been well behind this challenge. In this study, my research aim is to create learning models which can automatically induce representations of human activities, especially their structure and feature meaning, in order to solve several higher-level action tasks and approach a context-aware engine for various action recognition applications.

In this dissertation, we explore techniques to improve human action understanding from video inputs which are common in daily settings such as surveillance, traffic, education, movies, and sports, using challenging large-scale benchmark datasets and our own panoramic video dataset. This dissertation targets the action recognition and action detection of humans in videos. The most important insight is that actions depend on global features parameterized by the scene, objects, and other subjects, apart from their own local features parameterized by body pose characteristics. Additionally, modeling temporal features by optical flow from the motions of people and objects in the scene can further help in recognizing human actions. These dependencies are exploited in five key efforts: (1) detecting moving subjects using a background subtraction scheme, tracking the extracted subjects using the Kalman filter, and using handcrafted features to perform classification via traditional machine learning (GMM, SVM); (2) developing a computation-affordable recognition system with a lightweight model capable of learning on a portable device; (3) using capsule networks and skeleton-based map generation to attend to the subjects and to build their correlation and attention context; (4) exploring an integrated action recognition model based on correlations and attention of subjects and scene; and (5) developing systems based on the refined highway aggregating model.

In summary, this dissertation presents several novel and significant solutions for efficient DNN architecture analysis, acquisition, and distribution on large-scale video data. We show that DNNs using multiple streams, a combined model, a hybrid structure on conditional context, feature input representation, global features, local features, spatiotemporal attention, and the modified belief CapsNet have efficiently
achieved high-quality results. The consistent improvements from these components of our DNNs allow us to achieve state-of-the-art results on popularly used datasets. Furthermore, we observe that the largest improvements are indeed achieved in action classes involving human-to-human and human-to-object interactions, and visualizations of our network show that it focuses on scene context that is intuitively relevant to action recognition.

Keywords: attention mechanism, activity recognition, action detection, deep neural network, convolutional neural network, recurrent neural network, capsule network, spatiotemporal attention, skeleton

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
I INTRODUCTION
II MULTI-MODAL MACHINE LEARNING APPROACHES LOCALLY ON SINGLE OR MULTIPLE SUBJECTS FOR HOMECARE
  2.1 Introduction
  2.2 Technical Approach
    2.2.1 Handcrafted feature extraction by local body subject estimation
    2.2.2 Proposed action recognition on a single subject
    2.2.3 Proposed action recognition on multiple subjects
  2.3 Experiment Results and Discussion
    2.3.1 Effectiveness of our proposal for single-subject action recognition
    2.3.2 Effectiveness of our proposal for multiple-subject action recognition
  2.4 Summary and Discussion
III ACTION RECOGNITION USING A LIGHTWEIGHT MODEL
  3.1 Introduction
  3.2 Related Work
  3.3 Action recognition by a lightweight model
  3.4 Experiments and Results
  3.5 Summary and Discussion
IV ATTENTIVE RECOGNITION LOCALLY, GLOBALLY, TEMPORALLY, USING DNN AND CAPSNET
  4.1 Introduction
  4.2 Related Previous Work
    4.2.1 Diverse Spatio-Temporal Feature Generation
    4.2.2 Capsule Neural Network
  4.3 Proposed DNNs for Action Recognition
    4.3.1 Proposed Generic DNN with Spatiotemporal Attentions
    4.3.2 Proposed CapsNet-Based DNNs
  4.4 Experiments and Comparisons of the Proposed DNNs
    4.4.1 Datasets and Parameter Setup for Simulations
    4.4.2 Analyses and Comparisons of Experimental Results
    4.4.3 Analyses of Computational Time and Cost
    4.4.4 Visualization
  4.5 Summary and Discussion
V ACTION RECOGNITION ENHANCED BY CORRELATIONS AND ATTENTION OF SUBJECTS AND SCENE
  5.1 Introduction
  5.2 Related Work
  5.3 Proposed DNN
    5.3.1 Projection of SBB to ERB in the Feature Domain
    5.3.2 Map Convolutional Fused-Depth Layer
    5.3.3 Attention Mechanisms in SA and TE Layers
    5.3.4 Generation of Subject Feature Maps
  5.4 Experiments and Discussion
    5.4.1 Datasets and Parameter Setup for Implementation Details
    5.4.2 Analyses and Comparisons of Experimental Results
  5.5 Summary and Discussion
VI SPATIO-TEMPORALLY WITH AND WITHOUT LOCALIZATION ON MULTIPLE LABELS FOR ACTION PERCEPTION, USING VIDEO CONTEXT
  6.1 Introduction
  6.2 Related Work
    6.2.1 Action Recognition with DNNs
    6.2.2 Attention Mechanisms
    6.2.3 Bounding Box Detectors for Action Detection
  6.3 Proposed Methodology
    6.3.1 Action Refined-Highway Network
    6.3.2 Action Detection
    6.3.3 End-to-End Network Architecture on Action Detection
  6.4 Experimental Results and Discussion
    6.4.1 Datasets
    6.4.2 Implementation Details
    6.4.3 Ablation Studies
  6.5 Summary and Discussion
VII CONCLUSION AND FUTURE WORK
REFERENCES
APPENDIX A

LIST OF FIGURES

2.1 Schematic diagram of height estimation
2.2 Distance estimation at situation (I)
2.3 Estimated distance at situation (II)
2.4 Distance curve pattern of the measure of the standing subject
2.5 Proposed flow chart of our action recognition system
2.6 Flowchart of TV on/off detection
2.7 Proposed activity recognition system
2.8 Example of shape BLOB generation from the foreground
2.9 Illustration of tracking by the Kalman filter method
2.10 Proposed FSM
2.11 Estimates of activity states in the overlapping interval
2.12 Proposed incremental majority voting
2.13 Room layout and experiment scenario
2.14 Examples of five activities recorded from the panoramic camera
2.15 Total accuracy rate (A) versus p and r
2.16 Total accuracy (A) versus p at r = 0.001, 0.01, 0.05, 0.1, and 0.2
3.1 Proposed recognition system
3.2 Functional blocks of the proposed MoBiG
3.3 Proposed finite state machine
3.4 Proposed incremental majority voting
3.5 Confusion matrix of MoBiG identifying four activities
4.1 CapsNets integrated in a generic DNN
4.2 Block diagram of the proposed generic DNN
4.3 Three skeleton channels
4.4 One example of the transformed skeleton maps from an input segment
4.5 Block diagrams of the proposed AJA and AJM
4.6 Block diagram of the proposed A_RNN
4.7 Proposed capsule network for TC_DNN and MC_DNN
4.8 Block diagrams of the proposed CapsNet-based DNNs
4.9 Examples of the panoramic videos of 12 actions where subjects are marked by red rectangular dash-line boxes for observation only
4.10 Visualization of the outputs from the intermediate layers of the proposed TC_DNN
4.11 Visualization of the outputs from the intermediate layers of two A_RNNs
5.1 Block diagram of the proposed deep neural network
5.2 Block diagram of the SA generation layer
5.3 Comparison of the performance of the AFS and ROS streams for each action class
5.4 JHMDB21 confusion matrix
6.1 Refined highway block for 3D attention
6.2 Overview of the proposed architecture for action recognition and detection
6.3 Temporal bilinear inception module
6.4 RH block in RNN structures, such as the standard RNN, LSTM, GRU, and variant RNNs
6.5 Schematic recurrent 3D Refined-Highway depth by three RH blocks
6.6 3DConvDM layer correlating the feature map X
6.7 Details of the GAA module
6.8 Per-category mAP on AVA
6.9 Possible locations associated with proposal regions of a subject
6.10 Visualization of the R_FSRH module on the UCF101-24 validation set

REFERENCES

[1] O. T.-C. Chen, C.-H. Tsai, H. H. Manh, and W.-C. Lai, "Activity recognition using a panoramic camera for homecare,"
in Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1-6, 2017.
[2] O. T.-C. Chen, H. H. Manh, and W.-C. Lai, "Activity recognition of multiple subjects for homecare," in Proceedings of the 10th International Conference on Knowledge and Smart Technology, pp. 242-247, 2018.
[3] O. T.-C. Chen, M.-H. Ha, and Y. L. Lee, "Computation-affordable recognition system for activity identification using a smart phone at home," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1-5, 2020.
[4] M.-H. Ha and O. T.-C. Chen, "Deep neural networks using capsule networks and skeleton-based attentions for action recognition," IEEE Access, vol. 9, pp. 6164-6178, 2021.
[5] M.-H. Ha and O. T.-C. Chen, "Action recognition improved by correlations and attention of subjects and scene," IEEE International Conference on Visual Communications and Image Processing, 2021 (submitted).
[6] M.-H. Ha and O. T.-C. Chen, "Enhancement models using refined highway for human action recognition," IEEE Transactions on Multimedia, 2021 (under submission).
[7] K. Xu, X. Jiang, and T. Sun, "Human activity recognition based on pose points selection," in Proceedings of the IEEE International Conference on Image Processing, pp. 2930-2934, 2015.
[8] W. H. Chen and H. P. Ma, "A fall detection system based on infrared array sensors with tracking capability for the elderly at home," in Proceedings of the IEEE 17th International Conference on E-health Networking, Application & Services, pp. 428-434, 2015.
[9] A. Jalal, Y. Kim, and D. Kim, "Ridge body parts features for human pose estimation and recognition from RGB-D video data," in Proceedings of the IEEE International Conference on Computing, Communication and Networking Technologies, pp. 1-6, 2014.
[10] A. Jalal, S. Kamal, A. Farooq, and D. Kim, "A spatiotemporal motion variation features extraction approach for human tracking and pose-based action recognition,"
in Proceedings of the IEEE International Conference on Informatics, Electronics & Vision, pp. 1-6, 2015.
[11] M. D. Bengalur, "Human activity recognition using body pose features and support vector machine," in Proceedings of the IEEE International Conference on Advances in Computing, Communications and Informatics, pp. 1970-1975, 2013.
[12] M. Papakostas, T. Giannakopoulos, F. Makedon, and V. Karkaletsis, "Short-term recognition of human activities using convolutional neural networks," in Proceedings of the IEEE International Conference on Signal-Image Technology & Internet-Based Systems, pp. 302-307, 2016.
[13] O. T.-C. Chen and C. H. Tsai, "Home activity log and life care using panoramic videos," in Proceedings of the ISG 10th World Conference of Gerontechnology, pp. 9s, 2016.
[14] M. Albanese et al., "A constrained probabilistic petri net framework for human activity detection in video," IEEE Transactions on Multimedia, vol. 10, iss. 8, pp. 1429-1443, 2008.
[15] S. Amri, W. Barhoumi, and E. Zagrouba, "Detection and matching of multiple occluded moving people for human tracking in colour video sequences," International Journal of Signal and Imaging Systems Engineering, pp. 153-163, 2011.
[16] A. Milan et al., "Joint tracking and segmentation of multiple targets," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5397-5406, 2015.
[17] J. Zhang, L. Presti, and S. Sclaroff, "Online multi-person tracking by tracker hierarchy," in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 379-385, 2012.
[18] F. Rotaru et al., "Video processing for rat behavior analysis," in Proceedings of the International Symposium on Signals, Circuits and Systems, pp. 1-4, 2017.
[19] R. Acevedo-Avila et al., "A linked list-based algorithm for blob detection on embedded vision-based sensors," Sensors, 2016 (doi: 10.3390/s16060782).
[20] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: a survey," ACM Computing Surveys, vol. 38, no. 4, article 13, Dec. 2006.
[21] Z. Zivkovic and F. van der Heijden, "Efficient adaptive density estimation per image pixel for the task of background subtraction,"
Pattern Recognition Letters, pp. 773-780, 2006.
[22] C. Maurer, R. Qi, and V. Raghavan, "A linear time algorithm for computing exact Euclidean distance transforms of binary images in arbitrary dimensions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 265-270, 2003.
[23] C. Chang and C. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1-39, 2011.
[24] S. Garg and S. Kumar, "Mean-shift based object tracking algorithm using SURF features," Recent Advances in Circuits, Communications and Signal Processing, pp. 187-194, 2013.
[25] T. Al Hanai, M. Ghassemi, and J. Glass, "Detecting depression with audio/text sequence modeling of interviews," Interspeech, pp. 1716-1720, 2018.
[26] B. F. Smaradottir, J. A. Håland, and S. G. Martinez, "User evaluation of the smartphone screen reader VoiceOver with visually disabled participants," Mobile Information Systems, 2018.
[27] N. Larburu, A. Artetxe, V. Escolar, A. Lozano, and J. Kerexeta, "Artificial intelligence to prevent mobile heart failure patients decompensation in real time: monitoring-based predictive model," Mobile Information Systems, 2018.
[28] F. Rotaru, S. Bejinariu, M. Luca, R. Luca, and C. Nita, "Video processing for rat behavior analysis," in Proceedings of the International Symposium on Signals, Circuits and Systems, pp. 1-4, 2017.
[29] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in Proceedings of the European Conference on Computer Vision, pp. 20-36, 2016.
[30] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[31] Z. Wu, Y.-G. Jiang, X. Wang, H. Ye, and X. Xue, "Multi-stream multi-class fusion of deep networks for video classification," ACM Multimedia, pp. 791-800, 2016.
[32] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[33] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proceedings of ICLR, pp. 1-14, 2015.
[34] J. Dean et al., "Large scale distributed deep networks," Advances in Neural Information Processing Systems, pp. 1223-1231, 2012.
[35] D. Ravì, C. Wong, B. Lo, and G. Yang, "A deep learning approach to on-node sensor data analytics for mobile or wearable devices," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 56-64, Jan. 2017.
[36] N. D. Lane et al., "DeepX: A software accelerator for low-power deep learning inference on mobile devices," in Proceedings of the 15th ACM/IEEE International Conference on Information Processing in Sensor Networks, pp. 1-12, 2016.
[37] A. Grushin, D. D. Monner, J. A. Reggia, and A. Mishra, "Robust human action recognition via long short-term memory," in Proceedings of the International Joint Conference on Neural Networks, pp. 1-8, 2013.
[38] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Action classification in soccer videos with long short-term memory recurrent neural networks," in Proceedings of the 20th International Conference on Artificial Neural Networks, vol. 6353, pp. 154-159, 2010.
[39] V. Veeriah, N. Zhuang, and G. Qi, "Differential recurrent neural networks for action recognition," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4041-4049, 2015.
[40] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pp. 1724-1734, 2014.
[41] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the International Conference on Machine Learning, pp. 448-456, 2015.
[42] F. N. Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size," arXiv:1602.07360v4 [cs.CV], Nov. 2016.
[43] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[44] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, 2018.
[45] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6848-6856, 2018.
[46] A. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv:1704.04861v1 [cs.CV], 17 Apr. 2017.
[47] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510-4520, 2018.
[48] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1800-1807, 2017.
[49] TensorFlow.org, "Transfer learning using pretrained ConvNets," 2019 (https://www.tensorflow.org/beta/tutorials/images/transfer_learning).
[50] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.
[51] H. Shih, "A survey of content-aware video analysis for sports," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 5, pp. 1212-1231, May 2018.
[52] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732, 2014.
[53] J. Rajasegaran et al., "DeepCaps: Going deeper with capsule networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10717-10725, 2019.
[54] D. Luvizon, D. Picard, and H. Tabia, "2D/3D pose estimation and action recognition using multitask deep learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5137-5146, 2018.
[55] L. Wang et al., "Appearance-and-relation networks for video classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1430-1439, 2018.
[56] Y. Chen et al., "Multi-fiber networks for video recognition," in Proceedings of the European Conference on Computer Vision, pp. 352-367, 2018.
[57] S. Sabour, N. Frosst, and G. Hinton, "Dynamic routing between capsules," in Proceedings of Advances in Neural Information Processing Systems, pp. 3856-3866, 2017.
[58] Q. Qi, S. Zhao, J. Shen, and K. Lam, "Multi-scale capsule attention-based salient object detection with multi-crossed layer connections," in Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1762-1767, 2019.
[59] J. Choi, H. Seo, S. Im, and M. Kang, "Attention routing between capsules," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[60] Z. Cao, T. Simon, S. Wei, and Y. Sheikh,
"Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291-7299, 2017.
[61] B. Li, Y. Dai, X. Cheng, H. Chen, Y. Lin, and M. He, "Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN," in Proceedings of the IEEE International Conference on Multimedia & Expo Workshops, pp. 601-604, 2017.
[62] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the International Conference on Machine Learning, pp. 2048-2057, 2015.
[63] S. Sharma, R. Kiros, and R. Salakhutdinov, "Action recognition using visual attention," arXiv preprint arXiv:1511.04119, 2015.
[64] E. Mansimov, N. Srivastava, and R. Salakhutdinov, "Initialization strategies of spatio-temporal convolutional neural networks," arXiv preprint arXiv:1503.07274, 2015.
[65] D. Li et al., "Unified spatio-temporal attention networks for action recognition in videos," IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 416-428, 2018.
[66] K. Soomro, R. Amir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
[67] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2556-2563, 2011.
[68] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299-6308, 2017.
[69] N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid, "MARS: Motion-augmented RGB stream for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7882-7891, 2019.
[70] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933-1941, 2016.
[71] L. Wang, P. Koniusz, and D. Huynh, "Hallucinating IDT descriptors and I3D optical flow features for action recognition with CNNs," in Proceedings of the IEEE International Conference on Computer Vision, pp. 8698-8708, 2019.
[72] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in Proceedings of the European Conference on Computer Vision, Springer, pp. 20-36, 2016.
[73] L. Sun, K. Jia, K. Chen, D.-Y. Yeung, B. E. Shi, and S. Savarese, "Lattice long short-term memory for human action recognition," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2147-2156, 2017.
[74] G. Rogez, P. Weinzaepfel, and C. Schmid, "LCR-Net++: Multi-person 2D and 3D pose detection in natural images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 5, pp. 1146-1161, May 2020.
[75] V. Choutas, P. Weinzaepfel, J. Revaud, and C. Schmid, "PoTion: Pose motion representation for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7024-7033, 2018.
[76] Z. Yang, Y. Li, J. Yang, and J. Luo, "Action recognition with spatio-temporal visual attention on skeleton image sequences," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2405-2415, Aug. 2019.
[77] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, "A new representation of
skeleton sequences for 3D action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3288-3297, 2017.
[78] C. Caetano, F. Brémond, and W. Schwartz, "Skeleton image representation for 3D action recognition based on tree structure and reference joints," in Proceedings of the International Conference on Graphics, Patterns and Images, pp. 16-23, 2019.
[79] C. Caetano, J. Sena, F. Brémond, J. Dos Santos, and W. Schwartz, "SkeleMotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition," in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1-8, 2019.
[80] T. Ahmad, L. Jin, X. Zhang, L. Lin, and G. Tang, "Graph convolutional neural network for action recognition: A comprehensive survey," IEEE Transactions on Artificial Intelligence, 2021.
[81] X. Zhang, C. Xu, X. Tian, and D. Tao, "Graph edge convolutional neural networks for skeleton-based action recognition," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 8, pp. 3047-3060, 2019.
[82] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, "Symbiotic graph neural networks for 3D skeleton-based human action recognition and motion prediction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[83] M. Wang, B. Ni, and X. Yang, "Learning multi-view interactional skeleton graph for action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[84] G. E. Hinton, S. Sabour, and N. Frosst, "Matrix capsules with EM routing," in Proceedings of the International Conference on Learning Representations, 2018.
[85] C. Xiang, L. Zhang, Y. Tang, W. Zou, and C. Xu, "MS-CapsNet: A novel multi-scale capsule network," IEEE Signal Processing Letters, vol. 25, no. 12, pp. 1850-1854, Dec. 2018.
[86] K. Duarte, Y. Rawat, and M. Shah, "VideoCapsuleNet: A simplified network for action detection," in Proceedings of Advances in Neural Information Processing Systems, pp. 7610-7619, 2018.
[87] C. Zach, T. Pock, and H. Bischof, "A duality based approach for realtime TV-L1 optical flow," in Proceedings of the Joint Pattern Recognition Symposium, pp. 214-223, 2007.
[88] Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3D residual networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 5533-5541, 2017.
[89] D. Tran et al., "ConvNet architecture search for spatiotemporal feature learning," arXiv preprint arXiv:1708.05038, 2017.
[90] S. Xie et al., "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," in Proceedings of the European Conference on Computer Vision, pp. 305-321, 2018.
[91] Z. Lan, M. Lin, X. Li, A. Hauptmann, and B. Raj, "Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 204-212, 2015.
[92] Z. Qiu et al., "Learning spatio-temporal representation with local and global diffusion,"
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12056-12065, 2019.
[93] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, 2018.
[94] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.
[95] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid, "Actor-centric relation network," in Proceedings of the European Conference on Computer Vision, pp. 318-334, 2018.
[96] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, "Relation networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588-3597, 2018.
[97] A. Kolesnikov et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[98] L. Jiao et al., "A survey of deep learning-based object detection," IEEE Access, vol. 7, pp. 128837-128868, 2019.
[99] A. Bochkovskiy, "Darknet: Open source neural networks in Python," available online: https://github.com/AlexeyAB/darknet, 2020.
[100] A. Yan, Y. Wang, Z. Li, and Y. Qiao, "PA3D: Pose-action 3D machine for video recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7922-7931, 2019.
[101] W. Du, Y. Wang, and Y. Qiao, "RPAN: An end-to-end recurrent pose-attention network for action recognition in videos," in Proceedings of the IEEE International Conference on Computer Vision, pp. 3725-3734, 2017.
[102] Z. Guo, R. Zhi, B. Wang, Z. Fang, W. Zhang, and F. Flohr, "Traffic police gesture recognition by pose graph convolutional networks," in Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 1833-1838, 2020.
[103] X. He, J. He, C. Zhang, and R. Dong, "Visual recognition of traffic police gestures with convolutional pose machine and handcrafted features," Neurocomputing, vol. 390, pp. 248-259, 2020.
[104] A. Radford, R. Child, S. Gray, and I. Sutskever, "Generating long sequences with sparse transformers," arXiv preprint arXiv:1904.10509, 2019.
[105] K. He et al., "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969, 2017.
[106] H. Zhang et al., "Self-attention generative adversarial networks," in Proceedings of the International Conference on Machine Learning, pp. 7354-7363, 2019.
[107] C. Chen, R. Hou, and M. Shah, "An end-to-end 3D convolutional neural network for action detection and segmentation in videos," arXiv preprint arXiv:1712.01111, 2017.
[108] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, "A hierarchical deep temporal model for group activity recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971-1980, 2016.
[109] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, "A closer look at spatiotemporal convolutions for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450-6459, 2018.
[110] M.-H. Ha and O. T.-C. Chen, "Deep neural networks using capsule networks and skeleton-based attentions for action recognition," IEEE Access, vol. 9, pp. 6164-6178, Jan. 2021.
[111] J. Wei, H. Wang, Y. Yi, Q. Li, and D. Huang, "P3D-CTN: Pseudo-3D convolutional tube network for spatio-temporal action detection in videos,"
in Proceedings of the IEEE International Conference on Image Processing, pp. 300-304, 2019.
[112] A. Yan, Y. Wang, Z. Li, and Y. Qiao, "PA3D: Pose-action 3D machine for video recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7922-7931, 2019.
[113] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," in Proceedings of the European Conference on Computer Vision, pp. 305-321, 2018.
[114] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, "Multi-fiber networks for video recognition," in Proceedings of the European Conference on Computer Vision, pp. 352-367, 2018.
[115] J. Lin, C. Gan, and S. Han, "TSM: Temporal shift module for efficient video understanding," in Proceedings of the IEEE International Conference on Computer Vision, pp. 7083-7093, 2019.
[116] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway networks," arXiv preprint arXiv:1505.00387, 2015.
[117] Y. Kim, Y. Jernite, D. Sontag, and A. Rush, "Character-aware neural language models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, 2016.
[118] C. Feichtenhofer, H. Fan, J. Malik, and K. He, "SlowFast networks for video recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202-6211, 2019.
[119] M. Kim, T. Kim, and D. Kim, "Spatio-temporal SlowFast self-attention network for action recognition," in Proceedings of the IEEE International Conference on Image Processing, pp. 2206-2210, 2020.
[120] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969, 2017.
[121] O. Ulutan, S. Rallapalli, M. Srivatsa, C. Torres, and B. Manjunath, "Actor conditioned attention maps for video action detection," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 527-536, 2020.
[122] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, and R. Sukthankar, "AVA: A video dataset of spatio-temporally localized atomic visual actions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047-6056, 2018.
[123] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, "Hollywood in homes: Crowdsourcing data collection for activity understanding," in Proceedings of the European Conference on Computer Vision, pp. 510-526, 2016.
[124] Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3D residual networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 5533-5541, 2017.
[125] H. Xu, A. Das, and K. Saenko, "R-C3D: Region convolutional 3D network for temporal activity detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 5783-5792, 2017.
[126] R. R. A. Pramono, Y.-T. Chen, and W.-H. Fang, "Hierarchical self-attention network for action localization in videos," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 61-70, 2019.
[127] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention augmented convolutional networks," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3286-3295, 2019.
[128] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers,"
in Proceedings of the European Conference on Computer Vision, pp. 213-229, 2020.
[129] J. Li, X. Liu, W. Zhang, M. Zhang, J. Song, and N. Sebe, "Spatio-temporal attention networks for action recognition and detection," IEEE Transactions on Multimedia, vol. 22, no. 11, pp. 2990-3001, 2020.
[130] W. Du, Y. Wang, and Y. Qiao, "RPAN: An end-to-end recurrent pose-attention network for action recognition in videos," in Proceedings of the IEEE International Conference on Computer Vision, pp. 3725-3734, 2017.
[131] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, "CCNet: Criss-cross attention for semantic segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 603-612, 2019.
[132] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, 2018.
[133] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, 2018.
[134] R. Dai, S. Das, L. Minciullo, L. Garattoni, G. Francesca, and F. Bremond, "PDAN: Pyramid dilated attention network for action detection," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2970-2979, 2021.
[135] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, and S. Gelly, "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[136] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, "Video action transformer network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 244-253, 2019.
[137] C.-Y. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krahenbuhl, and R. Girshick, "Long-term feature banks for detailed video understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 284-293, 2019.
[138] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, "Action tubelet detector for spatio-temporal action localization," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4405-4413, 2017.
[139] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei, "Recurrent tubelet proposal and recognition networks for action detection," in Proceedings of the European Conference on Computer Vision, pp. 303-318, 2018.
[140] M. Najibi, M. Rastegari, and L. S. Davis, "G-CNN: An iterative grid based object detector," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2369-2377, 2016.
[141] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection,"
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154-6162, 2018.
[142] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman, "A short note about Kinetics-600," arXiv preprint arXiv:1808.01340, 2018.

APPENDIX A

Fig. A.1 Visualization of the 12 categories of panoramic camera data in two and three dimensions with the scatter plot filter.
Fig. A.2 t-SNE visualization of test data, where each data point represents the shots of a frame sequence on UCF101: (a) and (b) feature representations of the proposed methods, in which data points are easier to separate and classify; (c) t-SNE visualization of test data from the 12 classes of panoramic video data at the inference layer.

Data visualization: t-SNE is a non-linear dimension reduction scheme that is used to view high-dimensional data. It reduces multidimensional data to a manageable number of dimensions, such as two or three. The high-dimensional data of the panoramic video clips are visualized by 12 classes, as displayed in Fig. A.1, which forms scatter plots in two and three dimensions that well preserve their pairwise distances. Additionally, the cluster distribution of the 12 categories is depicted in Fig. A.2(c), where the distinctions among these 12 classes are observed by t-SNE. Figs. A.2(a) and (b) show that our DNN can find a feature space in which data points of different classes in UCF101 are separated.

TABLE A.1 COMPARISON OF THE FEATURES AND PERFORMANCE OF OUR PROPOSED AND GENERIC ACTIVITY RECOGNITION SYSTEMS ON THE PANORAMIC VIDEO DATASET

Work | Activities | Year | Methods | Accuracy
[12] | falling, standing, sitting | 2016 | GMM | 80%
[1] | falling, standing, watching_TV, walking, sitting | 2017 | SVM | 89.6%
[2] | standing, walking, sitting, falling | 2018 | GMM, FSM, tracking | 92.9%
[3] | standing, walking, sitting, falling | 2020 | MoBiG, FSM, IMV | 93.1%
[4] | drinking, falling, reading_newspaper, standing_up, eating_snack, going_out, sitting_down, telephoning, entering_room, pour_tea, smoking, watching_TV | 2020 | 3DCNN, capsule networks, attention | 95.3%

As listed in Table A.1, the accuracy of the experimental results of this dissertation research on our own panoramic video dataset has increased over time, depending on the method used. In the early work, where we did not pay much attention to the design of deep neural networks, we designed action recognition systems based on traditional machine learning such as GMM and SVM, which work well with small datasets and a small number of categories because input denoising is independent of action recognition. A potential solution is to build a more sophisticated structure using attention mechanisms, multiple streams, and CapsNet at the feature level to replace input-level processing.
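To make the visualization step described above concrete, the following is a minimal sketch of how clip-level feature embeddings can be projected with t-SNE and plotted. The array names, the embedding dimensionality, and the use of scikit-learn and Matplotlib are illustrative assumptions and are not taken from the dissertation's implementation; in practice, the placeholder arrays would be replaced by embeddings extracted from the inference layer of the trained network.

```python
# Minimal t-SNE visualization sketch (illustrative; uses random placeholder data).
# `features` stands in for an (N, D) array of clip embeddings taken from the
# inference layer, and `labels` for the corresponding 12 action-class indices.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(600, 256))   # placeholder embeddings, N=600, D=256
labels = rng.integers(0, 12, size=600)   # placeholder labels for 12 classes

# Project the high-dimensional embeddings to 2-D while preserving local structure.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
points = tsne.fit_transform(features)

plt.figure(figsize=(6, 5))
sc = plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab20", s=8)
plt.colorbar(sc, label="action class index")
plt.title("t-SNE projection of clip embeddings (12 classes)")
plt.tight_layout()
plt.show()
```

With real embeddings, well-separated clusters in this scatter plot correspond to the class separability illustrated in Figs. A.1 and A.2.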