A study on deep learning techniques for human action representation and recognition with skeleton data


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

PHAM DINH TAN

A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA

Major: Computer Engineering
Code: 9480106

DOCTORAL DISSERTATION IN COMPUTER ENGINEERING

SUPERVISORS:
Assoc. Prof. Vu Hai
Assoc. Prof. Le Thi Lan

Hanoi - 2022

DECLARATION OF AUTHORSHIP

I, Pham Dinh Tan, declare that the dissertation titled "A study on deep learning techniques for human action representation and recognition with skeleton data" has been entirely composed by myself. I confirm the following points:

• This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
• Appropriate acknowledgment has been given within this dissertation where reference has been made to the published work of others.
• The dissertation submitted is my own, except where work done in collaboration has been included. The collaborative contributions have been indicated.

Hanoi, May 08, 2022
Ph.D. Student: Pham Dinh Tan
SUPERVISORS: Assoc. Prof. Vu Hai, Assoc. Prof. Le Thi Lan

ACKNOWLEDGEMENT

This dissertation was composed during my Ph.D. at the Computer Vision Department, MICA Institute, Hanoi University of Science and Technology. I am grateful to all the people who contributed in different ways to my Ph.D. journey. First, I would like to express my sincere thanks to my supervisors, Assoc. Prof. Vu Hai and Assoc. Prof. Le Thi Lan, for their guidance and support. I would like to thank all MICA members for their help during my Ph.D. study. My sincere thanks to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Assoc. Prof. Tran Thi Thanh Hai for giving me a lot of support and valuable advice. Many thanks to Dr. Nguyen Thuy Binh, Nguyen Hong Quan, Hoang Van Nam, Nguyen Tien Nam, Pham Quang Tien, and Nguyen Tien Thanh for their support. I would like to thank my colleagues at the Hanoi University of Mining and Geology for their support during my Ph.D. study. Special thanks to my family for understanding my hours glued to the computer screen.

Hanoi, May 08, 2022
Ph.D. Student

ABSTRACT

Human action recognition (HAR) from color and depth (RGB-D) sensors, and especially from derived information such as skeleton data, has received the research community's attention due to its wide range of practical applications: abnormal event detection in camera surveillance, gaming, human-machine interaction, elderly monitoring, and virtual/augmented reality. In addition to the advantages of fast computation, low storage, and invariance to human appearance, skeleton data have shortcomings: pose estimation errors, skeleton noise in complex actions, and incompleteness due to occlusion. Moreover, action recognition remains challenging due to the diversity of human actions, intra-class variations, and inter-class similarities.

The dissertation focuses on improving action recognition performance using skeleton data. The proposed methods are evaluated on public skeleton datasets collected by RGB-D sensors: MSR-Action3D and MICA-Action3D, datasets with high-quality skeleton data; CMDFALL, a challenging dataset with noise in the skeleton data; and NTU RGB+D, a worldwide benchmark among the large-scale datasets. These datasets therefore cover different dataset scales as well as different levels of skeleton-data quality.

To overcome the limitations of skeleton data, the dissertation presents techniques in several directions. First, as joints have different levels of engagement in each action, techniques are proposed for selecting the joints that play an important role in human actions, covering both preset joint subset selection and automatic joint subset selection. Two frameworks are evaluated to show the performance of using a subset of joints for action representation. The first framework employs Dynamic Time Warping (DTW) and the Fourier Temporal Pyramid (FTP), while the second uses covariance descriptors extracted from joint positions and velocities. Experimental results show that joint subset selection helps improve action recognition performance on datasets with noise in the skeleton data.

However, HAR using handcrafted feature extraction cannot exploit the inherent graph structure of the human skeleton. Recent Graph Convolutional Networks (GCNs) are studied to handle this issue. Among GCN models, the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) is used as the baseline model. AAGCN achieves state-of-the-art performance on large-scale datasets such as NTU RGB+D and Kinetics, but it employs only joint information. Therefore, a Feature Fusion (FF) module is proposed in this dissertation; the resulting model is named FF-AAGCN. The performance of FF-AAGCN is evaluated on the large-scale NTU RGB+D dataset and on CMDFALL. The evaluation results show that the proposed method is robust to noise and invariant to skeleton translation. In particular, FF-AAGCN achieves remarkable results on challenging datasets.

Finally, as the computing capacity of edge devices is limited, a lightweight deep learning model is desirable for application deployment. A lightweight GCN architecture is proposed to show that the complexity of a GCN architecture can be reduced further depending on the dataset's characteristics. The proposed lightweight model is suitable for application development on edge devices.

Hanoi, May 08, 2022
Ph.D. Student
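The two handcrafted ideas summarized in the abstract can be made concrete with short sketches. First, a minimal covariance descriptor computed on the positions and velocities of a selected joint subset, in the spirit of the second framework; the joint indices, array shapes, and function names here are illustrative assumptions, not the dissertation's actual implementation:

    import numpy as np

    def covariance_descriptor(sequence, joint_subset):
        """Covariance descriptor over one skeleton sequence.

        sequence: (T, J, 3) array of 3D joint positions over T frames.
        joint_subset: indices of the selected (e.g., most informative) joints.
        Returns the upper triangle of the covariance matrix of per-frame
        position + velocity features: a fixed-length vector regardless of T.
        """
        pos = sequence[:, joint_subset, :]            # (T, K, 3)
        vel = np.diff(pos, axis=0, prepend=pos[:1])   # finite-difference velocity
        feats = np.concatenate([pos, vel], axis=1)    # (T, 2K, 3)
        feats = feats.reshape(feats.shape[0], -1)     # one (6K,) vector per frame
        cov = np.cov(feats, rowvar=False)             # (6K, 6K)
        iu = np.triu_indices(cov.shape[0])
        return cov[iu]                                # symmetric, keep upper triangle

    # Toy example: 40 frames of a 20-joint skeleton; the 8 joints kept here
    # are an arbitrary illustration, not a learned or preset subset.
    seq = np.random.randn(40, 20, 3)
    desc = covariance_descriptor(seq, joint_subset=[0, 3, 7, 9, 11, 13, 17, 19])
    print(desc.shape)  # (1176,) since 6K = 48 and 48 * 49 / 2 = 1176

Because the descriptor length depends only on the number of selected joints, sequences of different durations can be compared directly by any fixed-length classifier.

Likewise, the feature-fusion idea behind FF-AAGCN — enriching a joint-only GCN input with derived streams — can be sketched as follows. The choice of body-center-relative coordinates (which gives translation invariance) and temporal displacements as the extra channels is an assumption for illustration; the dissertation defines its own FF module:

    import numpy as np

    def fuse_input_features(joints, center_joint=1):
        """Stack derived skeleton streams along the channel axis.

        joints: (C=3, T, J) raw 3D joint coordinates.
        Returns a (9, T, J) tensor: raw positions, coordinates relative to a
        body-center joint (the index here is an illustrative choice), and
        frame-to-frame motion.
        """
        relative = joints - joints[:, :, center_joint:center_joint + 1]
        motion = np.zeros_like(joints)
        motion[:, 1:, :] = joints[:, 1:, :] - joints[:, :-1, :]
        return np.concatenate([joints, relative, motion], axis=0)

    x = np.random.randn(3, 40, 25)       # NTU RGB+D skeletons have 25 joints
    print(fuse_input_features(x).shape)  # (9, 40, 25)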
CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
ABBREVIATIONS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1. LITERATURE REVIEW
1.1 Introduction
1.2 An overview of action recognition
1.3 Data modalities for action recognition
1.3.1 Color data
1.3.2 Depth data
1.3.3 Skeleton data
1.3.4 Other modalities
1.3.5 Multi-modality
1.4 Skeleton data collection
1.4.1 Data collection from motion capture systems
1.4.2 Data collection from RGB+D sensors
1.4.3 Data collection from pose estimation
1.5 Benchmark datasets
1.5.1 MSR-Action3D
1.5.2 MICA-Action3D
1.5.3 CMDFALL
1.5.4 NTU RGB+D
1.6 Skeleton-based action recognition methods
1.6.1 Handcraft-based methods
1.6.1.1 Joint-based action recognition
1.6.1.2 Body part-based action recognition
1.6.2 Deep learning-based methods
1.6.2.1 Convolutional Neural Networks
1.6.2.2 Recurrent Neural Networks
1.7 Research on action recognition in Vietnam
1.8 Conclusion of the chapter
CHAPTER 2. JOINT SUBSET SELECTION FOR SKELETON-BASED HUMAN ACTION RECOGNITION
2.1 Introduction
2.2 Proposed methods
2.2.1 Preset Joint Subset Selection
2.2.1.1 Spatial-Temporal Representation
2.2.1.2 Dynamic Time Warping
2.2.1.3 Fourier Temporal Pyramid
2.2.2 Automatic Joint Subset Selection
2.2.2.1 Joint weight assignment
2.2.2.2 Most informative joint selection
2.2.2.3 Human action recognition based on MIJ joints
2.3 Experimental results
2.3.1 Evaluation metrics
2.3.2 Preset Joint Subset Selection
2.3.3 Automatic Joint Subset Selection
2.4 Conclusion of the chapter
CHAPTER 3. FEATURE FUSION FOR THE GRAPH CONVOLUTIONAL NETWORK
3.1 Introduction
3.2 Related work on Graph Convolutional Networks
3.3 Proposed method
3.4 Experimental results
3.5 Discussion
3.6 Conclusion of the chapter
CHAPTER 4. THE PROPOSED LIGHTWEIGHT GRAPH CONVOLUTIONAL NETWORK
4.1 Introduction
4.2 Related work on Lightweight Graph Convolutional Networks
4.3 Proposed method
4.4 Experimental results
4.5 Application demonstration
4.6 Conclusion of the chapter
CONCLUSION AND FUTURE WORKS
PUBLICATIONS
BIBLIOGRAPHY

ABBREVIATIONS

2D      Two-Dimensional
3D      Three-Dimensional
AAGCN   Attention-enhanced Adaptive Graph Convolutional Network
AGCN    Adaptive Graph Convolutional Network
AMIJ    Adaptive number of Most Informative Joints
AS      Action Set
AS-GCN  Actional-Structural Graph Convolutional Network
BN      Batch Normalization
BPL     Body Part Location
CAM     Channel Attention Module
CCTV    Closed-Circuit Television
CNN     Convolutional Neural Network
CovMIJ  Covariance Descriptor on Most Informative Joints
CPU     Central Processing Unit
CS      Cross-Subject
CV      Cross-View
DFT     Discrete Fourier Transform
DTW     Dynamic Time Warping
FC      Fully Connected
FF      Feature Fusion
FLOP    Floating Point OPeration
FMIJ    Fixed number of Most Informative Joints
fps     frames per second
FTP     Fourier Temporal Pyramid
GCN     Graph Convolutional Network
GCNN    Graph-based Convolutional Neural Network
GPU     Graphical Processing Unit
GRU     Gated Recurrent Unit
HAR     Human Action Recognition
HCI     Human-Computer Interaction

CONCLUSION AND FUTURE WORKS

• Lightweight models are required for real-time applications. Further research will aim at designing lightweight models drawn from the wide range of deep learning models in the literature.

• Study the interpretability of action recognition using graph-based deep learning. Deep learning approaches, dominant in the present literature, achieve excellent performance at the expense of the understandability of the learning process. Handcrafted approaches can be deemed less generalizable and more data-type specific in general; they are, however, more intelligible from a human standpoint. The optimal trade-off strategy is still an open question [24].

• For skeleton-based action recognition, performance is strongly determined by the quality of the skeleton data. Improving the quality of pose estimation is therefore important for high-performance action recognition.

• Evaluate the proposed methods on other datasets such as NTU RGB+D 120 [37] and UAV-Human [38].

• Study key frame selection for action recognition. The combination of key frame selection and joint subset selection (JSS) should be considered.

• Further extend the study of graph theory for action recognition, for example graph node prediction for handling noise and incompleteness in the skeleton data.

Long-Term Perspectives

• Extend the proposed methods to continuous skeleton-based human action recognition. The proposed methods are currently evaluated on datasets with segmented skeleton sequences; action segmentation is required for continuous action recognition.

• Extend the study of Graph Convolutional Networks to Geometric Deep Learning.
There is a wide variety of deep learning models in the literature, including CNNs, RNNs, GCNs, and many others. This leads to a requirement to construct a general mathematical framework for all of these models. Geometric Deep Learning is an approach to unifying these deep learning models by exploring the common mathematics behind them.

• Develop applications using the proposed methods for human action recognition, such as remote monitoring of the elderly in healthcare or camera surveillance for abnormal behavior detection.

PUBLICATIONS

Conferences

[C1] Tien-Nam Nguyen, Dinh-Tan Pham, Thi-Lan Le, Hai Vu, and Thanh-Hai Tran (2018). Novel Skeleton-based Action Recognition Using Covariance Descriptors on Most Informative Joints. Proceedings of the International Conference on Knowledge and Systems Engineering (KSE 2018), IEEE, Vietnam, ISBN: 978-1-5386-6113-0, pp. 50-55.
[C2] Dinh-Tan Pham, Tien-Nam Nguyen, Thi-Lan Le, and Hai Vu (2019). Analyzing Role of Joint Subset Selection in Human Action Recognition. Proceedings of the NAFOSTED Conference on Information and Computer Science (NICS 2019), IEEE, Vietnam, ISBN: 978-1-7281-5163-2, pp. 61-66.
[C3] Dinh-Tan Pham, Tien-Nam Nguyen, Thi-Lan Le, and Hai Vu (2020). Spatio-Temporal Representation for Skeleton-based Human Action Recognition. Proceedings of the International Conference on Multimedia Analysis and Pattern Recognition (MAPR 2020), IEEE, Vietnam, ISBN: 978-1-7281-6555-4, pp. 1-6.

Journals

[J1] Dinh-Tan Pham, Quang-Tien Pham, Thi-Lan Le, and Hai Vu (2021). An Efficient Feature Fusion of Graph Convolutional Networks and Its Application for Real-Time Traffic Control Gestures Recognition. IEEE Access, ISSN: 2169-3536, pp. 121930-121943. (ISI, Q1)
[J2] Van-Toi Nguyen, Tien-Nam Nguyen, Thi-Lan Le, Dinh-Tan Pham, and Hai Vu (2021). Adaptive most joint selection and covariance descriptions for a robust skeleton-based human action recognition. Multimedia Tools and Applications (MTAP), Springer, DOI: 10.1007/s11042-021-10866-4, pp. 1-27. (ISI, Q1)
[J3] Dinh Tan Pham, Thi Phuong Dang, Duc Quang Nguyen, Thi Lan Le, and Hai Vu (2021). Skeleton-based Action Recognition Using Feature Fusion for Spatial-Temporal Graph Convolutional Networks. Journal of Science and Technique, Le Quy Don Technical University (LQDTU-JST), ISSN: 1859-0209, pp. 7-24.

BIBLIOGRAPHY

[1] Hoang V.N., Le T.L., Tran T.H., Nguyen V.T., et al. (2019). 3D skeleton-based action recognition with convolutional neural networks. In 2019 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pp. 1-6. IEEE.
[2] Johansson G. (1973). Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14(2):pp. 201-211.
[3] Google. MediaPipe. https://github.com/google/mediapipe [Online; accessed 01-October-2021].
[4] Chen X. et al. (2014). Real-time Action Recognition for RGB-D and Motion Capture Data. Ph.D. thesis, Aalto University.
[5] Zhang J., Li W., Ogunbona P.O., Wang P., and Tang C. (2016). RGB-D-based action recognition datasets: A survey. Pattern Recognition, 60:pp. 86-105.
[6] Herath S., Harandi M., and Porikli F. (2017). Going deeper into action recognition: A survey. Image and Vision Computing, 60:pp. 4-21.
[7] Wang Q. (2016). A survey of visual analysis of human motion and its applications. arXiv preprint arXiv:1608.00700, pp. 1-6.
[8] Presti L.L. and La Cascia M. (2016). 3D skeleton-based human action classification: A survey. Pattern Recognition, 53:pp. 130-147.
[9] Kong Y. and Fu Y. (2022). Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5):pp. 1366-1401.
[10] Biliński P.T. (2014). Human action recognition in videos. Ph.D. thesis, Université Nice Sophia Antipolis.
[11] Bux A. (2017). Vision-based human action recognition using machine learning techniques. Ph.D. thesis, Lancaster University (United Kingdom).
[12] Beddiar D.R., Nini B., Sabokrou M., and Hadid A. (2020). Vision-based human activity recognition: a survey. Multimedia Tools and Applications, 79(41):pp. 30509-30555.
[13] Agarwal P. and Alam M. (2020). A lightweight deep learning model for human activity recognition on edge devices. Procedia Computer Science, 167:pp. 2364-2373.
[14] Das S. (2020). Spatio-Temporal Attention Mechanism for Activity Recognition. Ph.D. thesis, Université Côte d'Azur.
[15] Koperski M. (2017). Human action recognition in videos with local representation. Ph.D. thesis, COMUE Université Côte d'Azur (2015-2019).
[16] Wang P. (2017). Action recognition from RGB-D data. Ph.D. thesis, University of Wollongong.
[17] Li W., Zhang Z., and Liu Z. (2010). Action recognition based on a bag of 3D points. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, pp. 9-14. IEEE.
[18] Ren B., Liu M., Ding R., and Liu H. (2020). A survey on 3D skeleton-based action recognition using learning method. arXiv preprint arXiv:2002.05907, pp. 1-8.
[19] Sun Z., Liu J., Ke Q., Rahmani H., Bennamoun M., and Wang G. (2020). Human action recognition from various data modalities: A review. arXiv preprint arXiv:2012.11866, pp. 1-20.
[20] Tran T.H., Le T.L., Pham D.T., Hoang V.N., Khong V.M., Tran Q.T., Nguyen T.S., and Pham C. (2018). A multi-modal multi-view dataset for human fall analysis and preliminary investigation on modality. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1947-1952. IEEE.
[21] Adama D.O.A. (2020). Fuzzy Transfer Learning in Human Activity Recognition. Ph.D. thesis, Nottingham Trent University (United Kingdom).
[22] Shotton J., Fitzgibbon A., Cook M., Sharp T., Finocchio M., Moore R., Kipman A., and Blake A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR 2011, pp. 1297-1304. IEEE.
[23] Al-Akam R. (2021). Human Action Recognition in Video Data using Color and Distance. Ph.D. thesis, University of Koblenz and Landau.
[24] Angelini F. (2020). Novel methods for posture-based human action recognition and activity anomaly detection. Ph.D. thesis, Newcastle University.
[25] Cao Z., Simon T., Wei S.E., and Sheikh Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7291-7299.
[26] Google. MediaPipe Pose. https://google.github.io/mediapipe/solutions/pose [Online; accessed 15-September-2021].
[27] Fang H.S., Xie S., Tai Y.W., and Lu C. (2017). RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE international conference on computer vision, pp. 2334-2343.
[28] Vonstad E.K., Su X., Vereijken B., Bach K., and Nilsen J.H. (2020). Comparison of a deep learning-based pose estimation system to marker-based and Kinect systems in exergaming for balance training. Sensors, 20(23):pp. 1-16.
[29] Yokoyama N. Pose estimation and person description using convolutional neural networks. http://naoki.io/portfolio/person_descrip.html [Online; accessed 15-September-2021].
[30] Kuehne H., Jhuang H., Garrote E., Poggio T., and Serre T. (2011). HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pp. 2556-2563. IEEE.
[31] Soomro K., Zamir A.R., and Shah M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, pp. 1-6.
[32] Xia L., Chen C.C., and Aggarwal J.K. (2012). View invariant human action recognition using histograms of 3D joints. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20-27. IEEE.
[33] Ofli F., Chaudhry R., Kurillo G., Vidal R., and Bajcsy R. (2013). Berkeley MHAD: A comprehensive multimodal human action database. In 2013 IEEE Workshop on Applications of Computer Vision (WACV), pp. 53-60. IEEE.
[34] Oreifej O. and Liu Z. (2013). HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 716-723.
[35] Shahroudy A., Liu J., Ng T.T., and Wang G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1010-1019.
[36] Kay W., Carreira J., Simonyan K., Zhang B., Hillier C., Vijayanarasimhan S., Viola F., Green T., Back T., Natsev P., et al. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, pp. 1-22.
[37] Liu J., Shahroudy A., Perez M., Wang G., Duan L.Y., and Kot A.C. (2019). NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):pp. 2684-2701.
[38] Li T., Liu J., Zhang W., Ni Y., Wang W., and Li Z. (2021). UAV-Human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16266-16275.
[39] Bloom V. (2015). Multiple Action Recognition for Video Games (MARViG). Ph.D. thesis, Kingston University.
[40] Mazari A. (2020). Deep Learning for Action Recognition in Videos. Ph.D. thesis, Sorbonne Université.
[41] Wang L. (2021). Analysis and evaluation of Kinect-based action recognition algorithms. arXiv preprint arXiv:2112.08626, pp. 1-22.
[42] Xing Y. and Zhu J. (2021). Deep learning-based action recognition with 3D skeleton: A survey. CAAI Transactions on Intelligence Technology, pp. 80-92.
[43] Wang J., Liu Z., Wu Y., and Yuan J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1290-1297. IEEE.
[44] Yang X. and Tian Y.L. (2012). Eigenjoints-based action recognition using Naive-Bayes-Nearest-Neighbor. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 14-19. IEEE.
[45] Hussein M.E., Torki M., Gowayyed M.A., and El-Saban M. (2013). Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 2466-2472.
[46] Gaglio S., Re G.L., and Morana M. (2014). Human activity recognition process using 3-D posture data. IEEE Transactions on Human-Machine Systems, 45(5):pp. 586-597.
[47] Müller M. and Röder T. (2006). Motion templates for automatic classification and retrieval of motion capture data. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 137-146.
[48] Ofli F., Chaudhry R., Kurillo G., Vidal R., and Bajcsy R. (2014). Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition. Journal of Visual Communication and Image Representation, 25(1):pp. 24-38.
[49] Wang J., Liu Z., Wu Y., and Yuan J. (2013). Learning actionlet ensemble for 3D human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):pp. 914-927.
[50] Zanfir M., Leordeanu M., and Sminchisescu C. (2013). The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In Proceedings of the IEEE international conference on computer vision, pp. 2752-2759.
[51] Ghorbel E., Boutteau R., Boonaert J., Savatier X., and Lecoeuche S. (2015). 3D real-time human action recognition using a spline interpolation approach. In 2015 International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 61-66. IEEE.
[52] Wang C., Wang Y., and Yuille A.L. (2013). An approach to pose-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 915-922.
[53] Wei P., Zheng N., Zhao Y., and Zhu S.C. (2013). Concurrent action detection with structural prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3136-3143.
[54] Zhou L., Li W., Zhang Y., Ogunbona P., Nguyen D.T., and Zhang H. (2014). Discriminative key pose extraction using extended LC-KSVD for action recognition. In 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1-8. IEEE.
[55] Eweiwi A., Cheema M.S., Bauckhage C., and Gall J. (2014). Efficient pose-based action recognition. In Asian Conference on Computer Vision, pp. 428-443. Springer.
[56] Cippitelli E., Gasparrini S., Gambi E., and Spinsante S. (2016). A human activity recognition system using skeleton data from RGBD sensors. Computational Intelligence and Neuroscience, 2016:pp. 1-15.
[57] El-Ghaish H.A., Shoukry A.A., and Hussein M.E. (2018). CovP3DJ: Skeleton-parts-based-covariance descriptor for human action recognition. In VISIGRAPP (5: VISAPP), pp. 343-350.
[58] Vemulapalli R., Arrate F., and Chellappa R. (2014). Human action recognition by representing 3D skeletons as points in a Lie group. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 588-595.
[59] Cai X., Zhou W., Wu L., Luo J., and Li H. (2015). Effective active skeleton representation for low latency human action recognition. IEEE Transactions on Multimedia, 18(2):pp. 141-154.
[60] Boujebli M., Drira H., Mestiri M., and Farah I.R. (2020). Rate-invariant modeling in Lie algebra for activity recognition. Electronics, 9(11):pp. 1-16.
[61] Jalal A., Kim Y.H., Kim Y.J., Kamal S., and Kim D. (2017). Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognition, 61:pp. 295-308.
[62] Yan S., Xiong Y., and Lin D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, pp. 1-20.
[63] Ke Q., Bennamoun M., An S., Sohel F., and Boussaid F. (2017). A new representation of skeleton sequences for 3D action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3288-3297.
[64] Li C., Zhong Q., Xie D., and Pu S. (2017). Skeleton-based action recognition with convolutional neural networks. In 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 597-600. IEEE.
[65] Li B., Dai Y., Cheng X., Chen H., Lin Y., and He M. (2017). Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 601-604. IEEE.
[66] Kim T.S. and Reiter A. (2017). Interpretable 3D human action analysis with temporal convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1623-1631. IEEE.
[67] Liu M., Liu H., and Chen C. (2017). Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:pp. 346-362.
[68] Li C., Zhong Q., Xie D., and Pu S. (2018). Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 786-792.
[69] Caetano C., Sena J., Brémond F., Dos Santos J.A., and Schwartz W.R. (2019). SkeleMotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-8. IEEE.
[70] Du Y., Wang W., and Wang L. (2015). Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1110-1118.
[71] Liu J., Shahroudy A., Xu D., and Wang G. (2016). Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision, pp. 816-833. Springer.
[72] Song S., Lan C., Xing J., Zeng W., and Liu J. (2017). An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, pp. 1-7.
[73] Zhang P., Lan C., Xing J., Zeng W., Xue J., and Zheng N. (2017). View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2117-2126.
[74] Si C., Jing Y., Wang W., Wang L., and Tan T. (2018). Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 103-118.
[75] Yang Z., Li Y., Yang J., and Luo J. (2018). Action recognition with spatio-temporal visual attention on skeleton image sequences. IEEE Transactions on Circuits and Systems for Video Technology, 29(8):pp. 2405-2415.
[76] Zheng W., Li L., Zhang Z., Huang Y., and Wang L. (2019). Relational network for skeleton-based action recognition. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 826-831. IEEE.
[77] Zhu W., Lan C., Xing J., Zeng W., Li Y., Shen L., and Xie X. (2016). Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, pp. 1-7.
[78] Veeriah V., Zhuang N., and Qi G.J. (2015). Differential recurrent neural networks for action recognition. In Proceedings of the IEEE international conference on computer vision, pp. 4041-4049.
[79] Liu J., Wang G., Hu P., Duan L.Y., and Kot A.C. (2017). Global context-aware attention LSTM networks for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1647-1656.
[80] Wang H. and Wang L. (2017). Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 499-508.
[81] Li S., Li W., Cook C., Zhu C., and Gao Y. (2018). Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5457-5466.
[82] Li S., Li W., Cook C., and Gao Y. (2019). Deep independently recurrent neural network (IndRNN). arXiv preprint arXiv:1910.06251, pp. 1-18.
[83] Ngoc N.T. and Cuong P.V. (2015). Nhan dang hoat dong cua nguoi bang dien thoai thong minh [Human activity recognition using smartphones]. Chuyen san cac cong trinh nghien cuu, phat trien va ung dung cong nghe thong tin va truyen thong [Journal of Research, Development and Application on Information and Communication Technology], pp. 33-33.
[84] Diep N.N. (2016). Nghien cuu phuong phap hoc may cho nhan dang hoat dong su dung cam bien mang tren nguoi [A study of machine learning methods for activity recognition using wearable sensors]. Ph.D. thesis, PTIT.
[85] Tran D.N., Nguyen V.A., Do V.M., Bui T.N., and Tran V.T. (2020). Nhan dang hoat dong cho nguoi dua tren viec su dung cam bien gia toc ba chieu [Human activity recognition based on three-axis accelerometer sensors]. In Proceedings of the Selected Topics in Information Technology and Communications, pp. 324-328.
[86] Pham C.H., Le Q.K., and Le T.H. (2014). Human action recognition using dynamic time warping and voting algorithm. VNU Journal of Science: Computer Science and Communication Engineering, 30(3):pp. 22-30.
[87] Viet V.H., Ngoc L.Q., Son T.T., and Hoang P.M. (2015). Multiple kernel learning and optical flow for action recognition in RGB-D video. In 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE), pp. 222-227. IEEE.
[88] Nguyen T.N., Vo D.H., Huynh H.H., and Meunier J. (2014). Geometry-based static hand gesture recognition using support vector machine. In 2014 13th International Conference on Control Automation Robotics and Vision (ICARCV), pp. 769-774. IEEE.
[89] Nguyen D.D. and Le H.S. (2015). Kinect gesture recognition: SVM vs RVM. In 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE), pp. 395-400. IEEE.
[90] Pham C., Nguyen L., Nguyen A., Nguyen N., and Nguyen V.T. (2021). Combining skeleton and accelerometer data for human fine-grained activity recognition and abnormal behaviour detection with deep temporal convolutional networks. Multimedia Tools and Applications, pp. 1-22.
[91] Phan H.H., Vu N.S., Nguyen V.L., and Quoy M. (2018). Action recognition based on motion of oriented magnitude patterns and feature selection. IET Computer Vision, 12(5):pp. 735-743.
[92] Phan H.H. and Vu N.S. (2019). Information theory based pruning for CNN compression and its application to image classification and action recognition. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-8. IEEE.
[93] Phan H.H., Ha C.T., and Nguyen T.T. (2020). Improving the efficiency of human action recognition using deep compression. In 2020 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pp. 1-6. IEEE.
[94] Nguyen H.T. and Nguyen T.O. (2021). Attention-based network for effective action recognition from multi-view video. Procedia Computer Science, 192:pp. 971-980.
[95] Wu D. (2019). Video-based similar gesture action recognition using deep learning and GAN-based approaches. Ph.D. thesis, University of Technology Sydney.
[96] Bergelin V. (2017). Human activity recognition and behavioral prediction using wearable sensors and deep learning. Master's thesis, Linköpings Universitet.
[97] Song Y.F., Zhang Z., and Wang L. (2019). Richly activated graph convolutional network for action recognition with incomplete skeletons. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1-5. IEEE.
[98] Tang Y., Tian Y., Lu J., Li P., and Zhou J. (2018). Deep progressive reinforcement learning for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5323-5332.
[99] Li M., Chen S., Chen X., Zhang Y., Wang Y., and Tian Q. (2019). Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3595-3603.
[100] Li B., Li X., Zhang Z., and Wu F. (2019). Spatio-temporal graph routing for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 8561-8568.
[101] Shi L., Zhang Y., Cheng J., and Lu H. (2019). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12026-12035.
[102] Liu Z., Zhang H., Chen Z., Wang Z., and Ouyang W. (2020). Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 143-152.
[103] Song Y.F., Zhang Z., Shan C., and Wang L. (2022). Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[104] Shi L., Zhang Y., Cheng J., and Lu H. (2020). Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing, pp. 9532-9545.
[105] Buffelli D. (2019). A Deep Learning Model for Personalised Human Activity Recognition. Master's thesis, University of Padua, Italy.
[106] Woo S., Park J., Lee J.Y., and Kweon I.S. (2018). CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3-19.
[107] HarisIqbal88. PlotNeuralNet. https://github.com/HarisIqbal88/PlotNeuralNet [Online; accessed 06-August-2021].
[108] Duan H., Zhao Y., Chen K., Shao D., Lin D., and Dai B. (2021). Revisiting skeleton-based action recognition. arXiv preprint arXiv:2104.13586, pp. 1-16.
[109] Vu T.H. Bai 7: Gradient descent (phan 1/2) [Lesson 7: Gradient descent (part 1/2)]. https://machinelearningcoban.com/2017/01/12/gradientdescent/ [Online; accessed 05-May-2022].
[110] Ke Q., Liu J., Bennamoun M., Rahmani H., An S., Sohel F., and Boussaid F. (2018). Global regularizer and temporal-aware cross-entropy for skeleton-based early action recognition. In Asian Conference on Computer Vision, pp. 729-745. Springer.
[111] Thi-Lan Le, Cao-Cuong Than, Nguyen H.Q., and Pham V.C. (2020). Adaptive graph convolutional network with richly activated for skeleton-based human activity recognition. In International Conference on Communications and Electronics (ICCE), pp. 1-6.
[112] Van der Maaten L. and Hinton G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11):pp. 2579-2605.
[113] Song Y.F., Zhang Z., Shan C., and Wang L. (2020). Richly activated graph convolutional network for robust skeleton-based action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 31(5):pp. 1915-1925.
[114] Matplotlib. Choosing colormaps in Matplotlib. https://matplotlib.org/stable/tutorials/colors/colormaps.html [Online; accessed 28-November-2021].
[115] Heidari N. and Iosifidis A. (2021). Progressive spatio-temporal graph convolutional network for skeleton-based human action recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3220-3224. IEEE.
[116] Song Y.F., Zhang Z., Shan C., and Wang L. (2020). Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia (ACMMM), pp. 1625-1633. ISBN 9781450379885.
[117] Zhang H., Hou Y., Wang P., Guo Z., and Li W. (2020). SAR-NAS: Skeleton-based action recognition via neural architecture searching. Journal of Visual Communication and Image Representation, 73:pp. 1-6.
[118] Shi F., Lee C., Qiu L., Zhao Y., Shen T., Muralidhar S., Han T., Zhu S.C., and Narayanan V. (2021). STAR: Sparse transformer-based action recognition. arXiv preprint arXiv:2107.07089, pp. 1-11.
[119] Zuo Q., Zou L., Fan C., Li D., Jiang H., and Liu Y. (2020). Whole and part adaptive fusion graph convolutional networks for skeleton-based action recognition. Sensors, 20(24):pp. 1-20.
[120] Peng W., Hong X., and Zhao G. (2021). Tripool: Graph triplet pooling for 3D skeleton-based action recognition. Pattern Recognition, 115:pp. 1-12.
[121] Liu Y., Zhang H., Xu D., and He K. (2022). Graph transformer network with temporal kernel attention for skeleton-based action recognition. Knowledge-Based Systems, pp. 1-16.
[122] Yep T. torchinfo. https://github.com/TylerYep/torchinfo [Online; accessed 20-April-2021].
[123] Molchanov P., Tyree S., Karras T., Aila T., and Kautz J. (2016). Pruning convolutional neural networks for resource efficient inference. In 2017 International Conference on Learning Representations (ICLR 2017), pp. 1-17.
[124] Bulat A. pthflops. https://github.com/1adrianb/pytorch-estimate-flops [Online; accessed 22-July-2021].
[125] Moliner O., Huang S., and Åström K. (2022). Bootstrapped representation learning for skeleton-based action recognition. arXiv preprint arXiv:2202.02232, pp. 1-11.
[126] Nakkiran P. Deep double descent. https://openai.com/blog/deep-double-descent/ [Online; accessed 06-December-2021].
[127] Li C., Wang P., Wang S., Hou Y., and Li W. (2017). Skeleton-based action recognition using LSTM and CNN. In 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 585-590. IEEE.
[128] Xiao R., Hou Y., Guo Z., Li C., Wang P., and Li W. (2019). Self-attention guided deep features for action recognition. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1060-1065. IEEE.
[129] Google. On-device, Real-time Body Pose Tracking with MediaPipe BlazePose. https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html [Online; accessed 01-October-2021].

Table 1.5: List of actions in NTU RGB+D

ID  Action Name              ID  Action Name
 1  drink water              31  point to something
 2  eat meal                 32  taking a selfie
 3  brush teeth              33  check time (watch)
 4  brush hair               34  rub two hands
 5  drop                     35  nod head/bow
 6  pick up                  36  shake head
 7  throw                    37  wipe face
 8  sit down                 38  salute
 9  stand up                 39  put palms together
10  clapping                 40  cross hands in front
11  reading                  41  sneeze/cough
12  writing                  42  staggering
13  tear up paper            43  falling down
14  put on jacket            44  headache
15  take off jacket          45  chest pain
16  put on a shoe            46  back pain
17  take off a shoe          47  neck pain
18  put on glasses           48  nausea/vomiting
19  take off glasses         49  fan self
20  put on a hat/cap         50  punch/slap
21  take off a hat/cap       51  kicking
22  cheer up                 52  pushing
23  hand waving              53  pat on back
24  kicking something        54  point finger
25  reach into pocket        55  hugging
26  hopping                  56  giving object
27  jump up                  57  touch pocket
28  phone call               58  shaking hands
29  play with phone/tablet   59  walking towards
30  type on a keyboard       60  walking apart

Handcraft-based methods typically extract features from skeleton sequences as descriptors, then classify them using machine learning techniques like SVMs [40]. The important features from a sequence of skeleton frames are retrieved using feature descriptors. Some approaches focus on handcrafted spatial and temporal characteristics extracted from the skeleton sequences. The spatial information primarily pertains to the skeleton's structure in a single frame, whereas the temporal information refers to the dependencies across frames.
According to the feature extraction used, handcrafted skeleton-based action recognition methods may be categorized into joint-based and body part-based methods [41]. Traditional skeleton-based methods require extracting motion patterns from a given skeleton sequence, which has led to a great deal of research on handcrafted features; such features, however, are always dataset-dependent [42].
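To make the joint-based pipeline just described concrete, below is a minimal sketch of handcrafted descriptor extraction followed by SVM classification. The specific features (pairwise joint distances for the spatial structure within a frame, per-joint motion statistics for the temporal dynamics across frames) and all names are illustrative assumptions, not a particular method from the surveyed literature:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def handcrafted_descriptor(sequence):
        """Fixed-length descriptor for one skeleton sequence of shape (T, J, 3).

        Spatial part: mean pairwise joint distances within a frame.
        Temporal part: mean and std of per-joint frame-to-frame displacement.
        """
        T, J, _ = sequence.shape
        diffs = sequence[:, :, None, :] - sequence[:, None, :, :]   # (T, J, J, 3)
        dists = np.linalg.norm(diffs, axis=-1)                      # (T, J, J)
        iu = np.triu_indices(J, k=1)
        spatial = dists[:, iu[0], iu[1]].mean(axis=0)               # (J*(J-1)/2,)
        motion = np.linalg.norm(np.diff(sequence, axis=0), axis=-1) # (T-1, J)
        temporal = np.concatenate([motion.mean(axis=0), motion.std(axis=0)])
        return np.concatenate([spatial, temporal])

    # Toy data: 100 random "sequences" of 40 frames x 20 joints, 5 classes
    rng = np.random.default_rng(0)
    X = np.stack([handcrafted_descriptor(rng.normal(size=(40, 20, 3)))
                  for _ in range(100)])
    y = rng.integers(0, 5, size=100)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X[:80], y[:80])
    print("toy accuracy:", clf.score(X[80:], y[80:]))

Descriptors of this kind are fixed-length by construction, which is what allows a conventional classifier such as an SVM to be applied; it is also why they are dataset-dependent, since the feature design encodes assumptions about which joints and motions matter.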
