MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

PHAM DINH TAN

A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA

Major: Computer Engineering
Code: 9480106

DOCTORAL DISSERTATION IN COMPUTER ENGINEERING

SUPERVISORS:
Assoc. Prof. Vu Hai
Assoc. Prof. Le Thi Lan

Hanoi, 2022

DECLARATION OF AUTHORSHIP

I, Pham Dinh Tan, declare that the dissertation titled "A study on deep learning techniques for human action representation and recognition with skeleton data" has been entirely composed by myself. I affirm the following points:

• This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
• Appropriate acknowledgment has been given within this dissertation where reference has been made to the published work of others.
• The dissertation submitted is my own, except where collaborative work has been included. The collaborative contributions have been indicated.

Hanoi, March 08, 2022
Ph.D. Student

SUPERVISORS
Assoc. Prof. Vu Hai
Assoc. Prof. Le Thi Lan

ACKNOWLEDGEMENT

This dissertation was composed during my Ph.D. at the Computer Vision Department, MICA Institute, Hanoi University of Science and Technology. I am grateful to all the people who contributed in different ways to my Ph.D. journey. First, I would like to express my sincere thanks to my supervisors, Assoc. Prof. Vu Hai and Assoc. Prof. Le Thi Lan, for their guidance and support. I would like to thank all MICA members for their help during my Ph.D. study. My sincere thanks go to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Assoc. Prof. Tran Thi Thanh Hai for giving me a lot of support and valuable advice. Many thanks to Dr. Nguyen Thuy Binh, Nguyen Hong Quan, Hoang Van Nam, Nguyen Tien Nam, and Pham Quang Tien for their support. I would like to thank my colleagues at Hanoi University of Mining and Geology for all their support during my Ph.D. study. Special thanks to my family for understanding my hours glued to the computer screen.

Hanoi, March 08, 2022
Ph.D. Student

ABSTRACT

Human action recognition (HAR) from color and depth sensors (RGB-D), especially derived information such as skeleton data, is receiving the research community’s attention due to its wide range of applications. HAR has many practical applications such as abnormal event detection from camera surveillance, gaming, human-machine interaction, elderly monitoring, and virtual/augmented reality. Despite its advantages of fast computation, low storage, and invariance to human appearance, skeleton data has shortcomings. These include pose estimation errors, skeleton noise in complex actions, and incompleteness due to occlusion. Moreover, action recognition remains challenging due to the diversity of human actions, intra-class variations, and inter-class similarities. The dissertation focuses on methods to improve the performance of action recognition using skeleton data. The proposed methods are evaluated using public skeleton datasets collected by RGB-D sensors. Specifically, they consist of MSR-Action3D/MICA-Action3D, datasets with high-quality skeleton data; CMDFALL, a challenging dataset with noise in skeleton data; and NTU RGB+D, a worldwide benchmark among the large-scale datasets. These datasets therefore cover different dataset scales as well as different levels of skeleton data quality. To overcome the limitations of the skeleton data, the
dissertation presents techniques in different approaches. First, as joints have different levels of engagement in each action, techniques for selecting joints that play an important role in human actions are proposed, including both preset joint subset selection and automatic joint subset selection. Two frameworks are evaluated to show the performance of using a subset of joints for action representation. The first framework employs Dynamic Time Warping (DTW) and Fourier Temporal Pyramid (FTP), while the second one applies Covariance Descriptors extracted on both joint position and joint velocity. Experimental results show that joint subset selection helps improve action recognition performance on datasets with noise in skeleton data.

However, HAR based on hand-designed features cannot exploit the inherent graph structure of the human skeleton. Recent Graph Convolutional Networks (GCNs) are studied to handle this issue. Among GCN models, the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) is used as the baseline model. AAGCN achieves state-of-the-art performance on large-scale datasets such as NTU RGB+D and Kinetics. However, AAGCN employs only joint information. Therefore, a Feature Fusion (FF) module is proposed in this dissertation. The new model is named FF-AAGCN. The performance of FF-AAGCN is evaluated on the large-scale dataset NTU RGB+D and on CMDFALL. The evaluation results show that the proposed method is robust to noise and invariant to skeleton translation. In particular, FF-AAGCN achieves remarkable results on challenging datasets.

Finally, as the computing capacity of edge devices is limited, a lightweight deep learning model is expected for application deployment. A lightweight GCN architecture is proposed to show that the complexity of the GCN architecture can still be reduced depending on the dataset’s characteristics. The proposed lightweight model is suitable for application development on edge devices.

Hanoi, March 08, 2022
Ph.D. Student

CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
ABBREVIATIONS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1 LITERATURE REVIEW
1.1 Introduction
1.2 An overview on action recognition
1.3 Data modalities for action recognition
1.3.1 Color data
1.3.2 Depth data
1.3.3 Skeleton data
1.3.4 Other modalities
1.3.5 Multi-modality
1.4 Skeleton data collection
1.4.1 Data collection from motion capture systems
1.4.2 Data collection from RGB+D sensors
1.4.3 Data collection from pose estimation
1.5 Benchmark datasets
1.5.1 MSR-Action3D
1.5.2 MICA-Action3D
1.5.3 CMDFALL
1.5.4 NTU RGB+D
1.6 Skeleton-based action recognition methods
1.6.1 Handcraft-based methods
1.6.1.1 Joint-based action recognition
1.6.1.2 Body part-based action recognition
1.6.2 Deep learning-based methods
1.6.2.1 Convolutional Neural Networks
1.6.2.2 Recurrent Neural Networks
1.7 Research on action recognition in Vietnam
1.8 Conclusion of the chapter
CHAPTER 2 JOINT SUBSET SELECTION FOR SKELETON-BASED HUMAN ACTION RECOGNITION
2.1 Proposed methods
2.1.1 Preset Joint Subset Selection
2.1.1.1 Spatial-Temporal Representation
2.1.1.2 Dynamic Time Warping
2.1.1.3 Fourier Temporal Pyramid
2.1.2 Automatic Joint Subset Selection
2.1.2.1 Joint weight assignment
2.1.2.2 Most informative joint selection
2.1.2.3 Human action recognition based on MIJ joints
2.2 Experimental results
2.2.1 Evaluation metrics
2.2.2 Preset Joint Subset Selection
2.2.3 Automatic Joint Subset Selection
2.3 Conclusion of the chapter
CHAPTER 3 FEATURE FUSION FOR THE GRAPH CONVOLUTIONAL NETWORK
3.1 Introduction
3.2 Related work on Graph Convolutional Networks
3.3 Proposed method
3.4 Experimental results
3.5 Discussion
3.6 Conclusion of the chapter
CHAPTER 4 THE PROPOSED LIGHTWEIGHT GRAPH CONVOLUTIONAL NETWORK
4.1 Introduction
4.2 Related work on Lightweight Graph Convolutional Networks
4.3 Proposed method
4.4 Experimental results
4.5 Application demonstration
4.6 Conclusion of the chapter
CONCLUSION AND FUTURE WORKS
PUBLICATIONS
BIBLIOGRAPHY

ABBREVIATIONS

No.  Abbreviation  Meaning
1    2D            Two-Dimensional
2    3D            Three-Dimensional
3    AAGCN         Attention-enhanced Adaptive Graph Convolutional Network
4    AMIJ          Adaptive number of Most Informative Joints
5    AGCN          Adaptive Graph Convolutional Network
6    AS            Action Set
7    AS-GCN        Actional-Structural Graph Convolutional Network
8    BN            Batch Normalization
9    BPL           Body Part Location
10   CAM           Channel Attention Module
11   CCTV          Closed-Circuit Television
12   CNN           Convolutional Neural Network
13   CovMIJ        Covariance Descriptor on Most Informative Joints
14   CPU           Central Processing Unit
15   CS            Cross-Subject
16   CV            Cross-View
17   DFT           Discrete Fourier Transform
18   DTW           Dynamic Time Warping
19   FC            Fully Connected
20   FF            Feature Fusion
21   FLOP          Floating-Point Operation
22   FMIJ          Fixed number of Most Informative Joints
23   fps           frames per second
24   FTP           Fourier Temporal Pyramid
25   GCN           Graph Convolutional Network
26   GCNN          Graph-based Convolutional Neural Network
27   GPU           Graphics Processing Unit
28   GRU           Gated Recurrent Unit
29   HAR           Human Action Recognition
30   HCI           Human-Computer Interaction

[...]

connecting symmetrical joints achieves excellent performance on CMDFALL with fewer model parameters and FLOPs. The number of parameters is reduced using Preset JSS and layer pruning. Experimental results show that the lightweight model with graph type B (JSS-B) outperforms the baseline AAGCN on challenging datasets while using 5.6 times fewer trainable parameters than the baseline. The computation complexity in FLOPs of the proposed model is 3.5 times lower than that of the baseline on CMDFALL. A study is conducted to evaluate the performance of LW-FF-AAGCN with different dataset sizes. LW-FF-AAGCN is an efficient deep learning model for HAR implementation in real-world applications. A demo is presented using the proposed method for human action recognition. Results in this chapter have been submitted to the Multimedia Tools and Applications (MTAP) journal in the paper "A Lightweight Graph Convolutional Network for Skeleton-based Action Recognition".

CONCLUSION AND FUTURE WORKS

Conclusions

Noise in the skeleton data can degrade the performance of action recognition. Joint subset selection (JSS), feature combining, and the use of graph-based deep learning networks are proposed to improve representation efficiency and recognition performance. The first contribution of the dissertation shows that joint subset selection, with both preset and automatic schemes, helps improve the performance of action recognition. In the second contribution, a Feature Fusion module is coupled with AAGCN to form FF-AAGCN. Feature Fusion is a simple and efficient data pre-processing module for graph-based deep learning, especially for noisy skeleton data. The proposed method FF-AAGCN outperforms the baseline AAGCN on CMDFALL, a challenging dataset with noise in skeleton data. On the large-scale dataset NTU RGB+D, FF-AAGCN obtains competitive results compared to AAGCN. The third contribution is a lightweight model, LW-FF-AAGCN, which has 5.6 times fewer model parameters than the baseline. The proposed lightweight model is suitable for application development on edge devices with limited computation capacity. LW-FF-AAGCN outperforms both AAGCN and FF-AAGCN on CMDFALL. A trade-off in the performance of the lightweight model is observed on the large-scale dataset NTU RGB+D. These results suggest directions for future research in both short-term and long-term perspectives.

Future work

Short-Term Perspectives

• Study on noise in the
skeleton data caused by pose estimation errors using RGB-D sensors. A standard calibrated MoCap system is required for evaluation.
• Study different statistical metrics for Joint Subset Selection, such as the variance of joint angles. Other JSS methods should be implemented on graph-based deep learning networks.
• Develop graph-based lightweight models for application development on edge devices. As computation capacity is limited on edge devices, lightweight models are required for real-time applications. Further research will aim at designing lightweight models from the wide range of deep learning models in the literature.
• Study on the interpretability of action recognition using graph-based deep learning. Deep learning approaches, which are prevalent in the present literature, achieve excellent performance at the expense of the understandability of the learning process. Handcrafted approaches can be deemed less generalizable and more data-type specific in general. They are, however, more intelligible from a human standpoint. The optimum trade-off strategy is still an open question [24].
• Improve the quality of pose estimation for high-performance action recognition.

Long-Term Perspectives

• Extend the proposed methods to continuous skeleton-based human action recognition.
• Extend the study of Graph Convolutional Networks to Geometric Deep Learning. There is a wide range of deep learning models in the literature, including CNNs, RNNs, GCNs, and many others. This leads to a requirement to construct a general mathematical framework for all these models. Geometric Deep Learning is such an approach: it unifies these deep learning models by exploring the mathematics they have in common.
• Develop applications using the proposed models for human action recognition, such as elderly remote monitoring in healthcare or camera surveillance for abnormal behavior detection.

PUBLICATIONS

Conferences

[C1] Tien-Nam Nguyen, Dinh-Tan Pham, Thi-Lan Le, Hai Vu, and Thanh-Hai Tran (2018), Novel Skeleton-based Action Recognition Using Covariance Descriptors on Most Informative Joints, Proceedings of International Conference on Knowledge and Systems Engineering (KSE 2018), IEEE, Vietnam, ISBN: 978-1-5386-6113-0, pp. 50-55, 2018.
[C2] Dinh-Tan Pham, Tien-Nam Nguyen, Thi-Lan Le, and Hai Vu (2019), Analyzing Role of Joint Subset Selection in Human Action Recognition, Proceedings of NAFOSTED Conference on Information and Computer Science (NICS 2019), IEEE, Vietnam, ISBN: 978-1-7281-5163-2, pp. 61-66, 2019.
[C3] Dinh-Tan Pham, Tien-Nam Nguyen, Thi-Lan Le, and Hai Vu (2020), Spatio-Temporal Representation for Skeleton-based Human Action Recognition, Proceedings of International Conference on Multimedia Analysis and Pattern Recognition (MAPR 2020), IEEE, Vietnam, ISBN: 978-1-7281-6555-4, pp. 1-6, 2020.

Journals

[J1] Dinh-Tan Pham, Quang-Tien Pham, Thi-Lan Le, and Hai Vu (2021), An Efficient Feature Fusion of Graph Convolutional Networks and its application for Real-Time Traffic Control Gestures Recognition, IEEE Access, ISSN: 2169-3536, pp. 121930-121943, 2021 (ISI, Q1).
[J2] Van-Toi Nguyen, Tien-Nam Nguyen, Thi-Lan Le, Dinh-Tan Pham, and Hai Vu (2020), Adaptive most joint selection and covariance descriptions for a robust skeleton-based human action recognition, Multimedia Tools and Applications (MTAP), Springer, DOI: 10.1007/s11042-021-10866-4, pp. 1-27, 2021 (ISI, Q1).
[J3] Dinh Tan Pham, Thi Phuong Dang, Duc Quang Nguyen, Thi Lan Le, and Hai Vu (2021), Skeleton-based Action Recognition Using Feature Fusion for Spatial-Temporal Graph Convolutional Networks, Journal of Science and Technique, Le Quy Don Technical University (LQDTU-JST), ISSN 1859-0209, pp. 7-24, 2021.

BIBLIOGRAPHY

[1] Hoang V.N., Le T.L., Tran T.H., Nguyen V.T., et al. (2019). 3D skeleton-based action recognition with convolutional neural networks. In 2019 International Conference on Multimedia
Analysis and Pattern Recognition (MAPR), pp. 1–6. IEEE.
[2] Johansson G. (1973). Visual perception of biological motion and a model for its analysis. Perception & psychophysics, 14(2):pp. 201–211.
[3] Google Mediapipe. https://github.com/google/mediapipe [Online; accessed 01-October-2021].
[4] Chen X. et al. (2014). Real-time Action Recognition for RGB-D and Motion Capture Data. Ph.D. thesis, Aalto University.
[5] Zhang J., Li W., Ogunbona P.O., Wang P., and Tang C. (2016). RGB-D-based action recognition datasets: A survey. Pattern Recognition, 60:pp. 86–105.
[6] Herath S., Harandi M., and Porikli F. (2017). Going deeper into action recognition: A survey. Image and vision computing, 60:pp. 4–21.
[7] Wang Q. (2016). A survey of visual analysis of human motion and its applications. arXiv preprint arXiv:1608.00700.
[8] Presti L.L. and La Cascia M. (2016). 3D skeleton-based human action classification: A survey. Pattern Recognition, 53:pp. 130–147.
[9] Kong Y. and Fu Y. (2018). Human action recognition and prediction: A survey. arXiv preprint arXiv:1806.11230.
[10] Biliński P.T. (2014). Human action recognition in videos. Ph.D. thesis, Université Nice Sophia Antipolis.
[11] Bux A. (2017). Vision-based human action recognition using machine learning techniques. Lancaster University (United Kingdom).
[12] Beddiar D.R., Nini B., Sabokrou M., and Hadid A. (2020). Vision-based human activity recognition: a survey. Multimedia Tools and Applications, 79(41):pp. 30509–30555.
[13] Agarwal P. and Alam M. (2020). A lightweight deep learning model for human activity recognition on edge devices. Procedia Computer Science, 167:pp. 2364–2373.
[14] Das S. (2020). Spatio-Temporal Attention Mechanism for Activity Recognition. Ph.D. thesis, Université Côte d’Azur.
[15] Koperski M. (2017). Human action recognition in videos with local representation. Ph.D. thesis, COMUE Université Côte d’Azur (2015-2019).
[16] Wang P. (2017). Action recognition from RGB-D data. Ph.D. thesis, University of Wollongong.
[17] Li W., Zhang Z., and Liu Z. (2010). Action recognition based on a bag of 3D points. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 9–14. IEEE.
[18] Ren B., Liu M., Ding R., and Liu H. (2020). A survey on 3D skeleton-based action recognition using learning method. arXiv preprint arXiv:2002.05907.
[19] Sun Z., Liu J., Ke Q., Rahmani H., Bennamoun M., and Wang G. (2020). Human action recognition from various data modalities: A review. arXiv preprint arXiv:2012.11866.
[20] Tran T.H., Le T.L., Pham D.T., Hoang V.N., Khong V.M., Tran Q.T., Nguyen T.S., and Pham C. (2018). A multi-modal multi-view dataset for human fall analysis and preliminary investigation on modality. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1947–1952. IEEE.
[21] Adama D.O.A. (2020). Fuzzy Transfer Learning in Human Activity Recognition. Nottingham Trent University (United Kingdom).
[22] Shotton J., Fitzgibbon A., Cook M., Sharp T., Finocchio M., Moore R., Kipman A., and Blake A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR 2011, pp. 1297–1304. IEEE.
[23] Al-Akam R. (2021). Human Action Recognition in Video Data using Color and Distance. Ph.D. thesis, University of Koblenz and Landau.
[24] Angelini F. (2020). Novel methods for posture-based human action recognition and activity anomaly detection. Ph.D. thesis, Newcastle University.
[25] Cao Z., Hidalgo Martinez G., Simon T., Wei S., and Sheikh Y.A. (2019). OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
[26] Google Mediapipe pose. https://google.github.io/mediapipe/solutions/pose [Online; accessed 15-September-2021].
[27] Yokoyama N. Pose estimation and person description using convolutional neural networks. http://naoki.io/portfolio/person_descrip.html [Online; accessed 15-September-2021].
[28] Kuehne H., Jhuang H., Garrote E., Poggio T., and Serre T. (2011). HMDB: a large video database for human motion recognition. In 2011 International conference on computer vision, pp. 2556–2563. IEEE.
[29] Soomro K., Zamir A.R., and Shah M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
[30] Xia L., Chen C.C., and Aggarwal J.K. (2012). View invariant human action recognition using histograms of 3D joints. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27. IEEE.
[31] Ofli F., Chaudhry R., Kurillo G., Vidal R., and Bajcsy R. (2013). Berkeley MHAD: A comprehensive multimodal human action database. In 2013 IEEE Workshop on Applications of Computer Vision (WACV), pp. 53–60. IEEE.
[32] Oreifej O. and Liu Z. (2013). HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 716–723.
[33] Shahroudy A., Liu J., Ng T.T., and Wang G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1010–1019.
[34] Kay W., Carreira J., Simonyan K., Zhang B., Hillier C., Vijayanarasimhan S., Viola F., Green T., Back T., Natsev P., et al. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
[35] Liu J., Shahroudy A., Perez M., Wang G., Duan L.Y., and Kot A.C. (2019). NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE transactions on pattern analysis and machine intelligence, 42(10):pp. 2684–2701.
[36] Li T., Liu J., Zhang W., Ni Y., Wang W., and Li Z. (2021). UAV-Human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[37] Bloom V. (2015). Multiple Action Recognition for Video Games (MARViG). Ph.D. thesis, Kingston University.
[38] Mazari A. (2020). Deep Learning for Action Recognition in Videos. Ph.D. thesis, Sorbonne Université.
[39] Wang L. (2021). Analysis and evaluation of kinect-based action recognition algorithms. arXiv preprint arXiv:2112.08626.
[40] Wang J., Liu Z., Wu Y., and Yuan J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1290–1297. IEEE.
[41] Yang X. and Tian Y.L. (2012). Eigenjoints-based action recognition using Naive-Bayes-Nearest-Neighbor. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 14–19. IEEE.
[42] Hussein M.E., Torki M., Gowayyed M.A., and El-Saban M. (2013). Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence.
[43] Gaglio S., Re G.L., and Morana M. (2014). Human activity recognition process using 3-D posture data. IEEE Transactions on Human-Machine Systems, 45(5):pp. 586–597.
[44] Zanfir M., Leordeanu M., and Sminchisescu C. (2013). The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In Proceedings of the IEEE international conference on computer vision, pp. 2752–2759.
[45] Ghorbel E., Boutteau R., Boonaert J., Savatier X., and Lecoeuche S. (2015). 3D real-time human action recognition using a spline interpolation approach. In 2015 International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 61–66. IEEE.
[46] Wang C., Wang Y., and Yuille A.L. (2013). An approach to pose-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 915–922.
[47] Wei P., Zheng N., Zhao Y., and Zhu S.C. (2013). Concurrent action detection with structural prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3136–3143.
[48] Zhou L., Li W., Zhang Y., Ogunbona P., Nguyen D.T., and Zhang H. (2014).
Discriminative key pose extraction using extended LC-KSVD for action recognition. In 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8. IEEE.
[49] Eweiwi A., Cheema M.S., Bauckhage C., and Gall J. (2014). Efficient pose-based action recognition. In Asian conference on computer vision, pp. 428–443. Springer.
[50] Cippitelli E., Gasparrini S., Gambi E., and Spinsante S. (2016). A human activity recognition system using skeleton data from RGBD sensors. Computational intelligence and neuroscience, 2016.
[51] El-Ghaish H.A., Shoukry A., and Hussein M.E. (2018). CovP3DJ: Skeleton-parts-based-covariance descriptor for human action recognition. In VISIGRAPP (5: VISAPP), pp. 343–350.
[52] Vemulapalli R., Arrate F., and Chellappa R. (2014). Human action recognition by representing 3D skeletons as points in a Lie group. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 588–595.
[53] Cai X., Zhou W., Wu L., Luo J., and Li H. (2015). Effective active skeleton representation for low latency human action recognition. IEEE Transactions on Multimedia, 18(2):pp. 141–154.
[54] Ofli F., Chaudhry R., Kurillo G., Vidal R., and Bajcsy R. (2014). Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition. Journal of Visual Communication and Image Representation, 25(1):pp. 24–38.
[55] Boujebli M., Drira H., Mestiri M., and Farah I.R. (2020). Rate-invariant modeling in Lie algebra for activity recognition. Electronics, 9(11):p. 1888.
[56] Jalal A., Kim Y.H., Kim Y.J., Kamal S., and Kim D. (2017). Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern recognition, 61:pp. 295–308.
[57] Yan S., Xiong Y., and Lin D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-second AAAI conference on artificial intelligence.
[58] Ke Q., Bennamoun M., An S., Sohel F., and Boussaid F. (2017). A new representation of skeleton sequences for 3D action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3288–3297.
[59] Li C., Zhong Q., Xie D., and Pu S. (2017). Skeleton-based action recognition with convolutional neural networks. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 597–600. IEEE.
[60] Li B., Dai Y., Cheng X., Chen H., Lin Y., and He M. (2017). Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 601–604. IEEE.
[61] Kim T.S. and Reiter A. (2017). Interpretable 3D human action analysis with temporal convolutional networks. In 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW), pp. 1623–1631. IEEE.
[62] Liu M., Liu H., and Chen C. (2017). Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:pp. 346–362.
[63] Li C., Zhong Q., Xie D., and Pu S. (2018). Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055.
[64] Caetano C., Sena J., Brémond F., Dos Santos J.A., and Schwartz W.R. (2019). Skelemotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–8. IEEE.
[65] Du Y., Wang W., and Wang L. (2015). Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1110–1118.
[66] Liu J., Shahroudy A., Xu D., and Wang G. (2016). Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision, pp. 816–833. Springer.
[67] Song S., Lan C., Xing J., Zeng W., and Liu J. (2016). An end-to-end spatiotemporal attention model for human action recognition from skeleton data. arXiv preprint arXiv:1611.06067.
[68] Zhang P., Lan C., Xing J., Zeng W., Xue J., and Zheng N. (2017). View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2117–2126.
[69] Si C., Jing Y., Wang W., Wang L., and Tan T. (2018). Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 103–118.
[70] Yang Z., Li Y., Yang J., and Luo J. (2018). Action recognition with spatio-temporal visual attention on skeleton image sequences. IEEE Transactions on Circuits and Systems for Video Technology, 29(8):pp. 2405–2415.
[71] Zheng W., Li L., Zhang Z., Huang Y., and Wang L. (2019). Relational network for skeleton-based action recognition. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 826–831. IEEE.
[72] Zhu W., Lan C., Xing J., Zeng W., Li Y., Shen L., and Xie X. (2016). Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI conference on artificial intelligence, volume 30.
[73] Veeriah V., Zhuang N., and Qi G.J. (2015). Differential recurrent neural networks for action recognition. In Proceedings of the IEEE international conference on computer vision, pp. 4041–4049.
[74] Liu J., Wang G., Hu P., Duan L.Y., and Kot A.C. (2017). Global context-aware attention LSTM networks for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1647–1656.
[75] Wang H. and Wang L. (2017). Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 499–508.
[76] Song S., Lan C., Xing J., Zeng W., and Liu J. (2017). An end-to-end spatiotemporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI conference on artificial intelligence, volume 31.
[77] Li S., Li W., Cook C., Zhu C., and Gao Y. (2018). Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5457–5466.
[78] Li S., Li W., Cook C., and Gao Y. (2019). Deep independently recurrent neural network (IndRNN). arXiv preprint arXiv:1910.06251.
[79] Ngoc N.T. and Cuong P.V. (2015). Nhan dang hoat dong cua nguoi bang dien thoai thong minh. Chuyen san cac cong trinh nghien cuu, phat trien va ung dung cong nghe thong tin va Truyen thong, pp. 33–33.
[80] Diep N.N. (2016). Nghien cuu phuong phap hoc may cho nhan dang hoat dong su dung cam bien mang tren nguoi.
[81] Tran D.N., Nguyen V.A., Do V.M., Bui T.N., and Tran V.T. (2020). Nhan dang hoat dong cho nguoi dua tren viec su dung cam bien gia toc ba chieu. In Proceedings of the Selected Topics in Information Technology and Communications.
[82] Pham C.H., Le Q.K., and Le T.H. (2014). Human action recognition using dynamic time warping and voting algorithm. VNU Journal of Science: Computer Science and Communication Engineering, 30(3).
[83] Viet V.H., Ngoc L.Q., Son T.T., and Hoang P.M. (2015). Multiple kernel learning and optical flow for action recognition in RGB-D video. In 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE), pp. 222–227. IEEE.
[84] Nguyen T.N., Vo D.H., Huynh H.H., and Meunier J. (2014). Geometry-based static hand gesture recognition using support vector machine. In 2014 13th International Conference on Control Automation Robotics & Vision (ICARCV), pp. 769–774. IEEE.
[85] Nguyen D.D. and Le H.S. (2015). Kinect gesture recognition: SVM vs RVM. In 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE), pp. 395–400. IEEE.
[86] Pham C., Nguyen L., Nguyen A., Nguyen N., and Nguyen V.T (2021) Combining skeleton and accelerometer data for human fine-grained activity recognition and abnormal behaviour detection with deep temporal convolutional networks Multimedia Tools and Applications, pp 1–22 [87] Phan H.H., Vu N.S., Nguyen V.L., and Quoy M (2018) Action recognition based on motion of oriented magnitude patterns and feature selection IET Computer Vision, 12(5):pp 735–743 [88] Phan H.H and Vu N.S (2019) Information theory based pruning for cnn compression and its application to image classification and action recognition In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp 1–8 IEEE [89] Phan H.H., Ha C.T., and Nguyen T.T (2020) Improving the efficiency of human action recognition using deep compression In 2020 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pp 1–6 IEEE [90] Nguyen H.T and Nguyen T.O (2021) Attention-based network for effective action recognition from multi-view video Procedia Computer Science, 192:pp 971–980 [91] Wu D (2019) Video-based similar gesture action recognition using deep learning and GAN-based approaches Ph.D thesis, University of Technology Sydney [92] Song Y.F., Zhang Z., and Wang L (2019) Richly activated graph convolutional network for action recognition with incomplete skeletons In 2019 IEEE International Conference on Image Processing (ICIP), pp 1–5 IEEE [93] El-Ghaish H., Shoukry A., and Hussein M (01 2018) CovP3DJ: Skeleton-partsbased-covariance descriptor for human action recognition VISAPP doi:10.5220/ 0006625703430350 109 ho tro tai file : luanvanchat@gmail.com [94] Tang Y., Tian Y., Lu J., Li P., and Zhou J (2018) Deep progressive reinforcement learning for skeleton-based action recognition In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5323–5332 [95] Li M., Chen S., Chen X., Zhang Y., Wang Y., and Tian Q (2019) Actionalstructural 
graph convolutional networks for skeleton-based action recognition In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3595–3603 [96] Li B., Li X., Zhang Z., and Wu F (2019) Spatio-temporal graph routing for skeleton-based action recognition In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp 8561–8568 [97] Shi L., Zhang Y., Cheng J., and Lu H (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 12026–12035 [98] Liu Z., Zhang H., Chen Z., Wang Z., and Ouyang W (2020) Disentangling and unifying graph convolutions for skeleton-based action recognition In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 143–152 [99] Song Y.F., Zhang Z., Shan C., and Wang L (2021) Constructing stronger and faster baselines for skeleton-based action recognition arXiv preprint arXiv:2106.15125 [100] Shi L., Zhang Y., Cheng J., and Lu H (2020) Skeleton-based action recognition with multi-stream adaptive graph convolutional networks IEEE Transactions on Image Processing, pp 9532–9545 [101] Woo S., Park J., Lee J.Y., and Kweon I.S (2018) Cbam: Convolutional block attention module In Proceedings of the European conference on computer vision (ECCV), pp 3–19 [102] HarisIqbal88 Plotneuralnet https://github.com/HarisIqbal88/ PlotNeuralNet [Online; accessed 06-August-2021] [103] Ke Q., Liu J., Bennamoun M., Rahmani H., An S., Sohel F., and Boussaid F (2018) Global regularizer and temporal-aware cross-entropy for skeleton-based early action recognition In Asian Conference on Computer Vision, pp 729–745 Springer [104] Thi-Lan Le Cao-Cuong Than H.Q.N and Pham V.C (2020) Adaptive graph convolutional network with richly activated for skeleton-based human activity recognition In International Conference on Communications and Electronics (ICCE), pp 1–6 110 ho tro tai file : 
luanvanchat@gmail.com [105] Van der Maaten L and Hinton G (2008) Visualizing data using t-sne Journal of machine learning research, 9(11) [106] Song Y.F., Zhang Z., Shan C., and Wang L (2020) Richly activated graph convolutional network for robust skeleton-based action recognition IEEE Transactions on Circuits and Systems for Video Technology, 31(5):pp 1915–1925 [107] Matplotlib Choosing colormaps in matplotlib https://matplotlib.org/ stable/tutorials/colors/colormaps.html [Online; accessed 28-November2021] [108] Heidari N and Iosifidis A (2021) Progressive spatio-temporal graph convolutional network for skeleton-based human action recognition In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 3220–3224 IEEE [109] Song Y.F., Zhang Z., Shan C., and Wang L (2020) Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition In Proceedings of the 28th ACM International Conference on Multimedia (ACMMM), pp 1625–1633 ISBN 9781450379885 [110] Zhang H., Hou Y., Wang P., Guo Z., and Li W (2020) SAR-NAS: Skeleton-based action recognition via neural architecture searching Journal of Visual Communication and Image Representation, 73:p 102942 [111] Shi F., Lee C., Qiu L., Zhao Y., Shen T., Muralidhar S., Han T., Zhu S.C., and Narayanan V (2021) STAR: Sparse transformer-based action recognition arXiv preprint arXiv:2107.07089 [112] Zuo Q., Zou L., Fan C., Li D., Jiang H., and Liu Y (2020) Whole and part adaptive fusion graph convolutional networks for skeleton-based action recognition Sensors, 20(24):p 7149 [113] Yep T torchinfo https://github.com/TylerYep/torchinfo [Online; accessed 20-April-2021] [114] Molchanov P., Tyree S., Karras T., Aila T., and Kautz J (2016) Pruning convolutional neural networks for resource efficient inference arXiv preprint arXiv:1611.06440 [115] Bulat A pthflops https://github.com/1adrianb/pytorch-estimate-flops [Online; accessed 22-July-2021] [116] 
Li C., Wang P., Wang S., Hou Y., and Li W (2017) Skeleton-based action recognition using LSTM and CNN In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp 585–590 IEEE 111 ho tro tai file : luanvanchat@gmail.com [117] Xiao R., Hou Y., Guo Z., Li C., Wang P., and Li W (2019) Self-attention guided deep features for action recognition In 2019 IEEE International Conference on Multimedia and Expo (ICME), pp 1060–1065 IEEE [118] Nakkiran P Deep double descent https://openai.com/blog/ deep-double-descent/ [Online; accessed 06-December-2021] [119] Google diapipe On-device, real-time blazepose body pose tracking with me- https://ai.googleblog.com/2020/08/ on-device-real-time-body-pose-tracking.html October-2021] [Online; accessed 01- 112 ho tro tai file : luanvanchat@gmail.com