A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

PHAM DINH TAN

A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA

Major: Computer Engineering
Code: 9480106

DOCTORAL DISSERTATION IN COMPUTER ENGINEERING

SUPERVISORS:
Assoc. Prof. Vu Hai
Assoc. Prof. Le Thi Lan

Hanoi - 2022

DECLARATION OF AUTHORSHIP

I, Pham Dinh Tan, declare that the dissertation titled "A study on deep learning techniques for human action representation and recognition with skeleton data" has been entirely composed by myself. I assure the following points:

□ This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
□ The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
□ Appropriate acknowledgment has been given within this dissertation where reference has been made to the published work of others.
□ The dissertation submitted is my own, except where work done in collaboration has been included. The collaborative contributions have been indicated.

Hanoi, March 08, 2022
Ph.D. Student

SUPERVISORS
Assoc. Prof. Vu Hai
Assoc. Prof. Le Thi Lan

ACKNOWLEDGEMENT

This dissertation was composed during my Ph.D. at the Computer Vision Department, MICA Institute, Hanoi University of Science and Technology. I am grateful to all the people who contributed in different ways to my Ph.D. journey. First, I would like to express my sincere thanks to my supervisors, Assoc. Prof. Vu Hai and Assoc. Prof. Le Thi Lan, for their guidance and support. I would like to thank all MICA members for their help during my Ph.D. study. My sincere thanks to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Assoc. Prof. Tran Thi Thanh Hai for giving me a lot of support and valuable advice. Many thanks to Dr. Nguyen Thuy Binh, Nguyen Hong Quan, Hoang Van Nam, Nguyen Tien Nam, and Pham Quang Tien for their support. I would like to thank my colleagues at Hanoi University of Mining and Geology for all their support during my Ph.D. study. Special thanks to my family for understanding my hours glued to the computer screen.

Hanoi, March 08, 2022
Ph.D. Student

ABSTRACT

Human action recognition (HAR) from color and depth (RGB-D) sensors, especially from derived information such as skeleton data, is receiving the research community's attention due to its wide range of applications. HAR has many practical applications, such as abnormal event detection in camera surveillance, gaming, human-machine interaction, elderly monitoring, and virtual/augmented reality. In addition to its advantages of fast computation, low storage, and invariance to human appearance, skeleton data have shortcomings, including pose estimation errors, skeleton noise in complex actions, and incompleteness due to occlusion. Moreover, action recognition remains challenging due to the diversity of human actions, intra-class variations, and inter-class similarities. This dissertation focuses on methods to improve the performance of action recognition using skeleton data. The proposed methods are evaluated on public skeleton datasets collected with RGB-D sensors: MSR-Action3D/MICA-Action3D, datasets with high-quality skeleton data; CMDFALL, a challenging dataset with noisy skeleton data; and NTU RGB+D, a worldwide benchmark among the large-scale datasets. These datasets therefore cover different dataset scales as well as different levels of skeleton data quality. To overcome the limitations of the skeleton data, the dissertation presents techniques in different approaches. First, as joints have different levels of engagement in each action, techniques for selecting
joints that play an important role in human actions are proposed, including both preset joint subset selection and automatic joint subset selection. Two frameworks are evaluated to show the performance of using a subset of joints for action representation. The first framework employs Dynamic Time Warping (DTW) and the Fourier Temporal Pyramid (FTP), while the second applies covariance descriptors extracted from both joint position and joint velocity. Experimental results show that joint subset selection helps improve action recognition performance on datasets with noisy skeleton data. However, HAR based on hand-designed features cannot exploit the inherent graph structure of the human skeleton. Recent Graph Convolutional Networks (GCNs) are studied to handle these issues. Among GCN models, the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) is used as the baseline model. AAGCN achieves state-of-the-art performance on large-scale datasets such as NTU RGB+D and Kinetics. However, AAGCN employs only joint information. Therefore, a Feature Fusion (FF) module is proposed in this dissertation; the new model is named FF-AAGCN. The performance of FF-AAGCN is evaluated on the large-scale NTU RGB+D dataset and on CMDFALL. The evaluation results show that the proposed method is robust to noise and invariant to skeleton translation. In particular, FF-AAGCN achieves remarkable results on challenging datasets. Finally, as the computing capacity of edge devices is limited, a lightweight deep learning model is desirable for application deployment. A lightweight GCN architecture is proposed to show that the complexity of a GCN architecture can still be reduced depending on the dataset's characteristics. The proposed lightweight model is suitable for application development on edge devices.

CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
ABBREVIATIONS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1. LITERATURE REVIEW
1.1 Introduction
1.2 An overview on action recognition
1.3 Data modalities for action recognition
1.3.1 Color data
1.3.2 Depth data
1.3.3 Skeleton data
1.3.4 Other modalities
1.3.5 Multi-modality
1.4 Skeleton data collection
1.4.1 Data collection from motion capture systems
1.4.2 Data collection from RGB+D sensors
1.4.3 Data collection from pose estimation
1.5 Benchmark datasets
1.5.1 MSR-Action3D
1.5.2 MICA-Action3D
1.5.3 CMDFALL
1.5.4 NTU RGB+D
1.6 Skeleton-based action recognition methods
1.6.1 Handcraft-based methods
1.6.1.1 Joint-based action recognition
1.6.1.2 Body part-based action recognition
1.6.2 Deep learning-based methods
1.6.2.1 Convolutional Neural Networks
1.6.2.2 Recurrent Neural Networks
1.7 Research on action recognition in Vietnam
1.8 Conclusion of the chapter
CHAPTER 2. JOINT SUBSET SELECTION FOR SKELETON-BASED HUMAN ACTION RECOGNITION
2.1 Proposed methods
2.1.1 Preset Joint Subset Selection
2.1.1.1 Spatial-Temporal Representation
2.1.1.2 Dynamic Time Warping
2.1.1.3 Fourier Temporal Pyramid
2.1.2 Automatic Joint Subset Selection
2.1.2.1 Joint weight assignment
2.1.2.2 Most informative joint selection
2.1.2.3 Human action recognition based on MIJ joints
2.2 Experimental results
2.2.1 Evaluation metrics
2.2.2 Preset Joint Subset Selection
2.2.3 Automatic Joint Subset Selection
2.3 Conclusion of the chapter
CHAPTER 3. FEATURE FUSION FOR THE GRAPH CONVOLUTIONAL NETWORK
3.1 Introduction
3.2 Related work on Graph Convolutional Networks
3.3 Proposed method
3.4 Experimental results
3.5 Discussion
3.6 Conclusion of the chapter
CHAPTER 4. THE PROPOSED LIGHTWEIGHT GRAPH CONVOLUTIONAL NETWORK
4.1 Introduction
4.2 Related work on Lightweight Graph Convolutional Networks
4.3 Proposed method
4.4 Experimental results
4.5 Application demonstration
4.6
Conclusion of the chapter
CONCLUSION AND FUTURE WORKS
PUBLICATIONS
BIBLIOGRAPHY

ABBREVIATIONS

2D      Two-Dimensional
3D      Three-Dimensional
AAGCN   Attention-enhanced Adaptive Graph Convolutional Network
AMIJ    Adaptive number of Most Informative Joints
AGCN    Adaptive Graph Convolutional Network
AS      Action Set
AS-GCN  Actional-Structural Graph Convolutional Network
BN      Batch Normalization
BPL     Body Part Location
CAM     Channel Attention Module
CCTV    Closed-Circuit Television
CNN     Convolutional Neural Network
CovMIJ  Covariance Descriptor on Most Informative Joints
CPU     Central Processing Unit
CS      Cross-Subject
CV      Cross-View
DFT     Discrete Fourier Transform
DTW     Dynamic Time Warping
FC      Fully Connected
FF      Feature Fusion
FLOP    Floating Point Operation
FMIJ    Fixed number of Most Informative Joints
fps     frames per second
FTP     Fourier Temporal Pyramid
GCN     Graph Convolutional Network
GCNN    Graph-based Convolutional Neural Network
GPU     Graphics Processing Unit
GRU     Gated Recurrent Unit
HAR     Human Action Recognition
HCI     Human-Computer Interaction
HMM     Hidden Markov Model
JA      Joint Angle
JP      Joint Position
JSS     Joint Subset Selection
LARP    Lie Algebra Relative Pair
LSTM    Long Short-Term Memory
MIJ     Most Informative Joint
Mocap   Motion capture system
MRF     Markov Random Field
MTLN    Multi-Task Learning Network
OL      OverLapping
RA-GCN  Richly Activated Graph Convolutional Network
ReLU    Rectified Linear Unit
ResNet  Residual Neural Network
RJP     Relative Joint Position
RNN     Recurrent Neural Network
RVM     Relevance Vector Machine
SAM     Spatial Attention Module
SDK     Software Development Kit
SE      Special Euclidean group
SO      Special Orthogonal group
ST-GCN  Spatial-Temporal Graph Convolutional Network
STC     Spatial-Temporal-Channel Attention Module
SVM     Support Vector Machine
t-SNE   t-Distributed Stochastic Neighbor Embedding
TAM     Temporal Attention Module
TCD     Temporal Covariance Descriptor
TCN     Temporal Convolutional Network
UAV     Unmanned Aerial Vehicle
VFDT    Very Fast Decision Trees
... recently. Deep learning focuses on learning by layers of representation and abstraction. Deep learning-based approaches can process data and automate feature extraction, representation, and classification ... used for action recognition. Among data modalities, the skeleton data modality is compact and efficient for action representation, so this work focuses on action recognition using skeleton data. Section ... dissertation is on action recognition with skeleton data. In skeleton-based action recognition, actions are represented as skeleton sequences. Sample frames of the hammer action in MSR-Action3D are