MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

PHAM DINH TAN

A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA

Major: Computer Engineering
Code: 9480106

ABSTRACT OF DOCTORAL DISSERTATION
COMPUTER ENGINEERING

Hanoi - 2022

This study is completed at: Hanoi University of Science and Technology
Supervisors:
Assoc. Prof. Dr. Vu Hai
Assoc. Prof. Dr. Le Thi Lan
Reviewer 1:
Reviewer 2:
Reviewer 3:

This dissertation will be defended before the approval committee at Hanoi University of Science and Technology:
Time ....., date ..... month ..... year 2022

This dissertation can be found at:
Ta Quang Buu Library - Hanoi University of Science and Technology
Vietnam National Library

INTRODUCTION

Motivation
Human action recognition (HAR) can be defined as the task of identifying and naming actions, using machine learning techniques, from data collected by various devices. Examples of such devices include wearable sensors, electronic device sensors such as smartphone inertial sensors, camera devices such as the Microsoft Kinect, and closed-circuit television (CCTV) cameras. Action recognition is vital for many applications such as human-computer interaction, camera surveillance, gaming, remote care for the elderly, smart home/office/city, and various monitoring applications.

Problem formulation
This work focuses on skeleton-based action recognition. Assume that segmented skeleton sequences and their corresponding action labels are provided, and that each skeleton sequence contains exactly one action. Action recognition aims to predict the action label from the skeleton data.
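To make this formulation concrete, the following is a minimal sketch of how a segmented skeleton sequence and the classifier interface can be organized; the tensor layout, sequence length, and joint count below are illustrative assumptions rather than values fixed by the dissertation.

```python
import numpy as np

# A segmented skeleton sequence: T frames, J joints, 3 coordinates (x, y, z).
# J = 20 for the Kinect v1 datasets used here and 25 for NTU RGB+D; T varies
# per sample, so sequences are typically padded or resampled to a fixed length.
T, J, C = 64, 20, 3                      # illustrative values only
sequence = np.zeros((T, J, C), dtype=np.float32)

def recognize_action(seq: np.ndarray, model) -> int:
    """Predict a single action label for one segmented sequence.

    `model` is any callable returning per-class scores for one sequence;
    this stub simply returns the index of the highest-scoring class.
    """
    scores = model(seq)                  # shape: (num_classes,)
    return int(np.argmax(scores))
```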
Challenges
Action recognition faces numerous difficulties arising from the diversity, intra-class variations, and inter-class similarity of human actions. Four major challenges are discussed: (1) intra-class variations and inter-class similarity, (2) noise in skeleton data, (3) occlusion caused by other body parts or by other objects/persons, and (4) insufficient labeled data.

Objectives
The objectives are as follows:
• A compact representation of human actions: Joints in the skeleton model play different roles in each action. The first objective is to find action representations that are efficient for action recognition.
• Improving action recognition performance on noisy skeleton data: The second objective is to design a deep learning architecture that achieves high-performance action recognition on noisy skeleton data.
• Proposing a lightweight model for action recognition: Computation capacity is limited on edge devices, so a lightweight deep learning model is required for application development. Constructing an efficient, lightweight model for action recognition is the third objective of this dissertation.

Context and constraints
The context and constraints of this dissertation are as follows:
• Three public datasets and one self-collected dataset are used for evaluation. The datasets contain segmented skeleton sequences collected by Microsoft Kinect sensors. The lists of human actions are predefined in these datasets, which contain actions performed by a single person or interactions between two persons. Other datasets are not considered or evaluated in this work.
• Only daily-life actions are considered in the dissertation. Action classes in art performance or any other specific domain are outside the scope of this work.
• For all four datasets, the training/testing splits and evaluation protocols are kept the same as in the works where the datasets were introduced.
• A cross-subject benchmark is performed on all datasets, with half of the subjects used for training and the other half for testing.
• A cross-view benchmark is performed on the NTU RGB+D dataset. Sequences captured by cameras 2 and 3 are used for training; sequences from camera 1 are used for testing. Only single-view data are used; multi-view data processing is not considered in this work.
• The study aims at deploying an application using the proposed methods. This application is developed to evaluate the performance of a person doing yoga exercises. Pose estimation for a single person is implemented using the public tool Google MediaPipe. Due to time limitations, only the result of the action recognition module is introduced. Related modules such as action spotting, human pose evaluation, and exercise scoring/assessment are out of the scope of this study.

Contributions
Three main contributions are introduced in this dissertation:
• Contribution 1: Propose two Joint Subset Selection (JSS) methods for human action recognition: the Preset JSS method and the automatic Most Informative Joints (MIJ) selection method.
• Contribution 2: Propose a Feature Fusion (FF) module that combines spatial and temporal features for the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) using the Relative Joint Position and the joint velocity. The proposed method, named FF-AAGCN, outperforms the baseline method on challenging datasets with noise in the skeleton data; a hedged sketch of this pre-processing step is given after this list.
• Contribution 3: Propose the lightweight model LW-FF-AAGCN, which has far fewer model parameters than the baseline method while achieving competitive recognition performance. The proposed method enables the deployment of human action recognition applications on devices with limited computation capacity.
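The following is a minimal sketch of the Feature Fusion pre-processing referred to in Contribution 2, combining Relative Joint Position (RJP) and joint velocity before the graph network; the reference joint, the frame offset, and the channel concatenation are assumptions of this sketch, not the exact module defined in Chapter 3.

```python
import numpy as np

def feature_fusion(seq: np.ndarray, ref_joint: int = 0, offset: int = 1) -> np.ndarray:
    """Combine Relative Joint Position (RJP) and joint velocity.

    seq:       skeleton sequence of shape (T, J, 3).
    ref_joint: index of the reference joint for RJP (assumed here to be a
               stable torso joint such as the spine/hip center).
    offset:    frame offset used to approximate the joint velocity.
    Returns an array of shape (T, J, 6): RJP channels followed by velocity channels.
    """
    # Relative Joint Position: subtract the reference joint at every frame.
    rjp = seq - seq[:, ref_joint:ref_joint + 1, :]

    # Joint velocity: temporal difference with the chosen frame offset,
    # zero-padded so the output keeps the original length T.
    velocity = np.zeros_like(seq)
    velocity[offset:] = seq[offset:] - seq[:-offset]

    return np.concatenate([rjp, velocity], axis=-1)
```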
Dissertation outline
Excluding the introduction and conclusion, the dissertation consists of four chapters and is structured as follows:
• Introduction: This section provides the motivation, objectives, challenges, constraints, and contributions of the research.
• Chapter 1, entitled "Literature Review": This chapter briefly reviews the existing literature to provide a comprehensive understanding of human action recognition.
• Chapter 2, entitled "Joint Subset Selection for Skeleton-based Human Action Recognition": This chapter presents Preset JSS and automatic MIJ selection.
• Chapter 3, entitled "Feature Fusion for the Graph Convolutional Network": A Feature Fusion (FF) module is proposed for data pre-processing. The graph-based deep learning model FF-AAGCN outperforms the baseline model on CMDFALL, a challenging dataset with noisy skeleton data.
• Chapter 4, entitled "The Proposed Lightweight Graph Convolutional Network": The lightweight model LW-FF-AAGCN is proposed with fewer parameters than the baseline AAGCN. LW-FF-AAGCN is suitable for application development on edge devices with limited computation capacity.
• Conclusion and future works: This section summarizes the dissertation's contributions and introduces directions for future work on human action recognition.

CHAPTER 1. LITERATURE REVIEW

1.1 An overview of action recognition
Due to its wide range of applications, human action recognition (HAR) has been studied for decades. The mechanism of human vision is the key direction that researchers have followed for action recognition. The human vision system can observe the motion and shape of the human body in a short time; the observations are then transferred to the intermediate human perception system, which classifies them as, for example, walking, jogging, or running. The human visual perception system is highly reliable and precise in action recognition. Over the last few decades, researchers have aimed at a similar level of performance with computer-based recognition systems. Unfortunately, we are still far from the level of the human visual system due to several issues associated with action recognition, such as environmental complexity, intra-class variation, viewpoint variation, and occlusion.

1.2 Data modalities for action recognition
Different data modalities, such as color, depth, skeleton, and acceleration data, can be used for action recognition. Data modalities can be loosely separated into two groups: visual and non-visual modalities. Visual modalities such as color, depth, and skeleton data are visually intuitive for describing actions and are popular for action recognition. The trajectories of joints are encoded in skeleton data; when an action does not involve objects or scene context, the skeleton represents it efficiently. Visual modalities have been widely utilized in surveillance systems, while in robotics and autonomous driving, depth data with distance information is commonly employed for action recognition. Meanwhile, non-visual modalities such as acceleration are not visually intuitive for describing human actions; however, they can still be used for action recognition in specific cases requiring privacy protection. Each data modality has its own advantages, depending on the usage. Color was the most popular modality in early research on action recognition. Recently, with the popularity of low-cost depth sensors and advances in pose estimation, skeleton data have become easy to acquire with higher quality.

1.3 Skeleton data collection
Skeleton data are temporal sequences of joint positions. Joints are connected in the kinetic model following the natural structure of the human body, so it is convenient to model actions with this model. Skeleton data can be collected using motion capture (MoCap) systems, depth sensors, or color/depth-based pose estimation. In motion capture systems, markers are placed on joint positions, and the collected skeleton data are highly accurate. However, MoCap systems are expensive and inconvenient for many practical applications, so this work focuses on skeleton data collected by low-cost depth sensors.
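As a hedged illustration of the pose-estimation route to skeleton data, the sketch below uses Google MediaPipe, the public tool adopted later for the demo application, to stack per-frame landmarks into an array; the landmark count and the mapping to a Kinect-style joint set are assumptions of this sketch, not part of the dissertation pipeline.

```python
import cv2
import mediapipe as mp
import numpy as np

def video_to_skeleton(path: str) -> np.ndarray:
    """Run MediaPipe Pose on a video and stack landmarks into a (T, 33, 3) array.

    MediaPipe's pose model outputs 33 landmarks; mapping them onto a Kinect-style
    20- or 25-joint skeleton would require an additional joint correspondence,
    which is not shown here.
    """
    frames = []
    cap = cv2.VideoCapture(path)
    with mp.solutions.pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is None:
                continue  # skip frames where no person is detected
            frames.append([(lm.x, lm.y, lm.z) for lm in result.pose_landmarks.landmark])
    cap.release()
    return np.asarray(frames, dtype=np.float32)
```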
1.4 Benchmark datasets
Many datasets have been built to develop and evaluate action recognition methods. Four datasets are used in the dissertation.

1.4.1 MSR-Action3D
There are 20 actions performed by ten subjects, and each subject performs each action two or three times. The skeleton model has 20 joints, and there are 557 action sequences in total. Actions in MSR-Action3D are grouped into three subsets: Action Set 1 (AS1), Action Set 2 (AS2), and Action Set 3 (AS3). Each subset consists of eight actions, so some actions appear in more than one subset.

1.4.2 MICA-Action3D
MICA-Action3D is a dataset collected at the MICA International Research Institute, Hanoi University of Science and Technology. The dataset is built for cross-dataset evaluation with the same list of 20 actions as MSR-Action3D. Action sequences are collected by a Kinect v1 sensor, and each action is repeated two or three times by each subject. Twenty subjects participate in the data collection, giving a total of 1,196 action samples.

1.4.3 CMDFALL
This dataset is introduced to evaluate methods for human falling event detection. For data collection, seven Kinect v1 sensors are installed around the capture area. Actions are divided into 20 categories and are performed by 50 people ranging in age from 21 to 40 (20 females and 30 males).

1.4.4 NTU RGB+D
The NTU RGB+D dataset provides different data modalities collected by Kinect v2 sensors. In this dataset, the skeleton model has 25 joints, with one or two people in each scene. The dataset is currently the most common large-scale benchmark for skeleton-based action recognition methods. It contains 56,880 sequences categorized into 60 action classes and performed by 40 people. Three Kinect sensors are mounted at the same height but at different horizontal angles, and the dataset is collected using 17 camera setups with different heights and distances. The authors of the dataset recommend two benchmarks: (1) Cross-subject (CS): half of the subjects are used for training and the other half for testing; there are 40,320 sequences in the training set and 16,560 sequences in the testing set. (2) Cross-view (CV): the training set includes 37,920 sequences (from cameras 2 and 3), and the testing set consists of 18,960 sequences (from camera 1).
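A minimal sketch of how the two NTU RGB+D benchmark splits can be derived from sequence names is given below; the file-name pattern and the training-subject list follow common usage of the dataset and should be checked against the official release, while the camera assignment (cameras 2 and 3 for training, camera 1 for testing) follows the description above.

```python
# Training subjects for the NTU RGB+D cross-subject split as commonly listed in
# the literature; verify against the official dataset release before use.
CS_TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25,
                     27, 28, 31, 34, 35, 38}
CV_TRAIN_CAMERAS = {2, 3}   # camera 1 is held out for testing

def split_sample(name: str, benchmark: str) -> str:
    """Assign one NTU sample to 'train' or 'test'.

    `name` follows the usual NTU naming pattern, e.g. 'S001C002P003R001A004',
    where C encodes the camera id and P the performer (subject) id.
    """
    camera = int(name[5:8])
    subject = int(name[9:12])
    if benchmark == "xsub":
        return "train" if subject in CS_TRAIN_SUBJECTS else "test"
    if benchmark == "xview":
        return "train" if camera in CV_TRAIN_CAMERAS else "test"
    raise ValueError("benchmark must be 'xsub' or 'xview'")
```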
Table 3.3. Performance evaluation on CMDFALL with Precision, Recall, and F1 scores (%).

No.  Method                          Prec    Recall  F1
1    Cov3DJ                          -       -       61
2    Joint Position (JP)             -       -       49.18
3    Res-TCN                         -       -       39.38
4    CovMIJ                          -       -       62.5
5    CNN                             48.68   41.78   40.34
6    CNN-LSTM                        45.24   40.58   39.24
7    CNN-Velocity                    49.97   47.89   46.13
8    CNN-LSTM-Velocity               47.64   46.51   45.23
9    RA-GCN                          61.18   59.28   58.63
10   AAGCN                           65.7    65.57   65.11
11   AS-RAGCN                        75.82   74.81   74.9
12   Preset JSS                      -       -       52.86
13   Preset JSS using Covariance     -       -       60.2
14   FMIJ (Chapter 2)                -       -       64
15   AMIJ (Chapter 2)                -       -       64
16   Proposed (FF-AAGCN)             77.87   78.52   77.59

Table 3.4. Performance evaluation by accuracy (%) on NTU RGB+D.

No.  Method                CS     CV
1    Bi-directional RNN    59.1   64.0
2    Part-based LSTM       60.7   67.3
3    ST-LSTM               69.2   77.7
4    STA-LSTM              73.4   81.2
5    VA-LSTM               79.2   87.7
6    ARRN-LSTM             80.7   88.8
7    IndRNN                81.8   88.0
8    SRN+TSL               84.8   92.4
9    Res-TCN               74.3   83.1
10   Clip CNN              79.6   84.8
11   Synthesized CNN       80.0   87.2
12   Motion CNN            83.2   89.3
13   Multi-scale CNN       85.0   92.3
14   ST-GCN                81.5   88.3
15   GCNN                  83.5   89.8
16   Dense IndRNN          86.7   94.0
17   AS-GCN                86.8   94.2
18   AGCN                  87.3   93.7
19   3s RA-GCN             87.3   93.6
20   AS-RAGCN              87.7   92.9
21   AAGCN                 88.0   95.1
22   Proposed (FF-AAGCN)   88.2   94.8

3.5 Conclusion of the chapter
In this chapter, an action recognition method is proposed based on the integration of the Feature Fusion module into the AAGCN. The proposed system is named FF-AAGCN. RJP and joint velocity are combined in the Feature Fusion module. FF-AAGCN outperforms the baseline AAGCN on the challenging CMDFALL dataset. On NTU RGB+D, the proposed method achieves a cross-subject accuracy of 88.2% and a cross-view accuracy of 94.8%, which is competitive with AAGCN on this dataset. Improvement is observed when the Feature Fusion uses velocities with different frame offsets. Results in this chapter are published in [C3], [J1], and [J3].

CHAPTER 4. THE PROPOSED LIGHTWEIGHT GRAPH CONVOLUTIONAL NETWORK

4.1 Introduction
As shown in Chapter 2, Joint Subset Selection (JSS) is efficient for action representation, and the Feature Fusion module proposed in Chapter 3 improves the performance of the graph-based network on challenging datasets. Those methods mainly focus on improving recognition accuracy; however, low-complexity deep learning models are required for application development on edge devices. There are different approaches for reducing the number of parameters of deep learning models. In this chapter, a lightweight model is proposed by pruning layers of the deep learning network and combining Preset JSS with the Feature Fusion module studied in Chapter 2 and Chapter 3. Two graphs are defined using the joints selected by the Preset JSS. A demo using the proposed lightweight model for action recognition is also presented in this chapter.

4.2 Related work on Lightweight Graph Convolutional Networks
GCN-based models perform exceptionally well on large-scale datasets; however, they require substantial computing power. Efforts have therefore been made to develop lightweight models.

4.3 Proposed method
A lightweight model is proposed based on the FF-AAGCN of Chapter 3. The purpose is to design a deep learning model with fewer parameters; the proposed lightweight model is named LW-FF-AAGCN. In FF-AAGCN, there are ten basic blocks with different numbers of output channels: four blocks with 64 output channels, three blocks with 128 output channels, and three blocks with 256 output channels. Only three basic blocks with 128 output channels are used in LW-FF-AAGCN. Two graphs based on the Preset JSS are proposed. The diagram of the proposed method is shown in Figure 4.1. The Feature Fusion module pre-processes the skeleton data. Preset JSS (optional) is applied to select 13 joints from the skeleton model, as in Chapter 2, and two graphs are defined based on the joints selected by the Preset JSS. The output of the JSS is fed into a Batch Normalization (BN) layer for data normalization. The output of the BN layer is transferred to three basic blocks, B1, B2, and B3, each with 128 output channels. The output of B3 is fed into a global average pooling (GAP) layer, which creates a 128-dimensional feature vector. The feature vector is passed to a softmax layer for classification.

Figure 4.1. Diagram of the proposed LW-FF-AAGCN.
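The listing below is a hedged sketch of the LW-FF-AAGCN data flow just described (input normalization, three 128-channel blocks, global average pooling, and a classifier). The internals of the AAGCN basic block (adaptive graph convolution with attention) are not reproduced; a plain convolutional stand-in keeps the sketch runnable, and the six input channels assume the RJP-plus-velocity fusion of Chapter 3.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Placeholder for one AAGCN basic block (spatial graph conv + temporal conv).

    The real block contains adaptive adjacency and attention; here a plain
    convolution over the temporal axis stands in so the data flow stays runnable.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(9, 1), padding=(4, 0))
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (N, C, T, V)
        return self.relu(self.bn(self.conv(x)))

class LWFFAAGCNSketch(nn.Module):
    """Data flow described above: BN -> B1 -> B2 -> B3 -> GAP -> classifier."""
    def __init__(self, in_channels: int = 6, num_joints: int = 13, num_classes: int = 20):
        super().__init__()
        self.data_bn = nn.BatchNorm1d(in_channels * num_joints)
        self.b1 = BasicBlock(in_channels, 128)
        self.b2 = BasicBlock(128, 128)
        self.b3 = BasicBlock(128, 128)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):                         # x: (N, C, T, V) after FF + JSS
        n, c, t, v = x.shape
        x = self.data_bn(x.permute(0, 1, 3, 2).reshape(n, c * v, t)).reshape(n, c, v, t)
        x = x.permute(0, 1, 3, 2)                 # back to (N, C, T, V)
        x = self.b3(self.b2(self.b1(x)))
        x = x.mean(dim=[2, 3])                    # global average pooling over time and joints
        return self.fc(x)                         # logits; softmax is applied in the loss
```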
The Preset JSS scheme of Chapter 2 is used to select 13 joints (marked in blue) from the skeleton model of 20 joints, as shown in Figure 4.2a; the joints marked in red are not used for action recognition. Two graphs are proposed over the subset of 13 blue joints: JSS graph type A (JSS-A) and JSS graph type B (JSS-B). In JSS-A, elbow and knee joints are connected to the head joint to form a connected graph from the selected joints, as shown in Figure 4.2b. For JSS-B, symmetrical connections are added for the pairs of elbows, wrists, knees, and ankles, as shown in Figure 4.2c. The connections between these symmetrical joints are important since, in each pair, the joints move in different directions in many actions such as running and walking.

Figure 4.2. (a) Preset selection of 13 joints (blue) from the skeleton model of 20 joints. (b) Graph type A (JSS-A) is defined by the solid edges connecting the 13 blue joints. (c) Graph type B (JSS-B) is defined by the solid green edges connecting the 13 blue joints, with additional edges between symmetrical joints. Only the spatial dimension of the graph is shown for simplicity.
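A hedged sketch of how the two joint-subset graphs can be encoded as adjacency matrices follows. The joint names and ordering are illustrative assumptions, since the abstract does not list the exact indices of the 13 selected joints; only the construction idea, a base graph over the subset for JSS-A plus symmetric left/right edges for JSS-B, follows the description above.

```python
import numpy as np

# Illustrative names for the 13 selected joints; the dissertation's exact
# selection and ordering may differ.
JOINTS = ["head", "l_elbow", "r_elbow", "l_wrist", "r_wrist", "l_hand", "r_hand",
          "l_knee", "r_knee", "l_ankle", "r_ankle", "l_foot", "r_foot"]
IDX = {name: i for i, name in enumerate(JOINTS)}

# JSS-A: limb chains plus elbow/knee-to-head links so the graph stays connected.
EDGES_A = [("l_elbow", "l_wrist"), ("l_wrist", "l_hand"),
           ("r_elbow", "r_wrist"), ("r_wrist", "r_hand"),
           ("l_knee", "l_ankle"), ("l_ankle", "l_foot"),
           ("r_knee", "r_ankle"), ("r_ankle", "r_foot"),
           ("head", "l_elbow"), ("head", "r_elbow"),
           ("head", "l_knee"), ("head", "r_knee")]

# JSS-B: JSS-A plus edges between symmetrical joint pairs.
SYMMETRIC = [("l_elbow", "r_elbow"), ("l_wrist", "r_wrist"),
             ("l_knee", "r_knee"), ("l_ankle", "r_ankle")]

def adjacency(edges, num_joints=len(JOINTS)) -> np.ndarray:
    """Symmetric adjacency matrix with self-loops, as used by most GCN layers."""
    a = np.eye(num_joints, dtype=np.float32)
    for u, v in edges:
        a[IDX[u], IDX[v]] = a[IDX[v], IDX[u]] = 1.0
    return a

A_jss_a = adjacency(EDGES_A)
A_jss_b = adjacency(EDGES_A + SYMMETRIC)
```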
4.4 Experimental results
The lightweight model consists of the Feature Fusion module (FF), layer pruning for the lightweight design (LW), and the Preset JSS. An ablation study is conducted to see how each component of the proposed method contributes to the overall results; comparison results are shown in Table 4.1. Experiments are conducted on a server with an Intel i7-8700 CPU, 32 GB of memory, and a GeForce GTX 1080Ti GPU.

Table 4.1. Ablation study on CMDFALL. Performance scores are in percentage. Abbreviations: Feature Fusion (FF), Lightweight (LW), Joint Subset Selection (JSS).

No.  Method                      FF  LW  JSS  Precision  Recall  F1
1    AAGCN                       ✗   ✗   ✗    65.70      65.57   65.11
2    LW-AAGCN                    ✗   ✓   ✗    67.03      66.44   66.39
3    FF-AAGCN (Chapter 3)        ✓   ✗   ✗    77.87      78.52   77.59
4    LW-FF-AAGCN                 ✓   ✓   ✗    80.64      81.48   80.59
5    LW-FF-AAGCN with JSS-A      ✓   ✓   ✓    79.73      80.20   79.56
6    LW-FF-AAGCN with JSS-B      ✓   ✓   ✓    81.00      80.97   80.63

A minor improvement is observed when only layer pruning is applied to the baseline AAGCN, with an F1-score of 66.39%. Using the Feature Fusion module only, the F1-score is 77.59%, as reported in Chapter 3. When combining both LW and FF, the F1-score of LW-FF-AAGCN is 80.59%. Then, JSS graph types A and B are evaluated: the F1-scores are 79.56% and 80.63% for JSS-A and JSS-B, respectively. This means that, on CMDFALL, adding symmetrical connections helps improve recognition performance.

Table 4.2. Performance comparison of different methods on CMDFALL. Performance scores are in percentage.

No.  Method                                 Precision  Recall  F1
1    Cov3DJ                                 -          -       61
2    Joint Position (JP)                    -          -       49.18
3    Res-TCN                                -          -       39.38
4    CovMIJ                                 -          -       62.5
5    CNN                                    48.68      41.78   40.34
6    CNN-LSTM                               45.24      40.58   39.24
7    CNN-Velocity                           49.97      47.89   46.13
8    CNN-LSTM-Velocity                      47.64      46.51   45.23
9    RA-GCN                                 61.18      59.28   58.63
10   AAGCN                                  65.7       65.57   65.11
11   AS-RAGCN                               75.82      74.81   74.9
12   Preset JSS (Chapter 2)                 -          -       52.86
13   Preset JSS using Covariance (Chapter 2) -         -       60.2
14   FMIJ (Chapter 2)                       -          -       64
15   AMIJ (Chapter 2)                       -          -       64
16   FF-AAGCN (Chapter 3)                   77.87      78.52   77.59
17   Proposed (LW-FF-AAGCN)                 80.64      81.48   80.59
18   Proposed (LW-FF-AAGCN JSS-A)           79.73      80.20   79.56
19   Proposed (LW-FF-AAGCN JSS-B)           81.00      80.97   80.63

The performance comparison between LW-FF-AAGCN and existing methods on CMDFALL is shown in Table 4.2. LW-FF-AAGCN achieves an F1-score of up to 80.59%, which is 3% higher than FF-AAGCN and 14.44% higher than the baseline AAGCN. Cov3DJ and Joint Position (JP) are the baselines for the methods proposed in Chapter 2. Res-TCN is the method used in the paper that introduced the CMDFALL dataset. CovMIJ is a variant of FMIJ, as discussed in Chapter 2. CNN and CNN-Velocity are methods based on the Convolutional Neural Network, while CNN-LSTM and CNN-LSTM-Velocity are hybrids of Convolutional and Recurrent Neural Networks. RA-GCN, AAGCN, and AS-RAGCN are Graph Convolutional Networks. The remaining methods are those proposed in Chapter 2 and Chapter 3.

The numbers of model parameters and the computation requirements on CMDFALL are shown in Table 4.3. The computation metric is the number of Floating Point Operations (FLOPs), i.e., the number of operations required to classify one action sample. It can be seen that layer pruning reduces the number of model parameters by up to 5.6 times, and the lightweight model with JSS requires 1.74 times fewer FLOPs than the baseline AAGCN.

Table 4.3. Model parameters and computation requirement on CMDFALL.

No.  Method                      FF  LW  JSS  Param   FLOPs
1    AAGCN                       ✗   ✗   ✗    3.74M   50.94G
2    LW-AAGCN                    ✗   ✓   ✗    0.66M   44.81G
3    FF-AAGCN (Chapter 3)        ✓   ✗   ✗    3.75M   50.98G
4    LW-FF-AAGCN                 ✓   ✓   ✗    0.66M   44.85G
5    LW-FF-AAGCN with JSS-A      ✓   ✓   ✓    0.66M   29.15G
6    LW-FF-AAGCN with JSS-B      ✓   ✓   ✓    0.66M   29.15G

On the large-scale NTU RGB+D dataset, the AAGCN model has a total of 3.76 million parameters, whereas the proposed model has 0.67 million parameters, 5.6 times fewer than the baseline AAGCN. The proposed method achieves a trade-off in performance compared with AAGCN, as shown in Table 4.4. For the cross-subject benchmark, the accuracy of the proposed method is 86.9%, whereas that of AAGCN is 88.0%. For the cross-view benchmark, the accuracy of the proposed method is 92.7%, and the accuracy of AAGCN is 95.1%.
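For readers who wish to reproduce the parameter counts reported in Tables 4.3 and 4.4, the snippet below shows the standard way of counting trainable parameters of a PyTorch model; the small stand-in network is only there to make the snippet self-contained, and FLOPs estimation would additionally require a profiler, as noted in the comments.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters (the 'Param' column, before rounding to millions)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Works for any nn.Module, including the LW-FF-AAGCN sketch shown earlier;
# a small stand-in model is used here so the snippet runs on its own.
model = nn.Sequential(nn.Conv2d(6, 128, kernel_size=(9, 1), padding=(4, 0)),
                      nn.BatchNorm2d(128),
                      nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(),
                      nn.Linear(128, 20))
print(f"{count_parameters(model) / 1e6:.2f}M trainable parameters")

# FLOPs per classified sample (Tables 4.3 and 4.4) cannot be read off the
# parameter count; they are usually measured with a profiler such as thop's
# profile(model, inputs=(dummy_input,)), where dummy_input follows the
# (N, C, T, V) layout used by the network.
```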
4.5 Application demonstration
In this section, a demo is developed to score a person's performance on the actions of MSR-Action3D. Due to time limitations, only the result of the action recognition module is introduced; related modules such as action spotting, human pose evaluation, and exercise scoring/assessment are out of the scope of this study.

Table 4.4. Comparison of the numbers of model parameters, FLOPs, and accuracy (%) on NTU RGB+D.

No.  Method                 Param   FLOPs    CS (%)  CV (%)
1    LSTM-CNN               60M     -        82.9    90.1
2    SR-TSL                 19.1M   4.2G     84.8    92.4
3    HCN                    2.64M   -        86.5    91.1
4    ST-GCN                 3.1M    16.32G   81.5    88.3
5    DCM                    10M     -        84.5    91.3
6    AS-GCN                 7.1M    35.92G   86.8    94.2
7    RA-GCNv1               6.21M   32.8G    85.9    93.5
8    AGCN                   3.47M   18.66G   87.3    93.7
9    RA-GCNv2               6.21M   32.8G    87.3    93.6
10   AAGCN                  3.76M   16.43G   88.0    95.1
11   SAR-NAS                1.3M    10.2G    86.4    94.3
12   AS-RAGCN               4.88M   -        87.7    92.9
13   STAR-64                0.42M   -        81.9    88.9
14   STAR-128               1.26M   -        83.4    89.0
15   FF-AAGCN (Chapter 3)   3.76M   16.44G   88.2    94.8
16   LW-FF-AAGCN            0.67M   14.26G   86.9    92.7
17   LW-FF-AAGCN JSS-A      0.66M   7.42G    84.1    90.1
18   LW-FF-AAGCN JSS-B      0.66M   7.42G    83.5    90.1

4.6 Conclusion of the chapter
In this chapter, the lightweight model LW-FF-AAGCN is proposed. Layer pruning of the AAGCN deep learning network is combined with a Preset JSS module and a Feature Fusion module. Once Preset JSS is enabled, two graph topologies (JSS-A and JSS-B) are defined over the selected joints. Graph type B (JSS-B), with edges connecting symmetrical joints, achieves excellent performance on CMDFALL with fewer model parameters and FLOPs. The number of parameters is reduced using Preset JSS and layer pruning. Experimental results show that the lightweight model with graph type B (JSS-B) outperforms the baseline AAGCN on challenging datasets with 5.6 times fewer trainable parameters than the baseline, and the computational complexity in FLOPs of the proposed model is 3.5 times lower than that of the baseline on CMDFALL. A study is also conducted to evaluate the performance of LW-FF-AAGCN with different dataset sizes, and a demo is presented using the proposed method for human action recognition. Results in this chapter have been submitted to Multimedia Tools and Applications (MTAP), an ISI Q1 journal, in the paper "A Lightweight Graph Convolutional Network for Skeleton-based Action Recognition".

CONCLUSION AND FUTURE WORKS

Conclusions
In this dissertation, skeleton-based action recognition methods are proposed, with three main contributions. The first contribution is on joint subset selection, with both a preset configuration and automatic schemes that help improve the performance of action recognition. In the second contribution, a Feature Fusion module is coupled with AAGCN to form FF-AAGCN. Feature Fusion is a simple and efficient data pre-processing module for graph-based deep learning, especially for noisy skeleton data. The proposed FF-AAGCN outperforms the baseline AAGCN on CMDFALL, a challenging dataset with noise in the skeleton data, and also obtains competitive results compared to AAGCN on a large-scale dataset such as NTU RGB+D. The third contribution is the lightweight model LW-FF-AAGCN, whose number of parameters is 5.6 times smaller than that of the baseline. The proposed lightweight model is suitable for application development on edge devices with limited computation capacity. LW-FF-AAGCN outperforms both AAGCN and FF-AAGCN on CMDFALL.

Future work
Short-term perspectives:
• Study noise in skeleton data caused by pose estimation errors of RGB-D sensors; a calibrated MoCap system is required as a reference for evaluation.
• Study different statistical metrics for Joint Subset Selection, such as the variance of joint angles, in graph-based deep learning networks.
• Develop graph-based lightweight models for application development on edge devices. As computation capacity is limited on edge devices, lightweight models are required for real-time applications.
• Study the interpretability of action recognition using graph-based deep learning.
• Improve the quality of pose estimation for high-performance action recognition.

Long-term perspectives:
• Extend the proposed methods to continuous skeleton-based human action recognition.
• Extend the study of Graph Convolutional Networks to Geometric Deep Learning, an approach that unifies deep learning models by exploring the common mathematics behind them.
• Develop applications using the proposed models for human action recognition, such as elderly monitoring in healthcare or camera surveillance for abnormal behavior detection.

PUBLICATIONS

Conferences
[C1] Tien-Nam Nguyen, Dinh-Tan Pham, Thi-Lan Le, Hai Vu, and Thanh-Hai Tran (2018), "Novel Skeleton-based Action Recognition Using Covariance Descriptors on Most Informative Joints", Proceedings of the International Conference on Knowledge and Systems Engineering (KSE 2018), IEEE, Vietnam, ISBN: 978-1-5386-6113-0, pp. 50-55, 2018.
[C2] Dinh-Tan Pham, Tien-Nam Nguyen, Thi-Lan Le, and Hai Vu (2019), "Analyzing Role of Joint Subset Selection in Human Action Recognition", Proceedings of the NAFOSTED Conference on Information and Computer Science (NICS 2019), IEEE, Vietnam, ISBN: 978-1-7281-5163-2, pp. 61-66, 2019.
[C3] Dinh-Tan Pham, Tien-Nam Nguyen, Thi-Lan Le, and Hai Vu (2020), "Spatio-Temporal Representation for Skeleton-based Human Action Recognition", Proceedings of the International Conference on Multimedia Analysis and Pattern Recognition (MAPR 2020), IEEE, Vietnam, ISBN: 978-1-7281-6555-4, pp. 1-6, 2020.

Journals
[J1] Dinh-Tan Pham, Quang-Tien Pham, Thi-Lan Le, and Hai Vu (2021), "An Efficient Feature Fusion of Graph Convolutional Networks and its Application for Real-Time Traffic Control Gestures Recognition", IEEE Access, ISSN: 2169-3536, pp. 121930-121943, 2021. (ISI, Q1)
[J2] Van-Toi Nguyen, Tien-Nam Nguyen, Thi-Lan Le, Dinh-Tan Pham, and Hai Vu (2020), "Adaptive most joint selection and covariance descriptions for a robust skeleton-based human action recognition", Multimedia Tools and Applications (MTAP), Springer, DOI: 10.1007/s11042-021-10866-4, pp. 1-27, 2021. (ISI, Q1)
[J3] Dinh Tan Pham, Thi Phuong Dang, Duc Quang Nguyen, Thi Lan Le, and Hai Vu (2021), "Skeleton-based Action Recognition Using Feature Fusion for Spatial-Temporal Graph Convolutional Networks", Journal of Science and Technique, Le Quy Don Technical University (LQDTU-JST), ISSN: 1859-0209, pp. 7-24, 2021.
