MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

PHAM DINH TAN

A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA

Major: Computer Engineering
Code: 9480106

ABSTRACT OF DOCTORAL DISSERTATION IN COMPUTER ENGINEERING

Hanoi, 2022

This study is completed at: Hanoi University of Science and Technology

Supervisors:
Assoc. Prof. Dr. Vu Hai
Assoc. Prof. Dr. Le Thi Lan

Reviewer 1:
Reviewer 2:
Reviewer 3:

This dissertation will be defended before the approval committee at Hanoi University of Science and Technology: Time ........, date ........ month ........ year 2022.

This dissertation can be found at:
- Ta Quang Buu Library, Hanoi University of Science and Technology
- Vietnam National Library

INTRODUCTION

Motivation

Human action recognition (HAR) can be defined as the task of identifying and naming actions, using machine learning techniques, from data collected by various devices. Examples of such devices include wearable sensors, electronic device sensors such as smartphone inertial sensors, camera devices such as the Microsoft Kinect, and closed-circuit television (CCTV) cameras. Action recognition is vital for many applications such as human-computer interaction, camera surveillance, gaming, remote care for the elderly, smart homes/offices/cities, and various monitoring applications.

Problem formulation

This work focuses on skeleton-based action recognition. Assume that segmented skeleton sequences and the corresponding action labels are provided, and that each skeleton sequence contains exactly one action. Action recognition aims to predict the action label from the skeleton data.
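For illustration, the minimal sketch below shows how the input and output of this task can be represented; the array shapes, the label list, and the recognizer interface are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

# A segmented skeleton sequence: T frames, N joints, 3 coordinates (x, y, z).
# T and N are dataset-dependent (e.g., N = 20 for Kinect v1, N = 25 for Kinect v2).
T, N = 120, 20
sequence = np.zeros((T, N, 3), dtype=np.float32)      # one action instance

# Each sequence is paired with one label from a pre-defined action list.
action_classes = ["walking", "sitting_down", "falling"]  # illustrative labels
label = 0                                              # index into action_classes

def recognize(seq: np.ndarray) -> int:
    """Placeholder recognizer: maps a (T, N, 3) skeleton sequence to a class index.
    The dissertation studies such mappings (JSS features + SVM, GCN variants)."""
    raise NotImplementedError

# predicted = recognize(sequence)   # goal: predicted == label
```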
Challenges

There are numerous difficulties in action recognition, arising from the diversity, intra-class variations, and inter-class similarity of human actions. Four major challenges are discussed: (1) intra-class variations and inter-class similarity, (2) noise in skeleton data, (3) occlusion caused by other body parts or by other objects/persons, and (4) insufficient labeled data.

Objectives

The objectives are as follows:
• A compact representation of human actions: Joints in the skeleton model play different roles in each action. The first objective is to find action representations that are efficient for action recognition.
• Improving action recognition performance on noisy skeleton data: The second objective is to design a deep learning architecture that achieves high-performance action recognition on noisy skeleton data.
• Proposing a lightweight model for action recognition: Computation capacity is limited on edge devices, so a lightweight deep learning model is required for application development. Constructing an efficient, lightweight model for action recognition is the third objective of this dissertation.

Context and constraints

In this dissertation, the context and constraints are as follows:
• Three public datasets and one self-collected dataset are used for evaluation. The datasets contain segmented skeleton sequences collected by Microsoft Kinect sensors. The lists of human actions are pre-defined in these datasets, which contain actions performed by a single person or interactions between two persons. Other datasets are not considered or evaluated in this work.
• Only daily-life actions are considered. Action classes in art performance or other specific domains are out of the scope of this work.
• For all four datasets, the training/testing splits and evaluation protocols are kept the same as in the works where the datasets were introduced.
• The cross-subject benchmark is performed on all datasets, with half of the subjects used for training and the other half for testing.
• The cross-view benchmark is performed on the NTU RGB+D dataset: sequences captured by cameras 2 and 3 are used for training, and sequences from camera 1 are used for testing. Only single-view data are used; multi-view data processing is not considered in this work.
• The study aims at deploying an application using the proposed methods. This application is developed to evaluate the performance of a person doing yoga exercises. Pose estimation for a single person is implemented using the public tool Google MediaPipe. Due to time limitations, only the result of the action recognition module is introduced; related modules such as action spotting, human pose evaluation, and exercise scoring/assessment are out of the scope of this study.

Contributions

Three main contributions are introduced in this dissertation:
• Contribution 1: Two Joint Subset Selection (JSS) methods are proposed for human action recognition: the Preset JSS method and the automatic Most Informative Joints (MIJ) selection method.
• Contribution 2: A Feature Fusion (FF) module is proposed to combine spatial and temporal features for the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN), using the Relative Joint Position and the joint velocity. The proposed method, named FF-AAGCN, outperforms the baseline method on challenging datasets with noise in the skeleton data.
• Contribution 3: The lightweight model LW-FF-AAGCN is proposed, with far fewer model parameters than the baseline method while maintaining competitive recognition performance. The proposed method enables the deployment of human action recognition applications on devices with limited computation capacity.

Dissertation outline

Excluding the introduction and conclusion, the dissertation consists of four chapters and is structured as follows:
• Introduction: provides the motivation, objectives, challenges, constraints, and contributions of the research.
• Chapter 1, "Literature Review": briefly reviews the existing literature to obtain a comprehensive understanding of human action recognition.
• Chapter 2, "Joint Subset Selection for Skeleton-based Human Action Recognition": presents Preset JSS and automatic MIJ selection.
• Chapter 3, "Feature Fusion for the Graph Convolutional Network": proposes a Feature Fusion (FF) module for data pre-processing. The graph-based deep learning model FF-AAGCN outperforms the baseline model on CMDFALL, a challenging dataset with noisy skeleton data.
• Chapter 4, "The Proposed Lightweight Graph Convolutional Network": proposes the lightweight model LW-FF-AAGCN, which has fewer parameters than the baseline AAGCN and is suitable for application development on edge devices with limited computation capacity.
• Conclusion and future works: summarizes the dissertation's contributions and introduces directions for future work on human action recognition.
CHAPTER 1. LITERATURE REVIEW

1.1 An overview on action recognition

Due to its wide range of applications, human action recognition (HAR) has been studied for decades. The mechanism of human vision is the key direction that researchers have followed for action recognition. The human vision system can observe the motion and shape of the human body in a short time; the observations are then transferred to the intermediate human perception system, which classifies them as walking, jogging, or running. The human visual perception system is highly reliable and precise in action recognition. Over the last few decades, researchers have aimed at a similar level of performance with computer-based recognition systems. Unfortunately, we are still far from the level of the human visual system due to several issues associated with action recognition, such as environmental complexity, intra-class variations, viewpoint variations, and occlusions.

1.2 Data modalities for action recognition

Different data modalities, such as color, depth, skeleton, and acceleration data, can be used for action recognition. Data modalities can be loosely separated into two groups: visual modalities and non-visual modalities. Visual modalities such as color, depth, and skeleton are visually intuitive for describing actions and are popular for action recognition. The trajectories of joints are encoded in skeleton data; when an action does not involve objects or scene context, the skeleton represents it efficiently. Visual modalities have been widely utilized in surveillance systems. In robotics and autonomous driving, depth data with distance information is commonly employed for action recognition. Meanwhile, non-visual modalities such as acceleration are not visually intuitive for describing human actions; however, they can also be used for action recognition in specific cases requiring privacy protection. Each data modality has its own advantages, depending on the usage. Color was the most popular modality in early research on action recognition. Recently, with the popularity of low-cost depth sensors and advances in pose estimation, skeleton data have become easy to acquire with higher quality.

1.3 Skeleton data collection

Skeleton data are temporal sequences of joint positions. Joints are connected in the kinetic model following the natural structure of the human body, so it is convenient to model actions using the kinetic model. Skeleton data can be collected using motion capture (MoCap) systems, depth sensors, or color/depth-based pose estimation. In motion capture systems, markers are placed on joint positions; skeleton data collected by MoCap systems are highly accurate. However, MoCap systems are expensive and inconvenient for many practical applications. Therefore, this work focuses on skeleton data collected by low-cost depth sensors.

1.4 Benchmark datasets

Many datasets have been built to develop and evaluate action recognition methods. Four datasets are used in the dissertation:

1.4.1 MSR-Action3D

There are 20 actions performed by ten subjects; each subject performs each action two or three times. There are 20 joints in the skeleton model and 557 action sequences in total. Actions in MSR-Action3D are grouped into three subsets: Action Set 1 (AS1), Action Set 2 (AS2), and Action Set 3 (AS3). Each subset consists of eight actions, so some actions appear in more than one subset.

1.4.2 MICA-Action3D

MICA-Action3D is a dataset collected at the MICA International Research Institute, Hanoi University of Science and Technology. The dataset is built for cross-dataset evaluation with the same list of 20 actions as in MSR-Action3D. Action sequences are collected by a Kinect v1 sensor, and each action is repeated two or three times by each subject. Twenty subjects participate in data collection, giving a total of 1,196 action samples.

1.4.3 CMDFALL

The dataset was introduced to evaluate methods for human falling event detection. For data collection, seven Kinect v1 sensors are installed across the surroundings. Actions are divided into 20 categories and are performed by 50 people ranging in age from 21 to 40 (20 females and 30 males).
1.4.4 NTU RGB+D

The NTU RGB+D dataset was introduced with different data modalities collected by Kinect v2 sensors. In this dataset, the skeleton model has 25 joints, with one or two people in each scene. The dataset is currently the most common large-scale dataset for benchmarking skeleton-based action recognition methods. There are 56,880 sequences, categorized into 60 action classes and performed by 40 people. Three Kinect sensors are mounted at the same height but at various angles, and the dataset is collected using 17 camera setups with different heights and distances. The authors of the dataset recommend two benchmarks: (1) Cross-subject (CS): half of the subjects are used for training and the other half for testing, giving 40,320 training sequences and 16,560 testing sequences. (2) Cross-view (CV): the training set includes 37,920 sequences (from cameras 2 and 3), and the testing set consists of 18,960 sequences (from camera 1).

1.5 Skeleton-based action recognition

Skeleton data are used to describe human actions effectively. Skeleton data have many advantages, such as robustness against variations in clothing texture and background, and they are simple to obtain thanks to the widespread use of depth sensors and breakthroughs in human pose estimation with color/depth data. Due to the storage and computation efficiency of skeleton data, skeleton-based action recognition is becoming popular. Various methods for skeleton-based action recognition have been proposed in the literature, as shown in Figure 1.1. Joint positions can be used to extract spatial and temporal information: the spatial information primarily pertains to the skeleton's structure in a single frame, whereas the temporal information refers to the dependence of the skeleton data across frames. With the growth of deep learning, data-driven architectures have been proposed for action recognition with promising results in recent years. Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Convolutional Networks (GCNs) are some of the deep learning architectures that have been developed for action recognition using skeleton data.

Figure 1.1: Timeline of skeleton-based action recognition.

1.6 Research on action recognition in Vietnam

Some research groups in Vietnam have conducted studies on human action recognition. Action recognition has been, and continues to be, a challenging and attractive research topic.

1.7 Conclusion of the chapter

This chapter summarizes advances in action recognition. Various data modalities and action recognition methods have been investigated. The review focuses on skeleton data and skeleton-based action recognition, which is the dissertation's focus. Applications of action recognition in a variety of fields are discussed. Although several breakthroughs in action recognition have been reported, there are gaps in the practical application of action recognition systems. The challenges include action similarity, noise, occlusion, and limited labeled data. This necessitates more research into methods to improve current action recognition systems.

CHAPTER 2. JOINT SUBSET SELECTION FOR SKELETON-BASED HUMAN ACTION RECOGNITION

2.1 Introduction

Previous research shows that joints play different roles in action representation and recognition. Joint subset selection (JSS) can be categorized into either Preset JSS or automatic JSS. In the first category, joints are pre-determined to simplify the joint selection procedure. This scheme helps avoid computation and classification complexity while still ensuring that certain informative joints are selected. In the second category, joints are automatically selected via statistical metrics such as the means/variances of joint positions/angles, and the number of selected joints may vary across actions. In this chapter, a Preset JSS scheme is proposed using the Joint Position (JP) method as the baseline. Then, two techniques are proposed for automatic JSS, named the Fixed number of Most Informative Joints (FMIJ) and the Adaptive number of Most Informative Joints (AMIJ). For FMIJ, the number of selected joints is fixed for all action classes; for AMIJ, the number of selected joints is adapted to the characteristics of each action class. The roles of these methods are shown in Figure 2.1.

Figure 2.1: Joint Subset Selection for skeleton-based action recognition.
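The dissertation selects the most informative joints from covariance-based statistics. The sketch below illustrates the general idea with a simpler, hypothetical criterion (ranking joints by the variance of their motion); the metric, the fixed K of FMIJ-style selection, and the coverage threshold of AMIJ-style selection are illustrative assumptions, not the exact procedure.

```python
import numpy as np

def rank_joints_by_motion_variance(sequences):
    """Rank joints by a simple informativeness score (illustrative metric).
    sequences: list of arrays of shape (T, N, 3) belonging to one action class.
    Returns joint indices sorted from most to least informative, and the scores."""
    scores = []
    for seq in sequences:
        velocity = seq[1:] - seq[:-1]                 # frame-to-frame joint motion
        scores.append(velocity.var(axis=(0, 2)))      # per-joint variance, shape (N,)
    score = np.mean(scores, axis=0)
    return np.argsort(score)[::-1], score

def select_fmij(sequences, k=10):
    """FMIJ-style selection: keep a fixed number k of top-ranked joints."""
    order, _ = rank_joints_by_motion_variance(sequences)
    return np.sort(order[:k])

def select_amij(sequences, coverage=0.9):
    """AMIJ-style selection: keep as many joints as needed to cover a fraction of
    the total score, so the subset size adapts to the action class."""
    order, score = rank_joints_by_motion_variance(sequences)
    cumulative = np.cumsum(score[order]) / score.sum()
    k = int(np.searchsorted(cumulative, coverage)) + 1
    return np.sort(order[:k])
```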
2.2 Proposed methods

2.2.1 Preset Joint Subset Selection

In Preset JSS, a set of joints is selected to represent actions, depending on the dataset. Based on the observation that human actions mainly involve joints on the arms, legs, and head, 13 joints are pre-determined for action representation and recognition (refer to Figure 2.2). The system diagram for Preset JSS is shown in Figure 2.3. The feature vector is formed by combining joint positions and velocities; each action is a sequence of skeleton poses.

Figure 2.2: Preset JSS for action representation (using the joints in blue).

Denote $N$ as the number of joints in the skeleton model. The coordinates of the $i$-th joint at the $t$-th time frame are expressed as:

$p_i(t) = [x_i(t), y_i(t), z_i(t)]$   (2.1)

The human pose at the $t$-th time frame is expressed by the $N$ joints:

$p(t) = [p_1(t), p_2(t), \dots, p_N(t)]$   (2.2)

Motivated by the approach of Ghorbel et al., the joint position $p(t)$ and the joint velocity $V(t)$ are used to represent human action. The joint velocity is defined as:

$V(t) = \{p_i(t+1) - p_i(t-1) \mid i = 1, \dots, N\}$   (2.3)

The joint velocity in the first/last frame is set equal to the velocity of its adjacent frame. The feature vector $F$ is formed by concatenating the joint position with the joint velocity:

$F(t) = [p(t), V(t)]$   (2.4)

Temporal normalization is applied using Dynamic Time Warping (DTW), and the Fourier Temporal Pyramid (FTP) helps handle noise in the skeleton data. Classification is performed via one-versus-all SVMs. The proposed method is evaluated on MSR-Action3D and CMDFALL.
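A minimal sketch of the feature construction in Eqs. (2.1)-(2.4) is given below; the preset joint indices are hypothetical placeholders, and DTW/FTP normalization and the SVM classifier are omitted.

```python
import numpy as np

PRESET_JOINTS = [0, 2, 3, 4, 5, 7, 8, 9, 11, 13, 15, 17, 19]   # hypothetical 13-joint subset

def joint_velocity(p):
    """Eq. (2.3): central-difference velocity per joint.
    p: array of shape (T, N, 3); returns an array of the same shape."""
    v = np.empty_like(p)
    v[1:-1] = p[2:] - p[:-2]
    v[0], v[-1] = v[1], v[-2]          # first/last frame copy the adjacent velocity
    return v

def preset_jss_features(p, joints=PRESET_JOINTS):
    """Eqs. (2.1)-(2.4): keep the preset joints, then concatenate position and velocity."""
    p = p[:, joints, :]                                       # Preset JSS
    f = np.concatenate([p, joint_velocity(p)], axis=-1)       # F(t) = [p(t), V(t)]
    return f.reshape(len(f), -1)                              # one feature vector per frame
```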
Table 2.1: Accuracy (%) comparison of MIJ methods with existing methods on MSR-Action3D.

| No | Method | AS1 | AS2 | AS3 |
|----|--------|-----|-----|-----|
| 1 | Action Graph, 2010 | 72.9 | 71.9 | 79.2 |
| 2 | Histogram, 2012 | 87.98 | 85.48 | 63.46 |
| 3 | EigenJoints, 2012 | 74.5 | 76.1 | 96.4 |
| 4 | Cov3DJ, 2013 | 88.04 | 89.29 | 94.29 |
| 5 | Joint Position (JP), 2014 | 93.36 | 85.53 | 99.55 |
| 6 | Relative JP (RJP), 2014 | 95.77 | 86.9 | 99.28 |
| 7 | Joint Angle (JA), 2014 | 84.51 | 68.05 | 96.17 |
| 8 | Absolute SE(3), 2014 | 90.3 | 83.91 | 95.39 |
| 9 | LARP, 2014 | 94.72 | 86.83 | 99.02 |
| 10 | Spline Curve, 2015 | 83.08 | 79.46 | 93.69 |
| 11 | Multi-fused, 2017 | 90.8 | 93.4 | 95.7 |
| 12 | CovP3DJ, 2018 | 93.48 | 84.82 | 94.29 |
| 13 | CovMIJ, 2018 | 93.48 | 90.18 | 97.14 |
| 14 | Lie Algebra with VFDT, 2020 | 94.66 | 85.08 | 96.76 |
| 15 | Preset JSS | 95.86 | 91.27 | 99.47 |
| 16 | Preset JSS using Covariance Descriptors | 95.7 | 91.1 | 96.2 |
| 17 | Proposed (FMIJ) | 95.7 | 92.9 | 98.1 |
| 18 | Proposed (AMIJ) | 96.7 | 92.9 | 99.0 |

Table 2.2: Performance evaluation for FMIJ/AMIJ on CMDFALL.

| No | Method | Year | F1-score (%) |
|----|--------|------|--------------|
| 1 | Cov3DJ | 2013 | 61 |
| 2 | Joint Position (JP) | 2014 | 49.18 |
| 3 | Res-TCN | 2017 | 39.38 |
| 4 | CovMIJ | 2018 | 62.5 |
| 5 | CNN | 2019 | 40.34 |
| 6 | CNN-LSTM | 2019 | 39.24 |
| 7 | CNN-Velocity | 2019 | 46.13 |
| 8 | CNN-LSTM-Velocity | 2019 | 45.23 |
| 9 | RA-GCN | 2019 | 58.63 |
| 10 | Preset JSS | 2019 | 52.86 |
| 11 | Preset JSS using Covariance Descriptors | 2019 | 60.2 |
| 12 | Proposed (FMIJ) | - | 64 |
| 13 | Proposed (AMIJ) | - | 64 |

2.4 Conclusion of the chapter

Both heuristic and statistical metric-based methods improve the performance of action recognition. An efficient Preset JSS is proposed to simplify the joint selection procedure; Preset JSS outperforms the baseline method JP on the evaluation datasets. For automatic joint subset selection, FMIJ and AMIJ are proposed, with covariance descriptors computed from joint positions and joint velocities. Both FMIJ and AMIJ outperform the baseline method Cov3DJ on the evaluation datasets. FMIJ/AMIJ perform better than Preset JSS but require more computational time. The proposed JSS methods are stable on skeleton data collected from different sources. The main results in this chapter are published in [C1], [C2], and [J2].

CHAPTER 3. FEATURE FUSION FOR THE GRAPH CONVOLUTIONAL NETWORK

3.1 Introduction

Joints on the human body are arranged in a certain order, with a graph structure in nature. However, the methods discussed in Chapter 2 have not focused on the graph nature of skeleton data. Early approaches employ manual feature engineering to extract features using various rules; handcrafted feature engineering has limited performance and is difficult to generalize. Deep learning networks such as CNNs and RNNs have recently been used for action recognition with skeleton data. However, these methods cannot capture the joint arrangement in the skeleton model, which is important for action recognition. A recent direction is to represent skeleton sequences as graphs. In this chapter, graph-based deep learning models are studied to improve action recognition performance on noisy skeleton data. The goal is to create an efficient technique that uses joint offsets in skeleton sequences.

3.2 Related Work on Graph Convolutional Networks

The use of convolution operators in the Graph Convolutional Network (GCN) is an extension from the image domain to the graph domain. GCNs were first introduced for HAR in the Spatial-Temporal GCN (ST-GCN), which models skeleton data as graphs. The Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) was recently introduced using adaptive graphs; AAGCN's system diagram is a stack of ten basic blocks.
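To make the graph-convolution idea concrete, the sketch below applies a single normalized-adjacency graph convolution to a batch of skeleton features. It is a generic ST-GCN-style spatial layer written for illustration, not the adaptive, attention-enhanced layer used in AAGCN.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One spatial graph convolution over skeleton joints: features are mixed along
    the joint dimension by a normalized adjacency, then projected channel-wise."""
    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        a = adjacency + torch.eye(adjacency.size(0))                   # add self-loops
        d = a.sum(dim=1).pow(-0.5)
        self.register_buffer("a_norm", d[:, None] * a * d[None, :])    # D^-1/2 (A+I) D^-1/2
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        x = torch.einsum("nctv,vw->nctw", x, self.a_norm)              # aggregate neighbor joints
        return self.proj(x)

# Example: 20-joint skeleton, 6 input channels (e.g., RJP + velocity), 64 output channels.
A = torch.zeros(20, 20)   # set A[i, j] = A[j, i] = 1 for each bone of the skeleton (omitted here)
layer = SpatialGraphConv(6, 64, A)
out = layer(torch.randn(8, 6, 300, 20))   # -> (8, 64, 300, 20)
```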
3.3 Proposed method

In this chapter, a method using the baseline Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) with a Feature Fusion module is proposed. The system diagram of the proposed method is shown in Figure 3.1. The purpose of the Feature Fusion module is to enable the extraction of useful information for action representation; it also helps reduce the influence of noise in skeleton data. Data from the Feature Fusion module are normalized using Batch Normalization (BN). There are ten basic blocks: $B_1, B_2, \dots, B_{10}$. The first four blocks $B_1, \dots, B_4$ have 64 output channels each, the next three blocks $B_5, B_6, B_7$ have 128 output channels, and the last three blocks $B_8, B_9, B_{10}$ have 256 output channels. The number of output channels of each block equals the number of filters employed in its convolutional operation. The goal of these settings is to use trainable parameters to extract graph features at multiple scales. The stride is set to two for $B_5$ and $B_8$ to reduce the frame length. A global average pooling (GAP) layer is used to pool the feature maps. GAP is an efficient mechanism to reduce tensor dimensions and speed up computation. One advantage of the GAP layer is that it enforces correspondences between feature maps and action classes; as a result, the feature maps can be interpreted as confidence maps for action classes. Another benefit of global average pooling is that it has no tuned parameters, so overfitting is avoided at this layer. Softmax is used for action classification.

Figure 3.1: The system diagram consists of a Feature Fusion (FF) module, a Batch Normalization (BN) layer, ten basic blocks, a Global Average Pooling (GAP) layer, and a Softmax layer.

Figure 3.2 shows the diagram of a basic block: a spatial GCN layer (Convs), an attention module, and a temporal GCN layer (Convt). A BN and a ReLU layer follow the spatial and temporal GCN layers. A residual connection is included in each basic block to avoid gradient vanishing.

Figure 3.2: Spatial-Temporal Basic Block.

The input tensor dimensions are as follows:
• #Channels: Three dimensions x, y, z of the joints are used in AAGCN, so the number of channels is three. The Feature Fusion module combines the Relative Joint Position and joint velocity, so the number of channels at the output of the Feature Fusion module is six.
• #Frames: All sequences are normalized to the same length Tmax; skeleton sequences are padded by repeating samples to reach Tmax.
• #Joints: The number of joints in the skeleton model.
• #Actors: The maximum number of actors in each scene.

The dataset parameters are shown in Table 3.1.

Table 3.1: Dataset parameters.

| No | Dataset | #Channels | #Frames | #Joints | #Actors |
|----|---------|-----------|---------|---------|---------|
| 1 | CMDFALL | 3 | 600 | 20 | 1 |
| 2 | MICA-Action3D | 3 | 175 | 20 | 1 |
| 3 | NTU RGB+D | 3 | 300 | 25 | 2 |

In AAGCN, joint coordinates are used as inputs. When there is noise in the skeleton data, employing joint coordinates may result in inaccurate recognition; such variations can be reduced by using the Relative Joint Position (RJP). In addition, there are actions with similar joint coordinate sequences, which can be discriminated by the velocity. In this work, the Feature Fusion module incorporates two features: RJP and joint velocity. The coordinates of the $i$-th joint at the $t$-th time frame are represented as:

$p_i(t) = [x_i(t), y_i(t), z_i(t)]$   (3.1)

The skeleton at the $t$-th frame consists of $N$ joints:

$p(t) = [p_1(t), p_2(t), \dots, p_N(t)]$   (3.2)

In the skeleton model, the RJP is defined as the spatial offset between a joint and the center joint $p_c$, as shown in Figure 3.3. For the datasets under consideration, the spine joint is chosen as the center joint $p_c$. The RJP can be stated mathematically as:

$RJP(t) = \{p_i(t) - p_c(t) \mid i = 1, \dots, N\}$   (3.3)

Figure 3.3: RJP for (a) the Microsoft Kinect v1 skeleton with 20 joints and (b) the Microsoft Kinect v2 skeleton with 25 joints.

The joint velocity is defined in Eq. (2.3). The feature vector $F$ is created by combining RJP and joint velocity across the channel dimension:

$F(t) = [RJP(t), V(t)]$   (3.4)
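A minimal sketch of this pre-processing step is shown below; the center-joint index is dataset-dependent and is an assumption here.

```python
import torch

def feature_fusion(p, center_joint=1):
    """Feature Fusion of Eqs. (3.3)-(3.4).
    p: joint coordinates of shape (batch, 3, frames, joints).
    Returns a 6-channel tensor: RJP concatenated with joint velocity."""
    rjp = p - p[:, :, :, center_joint:center_joint + 1]      # offset to the center (spine) joint
    v = torch.zeros_like(p)
    v[:, :, 1:-1] = p[:, :, 2:] - p[:, :, :-2]                # central-difference velocity, Eq. (2.3)
    v[:, :, 0], v[:, :, -1] = v[:, :, 1], v[:, :, -2]
    return torch.cat([rjp, v], dim=1)                         # channels: 3 + 3 = 6

x = torch.randn(8, 3, 300, 25)        # batch of NTU-sized skeleton sequences
print(feature_fusion(x).shape)         # torch.Size([8, 6, 300, 25])
```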
3.4 Experimental results

Three benchmark datasets are used for evaluation: CMDFALL, MICA-Action3D, and NTU RGB+D. Data from half of the subjects are used for training, whereas data from the remaining subjects are used for testing. The proposed method is implemented on a server with an Intel i7-8700 CPU, 32 GB of memory, and a GeForce GTX 1080Ti GPU. Table 3.2 shows the results obtained on CMDFALL using the Joint Position, the Joint Velocity, the RJP, and the Feature Fusion module.

Table 3.2: Ablation study on CMDFALL.

| No | Method | Precision (%) | Recall (%) | F1 (%) |
|----|--------|---------------|------------|--------|
| 1 | AAGCN using Joint Position | 65.7 | 65.57 | 65.11 |
| 2 | AAGCN using Joint Velocity | 68.64 | 69.7 | 68.54 |
| 3 | AAGCN using RJP | 69.15 | 69.72 | 69.04 |
| 4 | Proposed (FF-AAGCN) | 77.87 | 78.52 | 77.59 |

Table 3.3 compares the proposed method with state-of-the-art methods on the CMDFALL dataset. On CMDFALL, the proposed method achieves state-of-the-art performance with an F1-score of up to 77.59%, whereas the baseline method only achieves 65.11%. Figure 3.4 shows a visualization of CMDFALL using t-Distributed Stochastic Neighbor Embedding (t-SNE).

Figure 3.4: Distribution of the 20 action classes in CMDFALL obtained by (a) AAGCN and (b) the proposed method, using t-SNE.

Table 3.3: Performance evaluation on CMDFALL with Precision, Recall, and F1 scores (%).

| No | Method | Year | Precision | Recall | F1 |
|----|--------|------|-----------|--------|-----|
| 1 | Cov3DJ | 2013 | - | - | 61 |
| 2 | Joint Position (JP) | 2014 | - | - | 49.18 |
| 3 | Res-TCN | 2017 | - | - | 39.38 |
| 4 | CovMIJ | 2018 | - | - | 62.5 |
| 5 | CNN | 2019 | 48.68 | 41.78 | 40.34 |
| 6 | CNN-LSTM | 2019 | 45.24 | 40.58 | 39.24 |
| 7 | CNN-Velocity | 2019 | 49.97 | 47.89 | 46.13 |
| 8 | CNN-LSTM-Velocity | 2019 | 47.64 | 46.51 | 45.23 |
| 9 | RA-GCN | 2019 | 61.18 | 59.28 | 58.63 |
| 10 | AAGCN | 2020 | 65.7 | 65.57 | 65.11 |
| 11 | AS-RAGCN | 2020 | 75.82 | 74.81 | 74.9 |
| 12 | Preset JSS | 2019 | - | - | 52.86 |
| 13 | Preset JSS using Covariance | 2019 | - | - | 60.2 |
| 14 | FMIJ (Chapter 2) | 2021 | - | - | 64 |
| 15 | AMIJ (Chapter 2) | 2021 | - | - | 64 |
| 16 | Proposed (FF-AAGCN) | - | 77.87 | 78.52 | 77.59 |

Table 3.4: Performance evaluation by accuracy (%) on NTU RGB+D.

| No | Method | Year | CS | CV |
|----|--------|------|-----|-----|
| 1 | Bi-directional RNN | 2015 | 59.1 | 64.0 |
| 2 | Part-based LSTM | 2016 | 60.7 | 67.3 |
| 3 | ST-LSTM | 2016 | 69.2 | 77.7 |
| 4 | STA-LSTM | 2016 | 73.4 | 81.2 |
| 5 | VA-LSTM | 2017 | 79.2 | 87.7 |
| 6 | ARRN-LSTM | 2018 | 80.7 | 88.8 |
| 7 | IndRNN | 2018 | 81.8 | 88.0 |
| 8 | SRN+TSL | 2018 | 84.8 | 92.4 |
| 9 | Res-TCN | 2017 | 74.3 | 83.1 |
| 10 | Clip CNN | 2017 | 79.6 | 84.8 |
| 11 | Synthesized CNN | 2017 | 80.0 | 87.2 |
| 12 | Motion CNN | 2017 | 83.2 | 89.3 |
| 13 | Multi-scale CNN | 2017 | 85.0 | 92.3 |
| 14 | ST-GCN | 2018 | 81.5 | 88.3 |
| 15 | GCNN | 2018 | 83.5 | 89.8 |
| 16 | Dense IndRNN | 2019 | 86.7 | 94.0 |
| 17 | AS-GCN | 2019 | 86.8 | 94.2 |
| 18 | AGCN | 2019 | 87.3 | 93.7 |
| 19 | 3s RA-GCN | 2020 | 87.3 | 93.6 |
| 20 | AS-RAGCN | 2020 | 87.7 | 92.9 |
| 21 | AAGCN | 2020 | 88.0 | 95.1 |
| 22 | Proposed (FF-AAGCN) | - | 88.2 | 94.8 |

3.5 Conclusion of the chapter

In this chapter, an action recognition method is proposed based on integrating a Feature Fusion module into the AAGCN; the proposed system is named FF-AAGCN. RJP and joint velocity are combined in the Feature Fusion module. FF-AAGCN outperforms the baseline method AAGCN on the challenging CMDFALL dataset. On NTU RGB+D, the proposed method achieves a cross-subject accuracy of 88.2% and a cross-view accuracy of 94.8%, which is competitive with AAGCN. Further improvement is observed when the Feature Fusion uses velocities computed with different frame offsets. Results in this chapter are published in [C3], [J1], and [J3].

CHAPTER 4. THE PROPOSED LIGHTWEIGHT GRAPH CONVOLUTIONAL NETWORK

4.1 Introduction

As shown in Chapter 2, Joint Subset Selection (JSS) is efficient for action representation, and the Feature Fusion module proposed in Chapter 3 improves the performance of the graph-based network on challenging datasets. All those methods mainly focus on improving recognition accuracy. However, low-complexity deep learning models are required for application development on edge devices, and there are different approaches for reducing the number of parameters of deep learning models. In this chapter, a lightweight model is proposed by pruning layers of the deep learning network and combining Preset JSS with the Feature Fusion module studied in Chapters 2 and 3. Two graphs are defined using the joints selected by Preset JSS. A demo using the proposed lightweight model for action recognition is also presented.

4.2 Related work on Lightweight Graph Convolutional Networks

On large-scale datasets, GCN-based models perform exceptionally well; however, they require a lot of computing power. Efforts have been made to develop lightweight models.
4.3 Proposed method

A lightweight model is proposed based on the FF-AAGCN introduced in Chapter 3. The purpose is to design a deep learning model with fewer model parameters; the proposed lightweight model is named LW-FF-AAGCN. In FF-AAGCN, there are ten basic blocks with different numbers of output channels: four blocks with 64 output channels, three blocks with 128 output channels, and three blocks with 256 output channels. Only three basic blocks with 128 output channels are used in LW-FF-AAGCN. In addition, two graphs based on the Preset JSS are proposed. The diagram of the proposed method is shown in Figure 4.1. The Feature Fusion module pre-processes the skeleton data. Preset JSS (optional) is applied to select 13 joints from the skeleton model, as in Chapter 2, and two graphs are defined based on the selected joints. The output of the JSS is fed into a Batch Normalization (BN) layer for data normalization. Output data from the BN layer are transferred to three basic blocks, B1, B2, and B3, each with 128 output channels. The output of B3 is fed into a global average pooling (GAP) layer, which creates a 128-dimensional feature vector. The feature vector is passed to a softmax layer for classification.

Figure 4.1: Diagram of the proposed LW-FF-AAGCN.

The Preset JSS scheme from Chapter 2 is used to select 13 joints (marked in blue) from the skeleton model of 20 joints, as shown in Figure 4.2(a); the joints marked in red are not used for action recognition. Two graphs are proposed using the subset of 13 blue joints: JSS graph type A (JSS-A) and JSS graph type B (JSS-B). In JSS-A, the elbow and knee joints are connected to the head joint to form a connected graph from the selected joints, as shown in Figure 4.2(b). For JSS-B, symmetrical connections are added for the pairs of elbows, wrists, knees, and ankles, as shown in Figure 4.2(c). The connections between these symmetrical joints are important since, in each pair, the joints move in different directions in many actions such as running and walking.

Figure 4.2: (a) Preset selection of 13 joints (blue) from the skeleton model of 20 joints. (b) Graph type A (JSS-A), defined by the solid edges connecting the 13 blue joints. (c) Graph type B (JSS-B), defined by the solid green edges connecting the 13 blue joints, with additional edges between symmetrical joints. Only the spatial dimension of the graph is shown for simplicity.

4.4 Experimental results

The lightweight model consists of a Feature Fusion module (FF), layer pruning (LW), and the Preset JSS. An ablation study is conducted to see how each component of the proposed method contributes to the overall results; comparison results are shown in Table 4.1. Experiments are conducted on a server with an Intel i7-8700 CPU, 32 GB of memory, and a GeForce GTX 1080Ti GPU.

Table 4.1: Ablation study on CMDFALL. Performance scores are in percentage. Abbreviations: Feature Fusion (FF), Lightweight (LW), Joint Subset Selection (JSS).

| No | Method | FF | LW | JSS | Precision | Recall | F1 |
|----|--------|----|----|-----|-----------|--------|-----|
| 1 | AAGCN | ✗ | ✗ | ✗ | 65.70 | 65.57 | 65.11 |
| 2 | LW-AAGCN | ✗ | ✓ | ✗ | 67.03 | 66.44 | 66.39 |
| 3 | FF-AAGCN (Chapter 3) | ✓ | ✗ | ✗ | 77.87 | 78.52 | 77.59 |
| 4 | LW-FF-AAGCN | ✓ | ✓ | ✗ | 80.64 | 81.48 | 80.59 |
| 5 | LW-FF-AAGCN with JSS-A | ✓ | ✓ | ✓ | 79.73 | 80.20 | 79.56 |
| 6 | LW-FF-AAGCN with JSS-B | ✓ | ✓ | ✓ | 81.00 | 80.97 | 80.63 |

A minor improvement is observed when only the lightweight modification is applied to the baseline AAGCN, with an F1-score of 66.39%. Using the Feature Fusion module only, the F1-score is 77.59%, as reported in Chapter 3. When combining both LW and FF, the F1-score of LW-FF-AAGCN is 80.59%. Then, JSS graph types A and B are evaluated; the F1-scores are 79.56% and 80.63% for JSS-A and JSS-B, respectively. This means that, for CMDFALL, adding symmetrical connections helps improve recognition performance.
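The sketch below shows how such a joint-subset graph could be encoded as an adjacency matrix for the graph convolution; the selected joint indices and the bone/symmetry edge lists are hypothetical placeholders, not the exact 13-joint layout used in the dissertation.

```python
import numpy as np

# Hypothetical indices of the 13 selected joints within a 20-joint Kinect v1 skeleton.
SELECTED = [3, 4, 5, 7, 8, 9, 11, 12, 13, 15, 16, 17, 19]

# Bones among the selected joints (pairs of positions in SELECTED), e.g. limb chains
# plus links from elbows/knees to the head so the JSS-A graph stays connected.
BONES = [(0, 1), (1, 2), (0, 4), (4, 5), (5, 6), (0, 7), (7, 8), (8, 9), (0, 10), (10, 11), (11, 12)]

# Extra edges of JSS-B: symmetrical pairs (elbows, wrists, knees, ankles).
SYMMETRIC = [(1, 4), (2, 5), (8, 11), (9, 12)]

def build_adjacency(num_joints, edges):
    """Symmetric 0/1 adjacency over the selected joints."""
    a = np.zeros((num_joints, num_joints), dtype=np.float32)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a

A_jss_a = build_adjacency(len(SELECTED), BONES)               # graph type A
A_jss_b = build_adjacency(len(SELECTED), BONES + SYMMETRIC)   # graph type B adds symmetric edges
```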
Performance comparison between LW-FF-AAGCN and existing methods on CMDFALL is shown in Table 4.2. LW-FF-AAGCN achieves an F1-score of up to 80.59%, which is 3% higher than FF-AAGCN and 14.44% higher than the baseline AAGCN. Cov3DJ and Joint Position (JP) are the baselines for the methods proposed in Chapter 2. Res-TCN is the method used in the paper that introduced the CMDFALL dataset. CovMIJ is a variant of FMIJ, as discussed in Chapter 2. CNN and CNN-Velocity are methods based on the Convolutional Neural Network, while CNN-LSTM and CNN-LSTM-Velocity are hybrids of Convolutional and Recurrent Neural Networks. RA-GCN, AAGCN, and AS-RAGCN are Graph Convolutional Networks. The remaining entries are the methods proposed in Chapters 2 and 3.

Table 4.2: Performance comparison of different methods on CMDFALL. Performance scores are in percentage.

| No | Method | Year | Precision | Recall | F1 |
|----|--------|------|-----------|--------|-----|
| 1 | Cov3DJ | 2013 | - | - | 61 |
| 2 | Joint Position (JP) | 2014 | - | - | 49.18 |
| 3 | Res-TCN | 2017 | - | - | 39.38 |
| 4 | CovMIJ | 2018 | - | - | 62.5 |
| 5 | CNN | 2019 | 48.68 | 41.78 | 40.34 |
| 6 | CNN-LSTM | 2019 | 45.24 | 40.58 | 39.24 |
| 7 | CNN-Velocity | 2019 | 49.97 | 47.89 | 46.13 |
| 8 | CNN-LSTM-Velocity | 2019 | 47.64 | 46.51 | 45.23 |
| 9 | RA-GCN | 2019 | 61.18 | 59.28 | 58.63 |
| 10 | AAGCN | 2020 | 65.7 | 65.57 | 65.11 |
| 11 | AS-RAGCN | 2020 | 75.82 | 74.81 | 74.9 |
| 12 | Preset JSS (Chapter 2) | 2019 | - | - | 52.86 |
| 13 | Preset JSS using Covariance (Chapter 2) | 2019 | - | - | 60.2 |
| 14 | FMIJ (Chapter 2) | 2021 | - | - | 64 |
| 15 | AMIJ (Chapter 2) | 2021 | - | - | 64 |
| 16 | FF-AAGCN (Chapter 3) | 2021 | 77.87 | 78.52 | 77.59 |
| 17 | Proposed (LW-FF-AAGCN) | - | 80.64 | 81.48 | 80.59 |
| 18 | Proposed (LW-FF-AAGCN JSS-A) | - | 79.73 | 80.20 | 79.56 |
| 19 | Proposed (LW-FF-AAGCN JSS-B) | - | 81.00 | 80.97 | 80.63 |

The numbers of model parameters and the computation requirements on CMDFALL are shown in Table 4.3. The computation metric is the number of Floating-Point Operations (FLOPs), i.e., the number of operations required to classify one action sample. It can be seen that the lightweight design reduces the number of model parameters by up to 5.6 times, and the lightweight model with JSS requires 1.74 times fewer FLOPs than the baseline AAGCN.

Table 4.3: Model parameters and computation requirements on CMDFALL.

| No | Method | FF | LW | JSS | Param | FLOPs |
|----|--------|----|----|-----|-------|-------|
| 1 | AAGCN | ✗ | ✗ | ✗ | 3.74M | 50.94G |
| 2 | LW-AAGCN | ✗ | ✓ | ✗ | 0.66M | 44.81G |
| 3 | FF-AAGCN (Chapter 3) | ✓ | ✗ | ✗ | 3.75M | 50.98G |
| 4 | LW-FF-AAGCN | ✓ | ✓ | ✗ | 0.66M | 44.85G |
| 5 | LW-FF-AAGCN with JSS-A | ✓ | ✓ | ✓ | 0.66M | 29.15G |
| 6 | LW-FF-AAGCN with JSS-B | ✓ | ✓ | ✓ | 0.66M | 29.15G |

On the large-scale NTU RGB+D dataset, the AAGCN model has a total of 3.76 million parameters, while the proposed model has 0.67 million parameters, 5.6 times fewer than the baseline. The proposed method achieves a reasonable accuracy trade-off compared with AAGCN, as shown in Table 4.4. For the cross-subject benchmark, the accuracy of the proposed method is 86.9%, whereas that of AAGCN is 88.0%; for the cross-view benchmark, the accuracy of the proposed method is 92.7%, and that of AAGCN is 95.1%.

Table 4.4: Comparison of the numbers of model parameters, FLOPs, and accuracy (%) on NTU RGB+D.

| No | Method | Year | Param | FLOPs | CS (%) | CV (%) |
|----|--------|------|-------|-------|--------|--------|
| 1 | LSTM-CNN | 2017 | 60M | - | 82.9 | 90.1 |
| 2 | SR-TSL | 2018 | 19.1M | 4.2G | 84.8 | 92.4 |
| 3 | HCN | 2018 | 2.64M | - | 86.5 | 91.1 |
| 4 | ST-GCN | 2018 | 3.1M | 16.32G | 81.5 | 88.3 |
| 5 | DCM | 2019 | 10M | - | 84.5 | 91.3 |
| 6 | AS-GCN | 2019 | 7.1M | 35.92G | 86.8 | 94.2 |
| 7 | RA-GCNv1 | 2019 | 6.21M | 32.8G | 85.9 | 93.5 |
| 8 | AGCN | 2019 | 3.47M | 18.66G | 87.3 | 93.7 |
| 9 | RA-GCNv2 | 2020 | 6.21M | 32.8G | 87.3 | 93.6 |
| 10 | AAGCN | 2020 | 3.76M | 16.43G | 88.0 | 95.1 |
| 11 | SAR-NAS | 2020 | 1.3M | - | 86.4 | 94.3 |
| 12 | AS-RAGCN | 2020 | 4.88M | 10.2G | 87.7 | 92.9 |
| 13 | STAR-64 | 2021 | 0.42M | - | 81.9 | 88.9 |
| 14 | STAR-128 | 2021 | 1.26M | - | 83.4 | 89.0 |
| 15 | FF-AAGCN (Chapter 3) | 2021 | 3.76M | 16.44G | 88.2 | 94.8 |
| 16 | LW-FF-AAGCN | - | 0.67M | 14.26G | 86.9 | 92.7 |
| 17 | LW-FF-AAGCN JSS-A | - | 0.66M | 7.42G | 84.1 | 90.1 |
| 18 | LW-FF-AAGCN JSS-B | - | 0.66M | 7.42G | 83.5 | 90.1 |

4.5 Application Demonstration

In this section, a demo is developed to score a person's performance of the actions in MSR-Action3D. Due to time limitations, only the result of the action recognition module is introduced; the related modules, such as action spotting, human pose evaluation, and scoring/assessment, are out of the scope of this study.
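As stated in the constraints, pose estimation for the demo uses Google MediaPipe. The sketch below illustrates one possible way to turn a video into a skeleton sequence for the recognizer; the frame loop and the recognize call are placeholders, and only the MediaPipe Pose calls follow the library's public API.

```python
import cv2
import mediapipe as mp
import numpy as np

def video_to_skeleton(video_path):
    """Estimate a single-person skeleton sequence (T, 33, 3) from a video with MediaPipe Pose.
    Note: MediaPipe uses a 33-landmark body model, not the 20/25-joint Kinect models."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks:
                frames.append([[lm.x, lm.y, lm.z] for lm in result.pose_landmarks.landmark])
    cap.release()
    return np.array(frames, dtype=np.float32)

# skeleton = video_to_skeleton("yoga_exercise.mp4")
# label = recognize(skeleton)   # hypothetical recognizer, e.g. the trained LW-FF-AAGCN
```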
4.6 Conclusion of the chapter

In this chapter, the lightweight model LW-FF-AAGCN is proposed. Layer pruning of the deep learning network AAGCN is combined with a Preset JSS module and a Feature Fusion module. Once Preset JSS is enabled, two graph topologies (JSS-A and JSS-B) are defined for the selected joints. Graph type B (JSS-B), with edges connecting symmetrical joints, achieves excellent performance on CMDFALL with fewer model parameters and FLOPs. The number of parameters is reduced using Preset JSS and layer pruning. Experimental results show that the lightweight model with graph type B (JSS-B) outperforms the baseline AAGCN on challenging datasets with 5.6 times fewer trainable parameters than the baseline, and its computational complexity in FLOPs is 3.5 times lower than that of the baseline on CMDFALL. A study is conducted to evaluate the performance of LW-FF-AAGCN with different dataset sizes, and a demo using the proposed method for human action recognition is presented. Results in this chapter have been submitted to Multimedia Tools and Applications (MTAP), an ISI Q1 journal, in the paper "A Lightweight Graph Convolutional Network for Skeleton-based Action Recognition".

CONCLUSION AND FUTURE WORKS

Conclusions

In this dissertation, skeleton-based action recognition methods are proposed, with three main contributions. The first contribution is on joint subset selection, with both preset and automatic schemes that help improve the performance of action recognition. In the second contribution, a Feature Fusion module is coupled with AAGCN to form FF-AAGCN. The Feature Fusion is a simple and efficient data pre-processing module for graph-based deep learning, especially for noisy skeleton data. The proposed FF-AAGCN outperforms the baseline AAGCN on CMDFALL, a challenging dataset with noise in the skeleton data; on a large-scale dataset such as NTU RGB+D, FF-AAGCN also obtains competitive results compared to AAGCN. The third contribution is the lightweight model LW-FF-AAGCN, whose number of model parameters is 5.6 times smaller than the baseline. The proposed lightweight model is suitable for application development on edge devices with limited computation capacity. LW-FF-AAGCN outperforms both AAGCN and FF-AAGCN on CMDFALL.

Future work

Short-term perspectives:
• Study the noise in skeleton data caused by pose estimation errors of RGB-D sensors; a standard calibrated MoCap system is required for evaluation.
• Study different statistical metrics for Joint Subset Selection, such as the variance of joint angles, in graph-based deep learning networks.
• Develop graph-based lightweight models for application development on edge devices; as computation capacity is limited on edge devices, lightweight models are required for real-time applications.
• Study the interpretability of action recognition using graph-based deep learning.
• Improve the quality of pose estimation for high-performance action recognition.

Long-term perspectives:
• Extend the proposed methods to continuous skeleton-based human action recognition.
• Extend the study of Graph Convolutional Networks to Geometric Deep Learning, an approach that unifies deep learning models by exploring the common mathematics behind them.
• Develop applications using the proposed models for human action recognition, such as elderly monitoring in healthcare or camera surveillance for abnormal behavior detection.
PUBLICATIONS

Conferences

[C1] Tien-Nam Nguyen, Dinh-Tan Pham, Thi-Lan Le, Hai Vu, and Thanh-Hai Tran (2018), "Novel Skeleton-based Action Recognition Using Covariance Descriptors on Most Informative Joints", Proceedings of the International Conference on Knowledge and Systems Engineering (KSE 2018), IEEE, Vietnam, ISBN: 978-1-5386-6113-0, pp. 50-55, 2018.

[C2] Dinh-Tan Pham, Tien-Nam Nguyen, Thi-Lan Le, and Hai Vu (2019), "Analyzing Role of Joint Subset Selection in Human Action Recognition", Proceedings of the NAFOSTED Conference on Information and Computer Science (NICS 2019), IEEE, Vietnam, ISBN: 978-1-7281-5163-2, pp. 61-66, 2019.

[C3] Dinh-Tan Pham, Tien-Nam Nguyen, Thi-Lan Le, and Hai Vu (2020), "Spatio-Temporal Representation for Skeleton-based Human Action Recognition", Proceedings of the International Conference on Multimedia Analysis and Pattern Recognition (MAPR 2020), IEEE, Vietnam, ISBN: 978-1-7281-6555-4, pp. 1-6, 2020.

Journals

[J1] Dinh-Tan Pham, Quang-Tien Pham, Thi-Lan Le, and Hai Vu (2021), "An Efficient Feature Fusion of Graph Convolutional Networks and its Application for Real-Time Traffic Control Gestures Recognition", IEEE Access, ISSN: 2169-3536, pp. 121930-121943, 2021 (ISI, Q1).

[J2] Van-Toi Nguyen, Tien-Nam Nguyen, Thi-Lan Le, Dinh-Tan Pham, and Hai Vu (2020), "Adaptive most joint selection and covariance descriptions for a robust skeleton-based human action recognition", Multimedia Tools and Applications (MTAP), Springer, DOI: 10.1007/s11042-021-10866-4, pp. 1-27, 2021 (ISI, Q1).

[J3] Dinh Tan Pham, Thi Phuong Dang, Duc Quang Nguyen, Thi Lan Le, and Hai Vu (2021), "Skeleton-based Action Recognition Using Feature Fusion for Spatial-Temporal Graph Convolutional Networks", Journal of Science and Technique, Le Quy Don Technical University (LQDTU-JST), ISSN: 1859-0209, pp. 7-24, 2021.
