Early action prediction, i.e., predicting the label of an action before it is fully executed, is a promising application for medical monitoring, security surveillance, autonomous driving, and human-computer interaction. Different from the traditional action recognition task, which recognizes actions from full videos, early action prediction aims to predict the action label from partially observed videos with incomplete action executions. Current human movement recognition systems follow two mainstream approaches: the first uses a 3D skeleton sequence as input, while the second uses an RGB image sequence. Compared to RGB data, skeleton information in 3D space provides richer and more accurate information for representing the movement of the human body. Since RGB input is affected by noise such as lighting changes, background clutter, and clothing texture, 3D skeleton information has the advantage of higher noise immunity. Considering these advantages and disadvantages, this work focuses on 3D skeleton sequences, both complete and, more importantly, incomplete time series. This work proposes a knowledge-distillation-based attentional adaptive graph convolutional network (KD-AAGCN). The remainder of this thesis is organized into three chapters: Chapter 1 presents the literature review, Chapter 2 explains the proposed method in detail, and Chapter 3 shows the experimental results.
LITERATURE REVIEW
Overview
Traditional Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs) cannot handle well the spatial structure of the joints and their temporal relationships. On the other hand, the Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Recurrent Neural Network (RNN) can take temporal information into account, but these methods cannot fully represent the overall structure of the skeletal data.
In [1], a graph convolutional network that captures spatio-temporal information was proposed for the first time; it can obtain both spatial and temporal information from the skeleton sequence. The authors used an adjacency matrix to directly convey the human skeleton structure in the form of joint connections.
However, the connections between the joints are fixed, leaving no flexibility to capture additional information about human skeletons. In [2], a two-stream input is used to overcome the shortcomings of the fixed adjacency matrix and to emphasize the important joints by combining a graph convolutional network (GCN) with a spatio-temporal-channel (STC) attention module and an adaptive module.
Moreover, in [3], the authors further enhance the 2s-AGCN by extracting richer information about human skeletons. Specifically, they propose MS-AAGCN, which uses four streams of information, perform extensive experiments on two large-scale datasets, NTU-RGBD [4] and Kinetics Skeleton [5], and achieve state-of-the-art performance on both datasets for skeleton-based action recognition.
In [6], Lie et al. provided rich (RICH) or higher-order (up to order-3) joint information in the spatial and temporal domains, respectively, and fed it into GCN networks in two ways: 1) early fusion and 2) late fusion. The experimental results show that with Rich information, their model boosts recognition accuracy by 2.55% (RAGCN, CS, LF) and 1.32% (MS-AAGCN, CS, LF), respectively, on the NTU RGB-D 60 dataset.
Early action recognition is closely related to traditional action recognition. The main challenge of early action recognition is the lack of information to discriminate the action, because an incomplete time sequence does not carry enough information. Therefore, the authors of [7] designed a framework in which knowledge is learned from full-length video sequences by a so-called teacher model. In this work, KD-AAGCN is also based on the teacher-student model to address the task of early action recognition.
Action Recognition
1.2.1 Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks
In this paper, the authors note that Yan et al. [1] were the first to use GCNs to model skeleton data for the skeleton-based action recognition task. Yan et al. build a spatial graph based on the natural connections of the joints in the human body, add temporal edges between corresponding joints in consecutive frames, and thus build the Spatial-Temporal Graph Convolutional Network (ST-GCN). The name of the model reflects the two dimensions that must be handled: spatial and temporal. However, there are three disadvantages to the ST-GCN model:
• The skeleton graph used in ST-GCN is predefined based on the natural connectivity of the human body. This means two joints are connected only when there is a bone between them, which is not always appropriate. For example, when clapping, the two hands have no physical connection, but the relationship between them is important for recognition. However, it is difficult for ST-GCN to capture the dependency between the two hands because they are located far apart in the human-body graph.
• The topology of the graph applied in ST-GCN is fixed over all the layers. This is not flexible enough to model the multi-level semantics in different layers.
• One fixed graph structure may not be optimal for all the samples of different action classes. For example, when touching the head, the connection between the hand and the head should be stronger, but this is not true for other classes such as "jumping up" and "sitting down". This suggests that the graph structure should depend on the data; however, ST-GCN does not support this.
To solve the above problems, they proposed a new adaptive graph convolutional layer.
They modified the graph convolution so that it uses two additional sub-graphs: $B_k$ and $C_k$. The first sub-graph $B_k$ is a global graph learned from the data. It is initialized with the adjacency matrix $A_k$; however, different from $A_k$, the elements of $B_k$ are updated together with the other parameters in the training process, which means $B_k$ can be learned through training. There are no constraints on the values of $B_k$, so the graph is learned entirely from the training data, and $B_k$ is unique for each layer.
The second sub-graph $C_k$ is the individual graph that learns a unique topology for each sample. To determine whether there is a connection between two vertexes and how strong that connection is, they use the normalized embedded Gaussian function to estimate the feature similarity of two vertexes:

$$f(v_i, v_j) = \frac{\exp\big(\theta(v_i)^{T}\phi(v_j)\big)}{\sum_{j=1}^{N}\exp\big(\theta(v_i)^{T}\phi(v_j)\big)}$$

where $N$ is the number of vertexes and $\theta$ and $\phi$ are two embedding functions. Based on this similarity, $C_k$ is calculated as

$$C_k = \operatorname{softmax}\big(f_{in}^{T} W_{\theta k}^{T} W_{\phi k} f_{in}\big)$$

where $W_{\theta k}$ and $W_{\phi k}$ are the parameters of the embedding functions $\theta$ and $\phi$, respectively.
They use a gating mechanism to adjust the importance of the individual graph for different layers. In detail, $C_k$ is multiplied by a parameter $\alpha$ that is unique for each layer and is learned in the training process.
• $A_k + \alpha B_k + \beta C_k$ is used as the adjacency matrix. $B_k$, $C_k$, $\alpha$, and $\beta$ are initialized to 0, so $A_k$ dominates the early stage of the training.
• Alternatively, $B_k$ is initialized with $A_k$ and the propagation of the gradient is blocked.
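As an illustration, a PyTorch-style sketch of this adaptive adjacency is given below; the tensor shapes, module names, and the single-subset simplification are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAdjacency(nn.Module):
    """Sketch of one subset k of the adaptive adjacency A_k + alpha*B_k + beta*C_k."""

    def __init__(self, in_channels, embed_channels, A_k):
        super().__init__()
        # Fixed physical-connection graph A_k (V x V), kept as a non-trainable buffer.
        self.register_buffer("A", A_k)
        # Global learned graph B_k; initialized to zero so that A_k dominates early training.
        self.B = nn.Parameter(torch.zeros_like(A_k))
        # Embedding functions theta and phi, implemented as 1x1 convolutions.
        self.theta = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
        # Per-layer gating parameters alpha and beta, also initialized to zero.
        self.alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (N, C, T, V) skeleton feature map.
        N, C, T, V = x.shape
        # Embed and flatten the temporal dimension before computing vertex similarities.
        q = self.theta(x).permute(0, 3, 1, 2).reshape(N, V, -1)   # (N, V, C'*T)
        k = self.phi(x).permute(0, 3, 1, 2).reshape(N, V, -1)     # (N, V, C'*T)
        # Normalized embedded Gaussian: softmax over the similarity of every vertex pair.
        C_k = F.softmax(torch.bmm(q, k.transpose(1, 2)), dim=-1)  # (N, V, V), per-sample graph
        # Data-driven adjacency used by the spatial graph convolution of this layer.
        return self.A + self.alpha * self.B + self.beta * C_k
```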
The overall architecture of the adaptive graph convolutional layer (AGCL):
Figure 1.1: Illustration of the adaptive graph convolutional layer (AGCL)
In Fig. 1.1, $K_v$ is set to 3; $W_k$ denotes the weights; $G$ denotes the gating mechanism; and res(1×1) denotes the residual connection. If the number of input channels differs from the number of output channels, a 1×1 convolution is inserted into the residual path to transform the input to match the output in the channel dimension.
The authors also propose an STC-attention module. It contains three sub-modules: the spatial, temporal, and channel attention modules.
Figure 1.2: Illustration of the STC-attention module
The three sub-modules are arranged sequentially in the order SAM, TAM, and CAM. ⊗ denotes element-wise multiplication and ⊕ denotes element-wise addition.
The spatial attention module (SAM) can be formulated as

$$M_s = \sigma\big(g_s(\operatorname{AvgPool}(f_{in}))\big)$$

where $f_{in}$ is the input feature map, $g_s$ is a 1-D convolutional operation with kernel size $K_s$ and weight $W_{g_s}$, and $\sigma$ is the sigmoid activation function. The temporal attention module (TAM) is computed analogously, with the symbols defined in the same way as in the above equation. The channel attention module (CAM) can be formulated as

$$M_c = \sigma\big(W_2\,\delta(W_1(\operatorname{AvgPool}(f_{in})))\big)$$

where $W_1$ and $W_2$ are the weights of two fully-connected layers and $\delta$ is the ReLU activation function.
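A compact PyTorch-style sketch of the three attention sub-modules, chained as in Fig. 1.2, is given below; the kernel sizes and the reduction ratio are assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class STCAttention(nn.Module):
    """Sketch of the SAM -> TAM -> CAM chain applied to a (N, C, T, V) feature map."""

    def __init__(self, channels, reduction=4, ks=9, kt=9):
        super().__init__()
        self.conv_s = nn.Conv1d(channels, 1, kernel_size=ks, padding=ks // 2)  # g_s
        self.conv_t = nn.Conv1d(channels, 1, kernel_size=kt, padding=kt // 2)  # g_t
        self.fc1 = nn.Linear(channels, channels // reduction)                  # W_1
        self.fc2 = nn.Linear(channels // reduction, channels)                  # W_2
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU()

    def forward(self, f):
        N, C, T, V = f.shape
        # Spatial attention: average over time, score each joint, multiply and add (residual).
        M_s = self.sigmoid(self.conv_s(f.mean(dim=2)))                          # (N, 1, V)
        f = f * M_s.unsqueeze(2) + f
        # Temporal attention: average over joints, score each frame.
        M_t = self.sigmoid(self.conv_t(f.mean(dim=3)))                          # (N, 1, T)
        f = f * M_t.unsqueeze(3) + f
        # Channel attention: squeeze-and-excitation style gating over channels.
        M_c = self.sigmoid(self.fc2(self.relu(self.fc1(f.mean(dim=(2, 3))))))   # (N, C)
        f = f * M_c.view(N, C, 1, 1) + f
        return f
```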
Figure 1.3: Illustration of the basic block
Both the spatial GCN and the temporal GCN are followed by a batch normalization (BN) layer and a ReLU layer. A basic block is the series of one spatial GCN (Convs), one STC-attention module (STC), and one temporal GCN (Convt). A residual connection is added for each basic block to stabilize the training and gradient propagation.
The overall architecture of the network is a stack of these basic blocks. There are a total of 9 blocks, and the numbers of output channels for each block are 64, 64, 64, 128, 128, 128, 256, 256, and 256. A BN layer is added at the beginning to normalize the input data, and a global average pooling (GAP) layer is used at the end. The final output is sent to a softmax classifier to obtain the prediction.
Figure 1.4: Illustration of the network architecture
There are a total of 9 basic blocks (B1–B9). The three numbers of each block represent the number of input channels, the number of output channels, and the stride, respectively. GAP represents the global average pooling layer.
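As a rough illustration, one basic block might be composed as in the following sketch; the spatial graph convolution is replaced by a 1×1-convolution stand-in, the STCAttention class from the earlier sketch is reused, and the temporal kernel size and stride handling are assumptions, not the authors' exact configuration.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Sketch of one basic block: spatial GCN -> STC attention -> temporal GCN, plus a residual path."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Stand-in for the adaptive spatial graph convolution (Convs); see the AGCL sketch above.
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )
        self.stc = STCAttention(out_channels)  # reuses the sketch from the previous listing
        # Temporal convolution (Convt) over the frame axis; the stride downsamples time.
        self.convt = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size=(9, 1),
                      stride=(stride, 1), padding=(4, 0)),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU()
        # Residual connection; a 1x1 convolution matches channels and stride when they differ.
        if in_channels == out_channels and stride == 1:
            self.residual = nn.Identity()
        else:
            self.residual = nn.Conv2d(in_channels, out_channels,
                                      kernel_size=1, stride=(stride, 1))

    def forward(self, x):
        # x: (N, C, T, V) feature map.
        return self.relu(self.convt(self.stc(self.convs(x))) + self.residual(x))
```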
The first-order information (the coordinates of the joints), the second-order information (the direction and length of the bones), and their motion information should all be investigated for the action recognition task.
In this paper, they model these four modalities in a multi-stream framework. In particular, they define the joint closer to the center of gravity as the root joint and the joint farther away as the target joint. Each bone is represented as a vector pointing from its root joint to its target joint. For example, if the root joint in frame $t$ is $v_{i,t} = (x_{i,t}, y_{i,t}, z_{i,t})$ and the target joint is $v_{j,t} = (x_{j,t}, y_{j,t}, z_{j,t})$, the bone vector is calculated as $e_{i,j,t} = (x_{j,t} - x_{i,t},\ y_{j,t} - y_{i,t},\ z_{j,t} - z_{i,t})$.
The motion information is calculated as the difference between the same joints, or the same bones, in two consecutive frames. For example, if the joint in frame $t$ is $v_{i,t} = (x_{i,t}, y_{i,t}, z_{i,t})$ and in frame $t+1$ it is $v_{i,t+1} = (x_{i,t+1}, y_{i,t+1}, z_{i,t+1})$, the motion information is $m_{i,t,t+1} = (x_{i,t+1} - x_{i,t},\ y_{i,t+1} - y_{i,t},\ z_{i,t+1} - z_{i,t})$.
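For instance, given a joint sequence, the bone and motion modalities could be derived as in the following sketch; the array layout and the `parents` array that maps each joint to its root joint are assumptions made for illustration.

```python
import numpy as np

def bones_and_motion(joints: np.ndarray, parents: np.ndarray):
    """Derive the bone and motion modalities from a joint sequence.

    joints:  (T, V, 3) array with T frames, V joints, (x, y, z) coordinates.
    parents: (V,) array; parents[v] is the root joint of joint v.
    """
    # Bone vector: target joint minus its root joint, per frame.
    bones = joints - joints[:, parents, :]
    # Motion: difference of the same joint (or bone) between consecutive frames;
    # the last frame is zero-padded to keep the sequence length.
    joint_motion = np.zeros_like(joints)
    joint_motion[:-1] = joints[1:] - joints[:-1]
    bone_motion = np.zeros_like(bones)
    bone_motion[:-1] = bones[1:] - bones[:-1]
    return bones, joint_motion, bone_motion
```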
The overall architecture (MS-AAGCN) is shown in Figure 1.5:
Figure 1.5: Illustration of the overall architecture of the MS-AAGCN
The four modalities (joints, bones, and their respective motions) are fed into four streams. Finally, the softmax scores of the four streams are fused to obtain the action scores and predict the action label.
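For illustration, the score fusion at test time might look like the following sketch, where the per-stream softmax scores and the equal stream weights are assumptions.

```python
import numpy as np

def fuse_streams(joint, bone, joint_motion, bone_motion, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four streams' softmax scores; argmax gives the predicted labels.

    Each argument is a (num_samples, num_classes) array of softmax scores.
    """
    fused = sum(w * s for w, s in zip(weights, (joint, bone, joint_motion, bone_motion)))
    return fused.argmax(axis=1)
```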
1.2.2 An Effective Pre-processing of High-order Skeletal Joint Feature Extraction to Enhance Graph-Convolution-Network-Based Action Recognition
Early Action Recognition
1.3.1 Adaptive Graph Convolutional Network With Adversarial Learning for Skeleton-Based Action Prediction
In this paper, the authors propose an AGCN-AL for action prediction based on skeleton data, as shown in the following figure:
Figure 1.14: Overall structure of the AGCN-AL
Here $s_p$ denotes the partial sequences, $s_f$ denotes the full sequences, and $h$ denotes the hidden features of the action sequences.
The AGCN-AL consists of a feature extraction network, a discriminator, and a classifier. The network takes a full sequence or a partial sequence as input. The feature extraction network extracts spatial and temporal features from the input sequence and feeds them to the discriminator, which is used to discriminate whether the input is a full sequence or only a partial one. The classifier is used to determine the action class and to maximize the confusion of the discriminator.
To extract features from the sequences, they adopt the AGC block proposed in [2] and change the position of the residual connection, as shown in the figure below.
Figure 1.15: Details of the AGC block
The Convs and the Convt in this block refer to the spatial GCN and the temporal GCN, respectively. The drop rate of the dropout layer in the block is set to 0.5.
Their feature extraction network is similar to the J-stream of the 2s-AGCN. Specifically, the feature extraction network contains a batch normalization (BN) layer, ten AGC blocks, and a global average pooling (GAP) layer. The numbers of output channels for each AGC block are 64, 64, 64, 64, 128, 128, 128, 256, 256, and 256, as shown in Figure 1.16.
Figure 1.16: Illustration of the feature extraction network
There are a BN layer, ten AGC blocks (B1–B10), and a GAP layer. The three numbers in parentheses indicate the number of input channels, the number of output channels, and the stride, respectively.
They divided the training of the AGCN-AL into two steps: 1) train the discriminator while freezing the parameters of the classifier and the feature extraction network, and 2) train the feature extraction network and the classifier while freezing the parameters of the discriminator. These two training steps are carried out iteratively; performing the first step and the second step once each constitutes one complete training iteration.
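A schematic sketch of this alternating scheme is given below; the module interfaces, optimizers, and loss functions are assumptions, intended only to illustrate the two-step procedure.

```python
import torch

def train_one_iteration(extractor, classifier, discriminator,
                        opt_disc, opt_feat_cls, s_partial, s_full, labels,
                        bce, ce):
    """One complete training iteration of the alternating two-step scheme."""
    real = lambda h: torch.ones(h.size(0), 1)    # target: "full sequence"
    fake = lambda h: torch.zeros(h.size(0), 1)   # target: "partial sequence"

    # Step 1: update the discriminator; the extractor and classifier stay frozen.
    with torch.no_grad():
        h_full, h_partial = extractor(s_full), extractor(s_partial)
    d_loss = bce(discriminator(h_full), real(h_full)) + \
             bce(discriminator(h_partial), fake(h_partial))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # Step 2: update the extractor and classifier; the discriminator stays frozen
    # (its optimizer is simply not stepped here).
    h_partial = extractor(s_partial)
    # Classification loss plus an adversarial term that pushes partial-sequence
    # features to be judged as "full", i.e. maximizing the discriminator's confusion.
    g_loss = ce(classifier(h_partial), labels) + \
             bce(discriminator(h_partial), real(h_partial))
    opt_feat_cls.zero_grad()
    g_loss.backward()
    opt_feat_cls.step()
```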
The goal of the AGCN-AL is to learn features for the partial sequence that are close to the features of the corresponding full sequence. The authors note that some partial sequences may already contain discriminative sub-actions that can be used for action prediction, which they model with a local AGCN. They therefore combine the AGCN-AL with the local AGCN to form a two-stream architecture (Local+AGCN-AL) to further improve the performance of action prediction, as shown in the figure below.
Figure 1.17: Overall architecture of the Local+AGCN-AL
The AGCN-AL and the local AGCN are trained separately. When the Local+AGCN-AL is tested, the scores of the AGCN-AL and the local AGCN are averaged.
1.3.2 Progressive Teacher-student Learning for Early Action Prediction
Figure 1.18: The overall framework of the progressive teacher-student learning for early action prediction
They employed a standard one-layer long short-term memory (LSTM) [9] architecture as the student prediction model to predict the action at any progress level, and a one-layer bidirectional LSTM (BiLSTM) [10] as the teacher model.
The BiLSTM often contains more discriminative action information than a single LSTM. However, the BiLSTM is not applicable to the student model because the input sequence of the student model is incomplete. Even though the teacher and student models are different, the authors demonstrate that the BiLSTM can still be used as a teacher model to guide the student's learning for early action recognition.
Let $S_i$ and $T_i$ denote the latent feature representations of the student and teacher models over all progress levels for the $i$-th video sample, respectively. The task of early action recognition can then be achieved by minimizing

$$L = \sum_{i} \big( L_{TS}(T_i, S_i) + L_C(S_i, y_i) \big)$$

where $L_{TS}$ is the knowledge distillation (KD) loss, $L_C$ is the prediction loss of the student model, and $y_i$ indicates the ground-truth action label of the $i$-th video sample.
PROPOSED METHOD
Overview of the Architecture
Figure 2.19: VA + Rich + KD-AAGCN architecture
The proposed method is illustrated in the figure above and consists of a teacher network, a student network, and a knowledge distillation module. The task of the teacher network is global action recognition (observation ratio = 100%) with a complete-time-series skeleton input, whereas the task of the student network is action prediction with an incomplete-time-series skeleton input. However, the lack of complete information in the student network leads to unsatisfactory results when predicting early action sequences.
Therefore, this thesis uses a knowledge distillation module to transfer the complete information of the teacher network to the student network, so that the student network can improve after being taught by the teacher. First, the teacher is pre-trained; then the complete knowledge is transferred to the student, and the student model is trained with the assistance of the knowledge distillation module. The training phase is divided into two stages: in the first stage, only the teacher model is trained and the student model is not involved; after that, the student model is trained with assistance from the teacher.
The task of the teacher model is traditional action recognition, and the task of the student model is similar. However, there are two differences between the student and the teacher: the first is that the input skeleton sequence fed into the student is incomplete, with an observation ratio of 20%, 40%, 60%, 80%, or 100%. The next chapter of this thesis explains how the incomplete data are generated.
Following [6], the View Adaptive (VA) [8] and the Rich information are used to enrich the input for the GCN, in this case the AAGCN. Both the teacher and the student use five kinds of skeleton information as input. After feature extraction, these features are, on the one hand, concatenated in a fully connected layer for recognition in the teacher model and, on the other hand, compared with the teacher features in the student model, after which the prediction task is carried out. This comparison is the knowledge distillation, and the Mean Square Error (MSE) is used in this module.
Loss Design
The categorical cross-entropy is used as the loss function of both the teacher and the student networks. In the knowledge distillation module, $L_{KD}$ denotes the MSE loss used for knowledge distillation in this thesis. The hyperparameter $\beta$ in the following formula can be adjusted before training the model, and $L_{total}$ is only used when training the student model:

$$L_{total} = L_{cross\text{-}entropy} + \beta \, L_{KD}$$
The task of the cross-entropy loss is to compute the error between the true and the predicted probabilities. In detail, $L_{cross\text{-}entropy}$ is described as

$$L_{cross\text{-}entropy} = -\sum_{i=1}^{c} y_i \log \hat{y}_i$$

where $\hat{y}_i$ is the predicted probability for each action category, obtained by applying the softmax layer to the output $x_i$ of the network, $c$ denotes the number of action classes, and $y_i$ is the ground-truth probability of class $i$.
As shown in the following figure, both networks are trained with ten AAGCN layers and a set of FC modules, which are considered as 11 layers in total, so the features of each layer's output need to be compared.
Figure 2.20: Detail of the knowledge distillation of the teacher-student model
As shown in the following equation, $i \in \{1, 2, \ldots, 11\}$ indexes the features of the different layers, $S = \{S_1, S_2, \ldots, S_{11}\}$ represents the student network features, and $T = \{T_1, T_2, \ldots, T_{11}\}$ represents the teacher network features:

$$L_{KD} = \sum_{i=1}^{11} \operatorname{MSE}(S_i, T_i)$$
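A minimal sketch of this loss computation is shown below, assuming the student and the teacher each expose a list of their 11 intermediate feature maps with matching shapes.

```python
import torch
import torch.nn.functional as F

def kd_total_loss(student_logits, labels, student_feats, teacher_feats, beta):
    """L_total = L_cross_entropy + beta * L_KD, with L_KD an MSE over the 11 layer outputs.

    student_feats / teacher_feats: lists [S_1, ..., S_11] and [T_1, ..., T_11] of
    feature tensors with matching shapes; the teacher features are detached so that
    only the student is updated by the gradient.
    """
    ce_loss = F.cross_entropy(student_logits, labels)
    kd_loss = sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))
    return ce_loss + beta * kd_loss
```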
EXPERIMENTS AND RESULTS
Data
This thesis uses the NTU RGB+D 60 skeleton dataset provided in [4] for the experiments. The dataset contains 56,880 samples of human body movements, with a total of 60 classes. The data formats of each sample are RGB image sequences, depth map sequences, 3D human skeleton sequences, and infrared image sequences. Within the scope of this thesis, only the 3D human skeleton sequences are used, which contain 25 joints per human body. The original paper [4] recommends two benchmarks: 1) Cross-Subject (X-Sub), in which the dataset is divided into a training set (40,320 videos) and a validation set (16,560 videos) with different actors in the two subsets, and 2) Cross-View (X-View), in which the training set contains 37,920 videos captured by cameras 2 and 3 and the validation set contains 18,960 videos captured by camera 1. This thesis focuses on X-Sub because it is more challenging and also closer to reality, since the actors in the training phase are different from those in the validation phase.
Cross-Subject (CS): the test set and the training set are partitioned according to the subject IDs; the subject IDs of the training set are 1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27, 28, 31, 34, 35, and 38, and the samples of the remaining subjects form the test set.
In [2], the authors preprocess the NTU-RGBD dataset as follows. Most samples in the dataset contain two people; if the number of bodies in a sample is less than two, the second body is padded with zeros. The maximum number of frames in each sample is 300; samples with fewer than 300 frames are repeated until they reach 300 frames. This thesis follows their configuration: the learning rate is set to 0.1 and is divided by 10 at the 30th and 40th epochs, and the training process ends at the 50th epoch.
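A sketch of this padding step, under the assumption of a (C, T, V, M) array layout, might look like the following:

```python
import numpy as np

MAX_FRAMES, MAX_BODIES = 300, 2

def pad_sample(sample: np.ndarray) -> np.ndarray:
    """Pad a (C, T, V, M) skeleton sample to 300 frames and 2 bodies.

    A missing second body is zero-padded; short sequences are repeated
    along the time axis until they reach 300 frames.
    """
    C, T, V, M = sample.shape
    out = np.zeros((C, MAX_FRAMES, V, MAX_BODIES), dtype=sample.dtype)
    reps = int(np.ceil(MAX_FRAMES / T))                    # number of copies needed
    tiled = np.tile(sample, (1, reps, 1, 1))[:, :MAX_FRAMES]
    m = min(M, MAX_BODIES)
    out[..., :m] = tiled[..., :m]
    return out
```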
The incomplete data: in [7], each training video is considered to contain a complete action execution and is uniformly partitioned into $N$ shorter segments. The first $n$ segments ($n = 1, 2, \ldots, N$) form a partial video with a progress level of $n$, whose observation ratio is defined as $n/N$. In the complete dataset, samples with fewer than 300 frames are repeated, so two kinds of incomplete data could be used as input for the student model. 1) No repeating in the incomplete data: for example, if a certain action has 100 frames, only the first 20 frames are kept (for OR = 20%) and the rest are set to 0, i.e., no frames of this action are repeated. 2) Repeating the incomplete data: taking the previous example, the first 20 frames are kept and, because the 100-frame pattern (20 observed frames followed by 80 zero frames) is repeated two more times to fill 300 frames, the total number of observed frames in this case is 60, divided into three segments. In this thesis, the first kind of incomplete data is used.
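The first kind of incomplete data could be generated as in the following sketch; the tensor layout and the `num_valid` bookkeeping are assumptions made for illustration.

```python
import numpy as np

def make_partial(sample: np.ndarray, num_valid: int, observation_ratio: float) -> np.ndarray:
    """Keep only the observed prefix of the action and zero out the rest.

    sample: (C, T, V, M) skeleton tensor padded to T = 300 frames.
    num_valid: number of real (non-padded) frames of this action.
    observation_ratio: e.g. 0.2 keeps the first 20% of the valid frames.
    """
    partial = sample.copy()
    keep = int(round(num_valid * observation_ratio))
    partial[:, keep:, :, :] = 0.0  # everything after the observed prefix is set to zero
    return partial

# Example: an action with 100 valid frames and OR = 20% keeps only its first 20 frames.
```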
Experimental Results
The results of the experiments with the proposed framework for early action recognition are described in this chapter, including tests of the student model with incomplete time series (OR = 20%) and of the teacher model with complete time series.
3.2.1 The teacher model with complete data
Table 3.1: Methodology of training teacher model and testing recognition rate results
In [3], MS-AAGCN obtained 90% on the CS benchmark. Lie et al. modified MS-AAGCN into EF-AAGCN, i.e., MS-AAGCN with Early Fusion, and obtained 87.78% [6]. In this thesis, even with only one stream (the J-stream), slightly better performance can be achieved when combining VA and Rich information, while far less time is spent on training.
In this work, the VA+Rich-1s-AAGCN of [6] is used as the teacher model, with five kinds of skeleton information as input: $S_1$, $S_2$, $S_3$, $T_2$, and $T_3$.
3.2.2 The student without KD and KD-AAGCN
This section describes the methods used to train the student model without KD (OR = 20%) as well as the KD-AAGCN.
Downsampling of a data sequence refers to reducing the sampling rate by an integer factor. This saves a great deal of time in the training phase, but the model still has to be tested on the full validation dataset to make a fair comparison. In downsampling, the samples taken for training are randomized, which means the subset used for training differs from one epoch to the next. This keeps the sampling fair for every training sample and allows the model to learn better.
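A sketch of this per-epoch random downsampling, assuming a PyTorch `DataLoader` with a `SubsetRandomSampler` and a hypothetical rate `K`, might look like this:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

def downsampled_loader(dataset, K: int, batch_size: int) -> DataLoader:
    """Draw a fresh random 1/K subset of the training set.

    Calling this once per epoch means different samples are seen in
    consecutive epochs, keeping the sampling fair over the whole set.
    """
    num_kept = len(dataset) // K
    indices = torch.randperm(len(dataset))[:num_kept]
    return DataLoader(dataset, batch_size=batch_size,
                      sampler=SubsetRandomSampler(indices))
```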
Let $K$ denote the downsampling rate; the results are shown in the following table:
Table 3.2: Comparison of the recognition rate of different downsampling rates K
Method Observation Ratio (OR = 20%) / CS(%)
The downsampling ablation experiments show that downsampling can reduce the performance of the model, but it also reduces the training time. The improvement in training time for different downsampling rates is shown in the following table:
Table 3.3: Comparison of the training time of different downsampling rates K
VA + Rich + 1s-AAGCN, K=1 50 min/epoch
VA + Rich + 1s-AAGCN, K=5 13 min/epoch
Using downsampling in the KD-AAGCN, the results shown in the following table are obtained:
Table 3.4: Comparison of the student without KD and KD-AAGCN
Method Observation Ratio (OR = 20%) / CS(%)
As shown in Table 3.4, VA+Rich+1s-AAGCN with K=5 obtains 27.84%, whereas VA+Rich+KD-AAGCN with K=5 obtains 29.99%. Using KD improves the student's recognition accuracy; in this case, the improvement is 2.15%.
In this section, the results of this thesis are compared with the experimental results from the related research literature. The experimental results are consolidated in Table 3.5, which shows the comparison of the methods.
Table 3.5: Comparison of the proposed method with the related research
Method Observation Ratio (OR = 20%) / CS(%)
As shown in Table 3.5, the result of the experiment conducted with the proposed method is in bold. Even though downsampling with a rate of K=5 is used, the proposed method is more accurate than the one proposed in [12], but it cannot surpass the results from [13] and [14]. However, the student without KD at K=1 reaches 34.97%, which is expected to surpass all of the above results not only at OR = 20% but also at 40%, 60%, 80%, and 100% once KD is applied.
There are not many studies on early action recognition with human-skeleton-based input, and it is necessary to select literature that uses the same dataset and evaluation metrics for comparison. In earlier studies, most papers used RGB image sequences as network input to train their models, while in recent years the human skeleton has been used as input. Therefore, little research has yet been conducted on early action recognition based on skeleton sequence input.
A comparison of the proposed method with other papers is shown in Table 3.5. The selection criteria for the experimental results and methods are: (1) the study must address early action recognition with human skeletal sequence input, (2) the NTU RGB+D 60 dataset must be used for training the model, and (3) the Cross-Subject (CS) evaluation protocol must be used for the experiments. Based on these three principles, the comparison of methods was conducted.
This work proposes a new architecture to transfer knowledge from a teacher to a student. At this point, downsampling is only used to save time, and more tests can be carried out to find the best value of β for the best performance. Even so, the KD module has proven its effectiveness by increasing the accuracy compared to the student model without KD.
The future work will be continued under the supervision of Prof. Lie and my supervisor, Dr. Bui.
[1] Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).
[2] Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12026-12035).
[3] Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2020). Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing, 29, 9532-9545.
[4] Shahroudy, A., Liu, J., Ng, T. T., & Wang, G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1010-1019).
[5] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., ... & Zisserman, A. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
[6] Lie, W. N., Huang, Y. J., Chiang, J. C., & Fang, Z. Y. (2021). High-order joint information input for graph convolutional network based action recognition. In 2021 IEEE International Conference on Image Processing (ICIP) (pp. 1064-1068). IEEE.
[7] Wang, X., Hu, J. F., Lai, J. H., Zhang, J., & Zheng, W. S. (2019). Progressive teacher-student learning for early action prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3556-3565).
[8] Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., & Zheng, N. (2019). View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 1963-1978.
[9] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.