HUMAN ACTION RECOGNITION APPROACHES
Overview
The recognition and analysis of human actions have garnered significant interest in the field of computer vision over the past three decades, driving research aimed at addressing challenges in intelligent monitoring, video retrieval, video analysis, and human-machine interaction. However, recent studies have underscored the complexities involved, such as variability in individual actions, differences in movement and attire, camera angles, lighting conditions, occlusions caused by environmental objects or body parts, and background clutter. These numerous influencing factors often constrain current methods, which tend to operate effectively only in simplified scenarios with basic backgrounds, limited action classes, stationary cameras, or restricted viewing-angle variations.
Human action recognition has evolved through various approaches, primarily categorized by the visual information utilized. Single-view methods rely on a single camera to capture actions, but their effectiveness diminishes when actions are viewed from different angles, as they typically assume consistent viewing angles in both training and testing data. To enhance accuracy, multi-view methods increase the number of cameras and thus provide a richer visual dataset. Although the multi-view approach has only gained traction in the last decade due to past technological limitations, recent advancements have enabled its application across diverse contexts, significantly improving human action recognition capabilities.
Action recognition methods can be categorized into two main approaches: traditional methods that rely on manually designed features and neural network-based methods. While neural networks require large training datasets to be effective, practical applications often involve medium to small datasets. Consequently, this study focuses on the traditional approach, which utilizes hand-crafted features to construct action representations from either 2D or 3D data.
The current trend in 3D methods involves integrating visual information from multiple angles to create a 3D model representing actions. This process typically combines 2D human body poses, using binary silhouettes to identify the video-frame pixels associated with the human body in each camera. Once a 3D representation of the human body is established, actions are described as sequences of successive 3D poses. Representations used in these 3D methods include visual hulls, motion history volumes, optical flow of the human body, Gaussian blobs, and cylindrical or ellipsoid body models.
Figure 1.1. a) Human body in a frame, b) binary silhouettes, c) 3D human pose (visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human body model, g) cylindrical/ellipsoid human body model [1]
While many approaches utilizing human body shape and 3D motion have shown success, they often require the human body to be visible in all cameras of a fixed system during both training and testing. This assumption limits practical applications, as individuals may not always be within a camera's field of view or may be obscured by other objects. Insufficient information from any camera can lead to inaccurate 3D representations and erroneous predictions. In contrast, multi-view 2D methods effectively address this limitation by focusing on features that are invariant across viewing angles, or by integrating the predicted results from each view for action classification. Consequently, the absence of data from a particular view does not compromise the overall accuracy of the results. These multi-view 2D techniques can be further divided into two main approaches: view-invariant feature representation and combination of information from different views.
The first strategy for action recognition depicts actions through view-invariant features. Each video captured by an independent camera is analyzed to identify actions. These methods focus on representing actions with invariant features, which subsequently inform the classification of the action type.
I. N. Junejo et al. [6] proposed a view-invariant approach that assesses the similarity of a series of images over time, ensuring model stability across multiple viewing angles. This method constructs a descriptive vector that encapsulates the structural characteristics of similarity and temporal differences in action sequences. First, the authors compute the differences between consecutive frames in each video, leading to a self-similarity matrix (SSM-pos for 3D data and SSM-HOG-HOF for 2D data). They then extract local SSM descriptors from this matrix and feed them into the K-means clustering algorithm, where clusters act as words of a dictionary in the Bag-of-Words (BoW) approach. Finally, an SVM classifier with a chi-square kernel is employed in a one-vs-all strategy. This method offers significant advantages, including high stability against varying viewing angles and resilience to differences in action execution.
Figure 1.2. Construction of the HOG-HOF descriptive vector based on the SSM matrix [6]
Anwaar-ul-Haq et al. [9] introduced a view-invariant method utilizing dense optical flow and epipolar geometry for action recognition. They developed a similarity score for action matching that leverages the characteristics of the segmentation matrix or the two-body fundamental matrix. This approach enables a view-invariant action matching framework without requiring any preprocessing of the original video sequences.
The paper [9] illustrates original videos of walking actions, showing the viewpoints, volumes, and silhouettes of the actors. Epipolar geometry is exploited in two settings: first, with extracted body silhouettes of the actors, and second, in dynamic scenes featuring active actors against a static background without silhouette extraction. In both cases, combining information from multiple viewpoints enhances the analysis.
The second approach combines information from different views.
Unlike view-invariant methods, this approach exploits the fact that different views provide varying amounts of information that can complement each other. For instance, certain body parts may be occluded in one view but visible in another, allowing for a more comprehensive understanding of actions. The idea is akin to creating a 3D representation by integrating various perspectives, yet it can also be applied in 2D by merging features from different views before classification or by combining classification results afterward. This strategy has the potential to enhance classification accuracy. Key challenges include (1) determining the most effective features to represent actions and (2) optimizing the integration of information across views for improved predictive outcomes.
Action representations can be categorized into global and local features. Global representations capture the overall structure, shape, and movement of the human body, with Motion History Images (MHI) and Motion Energy Images (MEI) serving as key examples. These methods consolidate information about human motion and shape into a single "template" image, allowing for efficient analysis. While global representations for action recognition were extensively researched from 1997 to 2007, focusing on preserving spatial and temporal structures, contemporary studies place a growing emphasis on local representations.
Figure 1.4. MHI (middle row) and MEI (last row) templates [15]
Local action representation in video analysis involves detecting points of interest, extracting local features, and aggregating them into action representation vectors. Various point detection methods, such as Harris3D, Cuboid, Hessian3D, and dense sampling, are used to identify feature points in videos. Once detected, a descriptor is computed for each feature point based on the surrounding image intensity in three dimensions. Common descriptor types include Cuboid, HOG/HOF, HOG3D, and ESURF. These local descriptors are then used to train a Bag-of-Words (BoW) model, which generates a descriptive vector for each video. Finally, these vector representations are classified to label actions. Research shows that the effectiveness of detector-descriptor combinations varies across datasets; for instance, the KTH dataset achieved optimal accuracy with the Harris3D detector and the HOF descriptor, while the UCF dataset performed best with dense sampling and the HOG3D descriptor. Therefore, the ideal combination of detection method and descriptor is context-dependent.
Regarding problem (2), two prevalent methods for combining information from different views are early and late fusion. Early fusion concatenates the feature descriptors from the various views into a single vector before classification, while late fusion trains an individual classifier for each view and combines their outputs to obtain the final result. Research by G. Burghouts et al. demonstrated that late fusion, using STIP features and a BoW model, achieved the highest accuracy on the IXMAS dataset. Similarly, R. Kavi et al. extracted features from LMEI and employed an LDA classifier for each view, combining their outputs for the final prediction. Their work also used LSTM ConvNets with both early and late fusion strategies, revealing that late fusion consistently outperformed early fusion in recognition accuracy. This discrepancy can be attributed to occlusions affecting feature extraction and to the lack of correlation in action representation caused by varying camera perspectives. While late fusion is not immune to these issues, it allows for more accurate predictions by leveraging the strongest view. To remain effective despite positional and directional differences between training and assessment data, R. Kavi et al. also proposed a circular view-shifting approach during evaluation.
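As a minimal illustration of the two strategies (not the exact pipelines of the cited works), the Python sketch below contrasts early and late fusion; `histograms` and `per_view_probabilities` are hypothetical inputs holding one BoW histogram, respectively one class-probability vector, per view.

```python
import numpy as np

def early_fusion(histograms):
    """Early fusion: concatenate the per-view BoW histograms into one
    long descriptor before training/applying a single classifier."""
    return np.concatenate(histograms)

def late_fusion(per_view_probabilities):
    """Late fusion: average the class-probability vectors produced by
    one classifier per view, then pick the most likely class."""
    avg = np.mean(per_view_probabilities, axis=0)   # shape: (n_classes,)
    return int(np.argmax(avg)), avg

# hypothetical example with 2 views and 3 action classes
probs = [np.array([0.7, 0.2, 0.1]),   # view 1
         np.array([0.3, 0.4, 0.3])]   # view 2
label, avg = late_fusion(probs)       # label = 0, avg = [0.5, 0.3, 0.2]
```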
Section 1.2 introduces the baseline method that we build on; an improved framework is then proposed in Chapter 2.
Baseline method: combination of multiple 2D views in the Bag-of-Words model
G. Burghouts et al. [11] proposed a BoW pipeline built on STIP features.
The study utilized the Harris3D detector and HOG/HOF descriptors to extract features from video, which were then transformed into histograms using a random forest model to serve as video descriptors. An SVM classifier was employed to predict the action classes. The authors explored various strategies for integrating information from multiple views, including early fusion of features, intermediate fusion of video descriptors, and late fusion of posterior probabilities. Results indicated that averaging the prediction probabilities from all views achieved the highest accuracy on the IXMAS dataset.
Figure 1.5. Illustration of spatio-temporal interest points detected in a video of a person clapping [16]
Figure 1.6. Three ways to combine information from multiple 2D views in the BoW pipeline
To recognize human actions from the individual views, we begin by extracting STIP features from the video. For each detected local keypoint, we compute histograms of oriented gradients (HOG) and histograms of optical flow (HOF) within spatio-temporal blocks to capture both shape and motion information. By concatenating the HOG and HOF histograms, we obtain a descriptor of 162 values.
In the second step, a random forest model is employed to train the codebook for each action; it is chosen for its ability to generate a more discriminative codebook and its faster processing speed compared to K-means clustering. The training data consist of positive features from a specific action class alongside negative features sourced from the other classes, with two methods available for gathering these negative features (selective or random sampling). The result is a set of codebooks:
$D = \{ D_{a,v} \mid a = 1, \dots, A;\; v = 1, \dots, V \}$   (1.1)

where $D_{a,v}$ is the codebook for action $a$ in view $v$, $A$ is the number of action classes and $V$ is the number of views.
To quantize the STIP features of a video, we convert them into a histogram. By passing the STIP features through the trained forests, we obtain a 320-bin normalized histogram that summarizes the video content. This histogram serves as the descriptor for training one binary classifier per codebook:
$C = \{ C_{a,v} \mid a = 1, \dots, A;\; v = 1, \dots, V \}$   (1.2)

where $C_{a,v}$ is the classifier for action $a$ in view $v$.
The Support Vector Machine (SVM) is trained with a chi-square (χ²) kernel and C = 1 to produce posterior probabilities. In binary classification, these probabilities are calibrated through Platt scaling, which applies logistic regression to the SVM scores and fits the model using an additional cross-validation on the training data. Consequently, for each test sample, a corresponding set of probabilities is generated:
$P = \{ p_{a,v} \mid a = 1, \dots, A;\; v = 1, \dots, V \}$   (1.3)

where $p_{a,v}$ is the probability of action $a$ in view $v$.
Then, the posterior probabilities from all views are combined by taking their average:
$p_a = \frac{1}{V} \sum_{v=1}^{V} p_{a,v}$   (1.4)

where $p_a$ is the probability of the test sample belonging to action $a$. The label is assigned to the class having the highest probability:

$\hat{a} = \arg\max_{a} p_a$
This method demonstrates strong performance in multi-view human action recognition, achieving 96.4% accuracy on the IXMAS dataset when selective negative samples are used for the random forest, and 88% accuracy with random negative samples. However, while the local STIP descriptor captures the shape and motion information of keypoints, it lacks their location information, both spatial and temporal. Additionally, the Bag-of-Words (BoW) model reflects the distribution of visual words in a video but ignores their order of appearance and the spatial correlations between them. This can result in confusion between actions that share similar local information but differ in the relative positions of body parts (e.g., arms versus legs) or in their temporal order. To address these limitations, several methods have been proposed to incorporate the spatial and temporal information of local features into the BoW model, leading to improved performance in single-view human action recognition. Parul Shukla et al. [24] segment videos along the time domain, while M. Ullah et al. [25] utilize action detectors to divide the spatial domain into smaller segments. The common goal is to create a final descriptor that merges information from these smaller spatial and temporal components, enhancing overall action recognition accuracy.
Based on the ideas mentioned above, we propose a framework for human action recognition using multiple views, which is described in detail in the next chapter.
PROPOSED FRAMEWORK
General framework
We propose a framework for human action recognition using multi-view cameras, as depicted in Fig. 2.1. The process begins with extracting STIP features from a video sequence of an action, followed by the application of a Bag-of-Words (BoW) model to represent the actions. To better distinguish subtle motion activities, we introduce a spatial-temporal pooling block that combines spatial and temporal information with the BoW model for improved action-class representation. A simple background subtraction algorithm is used to obtain the spatial information of the local features. Additionally, to address orientation differences between training and assessment data, we apply view shifting during the testing phase. The main contributions of this framework are elaborated in the subsequent sections.
Combination of spatial/temporal information and Bag-of-Words model
STIP features are extracted from a sequence of original images by computing histograms of spatial gradients and optical flow within space-time neighborhoods of detected interest points. To minimize the influence of background movement, STIP features are only kept inside moving regions of interest (ROIs), which are determined using a conventional background subtraction technique. Given the significant inter-class similarity among actions, where some actions differ only in minor movements of body parts such as the hands or feet, we enhance the Bag-of-Words (BoW) model by dividing the ROI into segments based on the human structure in each frame.
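The thesis does not reproduce the background-subtraction code; the sketch below shows one simple way to obtain the moving ROI per frame with OpenCV, assuming a static camera, the MOG2 subtractor as a stand-in for the "conventional" method, and OpenCV 4's findContours signature.

```python
import cv2

def moving_roi_per_frame(video_path):
    """Yield (frame_index, (x, y, w, h)) of the largest moving region,
    using OpenCV's MOG2 background subtractor as a simple stand-in
    for a conventional background-subtraction method."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        mask = cv2.medianBlur(mask, 5)                        # suppress speckle noise
        _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if contours:
            largest = max(contours, key=cv2.contourArea)
            yield idx, cv2.boundingRect(largest)              # ROI of the subject
        idx += 1
    cap.release()
```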
In the experimental section, we explore various ways of dividing the bounding box to assess their effectiveness. Specifically, we segment the bounding box into three or four spatial parts based solely on its height, as illustrated in Fig. 2.2e, and into six spatial parts using the coordinates of the centroid, as shown in Fig. 2.2f. For each spatial part, we compute the histogram of the features falling within it and then concatenate these histograms to form the final feature vector (S-BoW).
Figure 2.2. Dividing the spatial domain based on the bounding box and the centroid
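A minimal sketch of the S-BoW pooling for the height-only split, assuming each STIP point has already been assigned a visual-word index and that per-part histograms are L1-normalized before concatenation (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def spatial_bin_index(y, box_top, box_height, n_parts):
    """Assign a STIP point to one of n_parts horizontal slices of the
    subject's bounding box, based only on its vertical position."""
    rel = (y - box_top) / float(box_height)          # 0 at the top, 1 at the bottom
    return min(int(rel * n_parts), n_parts - 1)

def s_bow_descriptor(word_ids, bin_indices, n_parts, n_words=320):
    """Concatenate one normalized n_words-bin histogram per spatial part.

    word_ids    : (n_points,) visual-word index of each STIP feature
    bin_indices : (n_points,) spatial part of each STIP feature
    """
    parts = []
    for p in range(n_parts):
        words = word_ids[bin_indices == p]
        hist = np.bincount(words, minlength=n_words).astype(float)
        if hist.sum() > 0:
            hist /= hist.sum()                       # L1-normalize each part
        parts.append(hist)
    return np.concatenate(parts)                     # length = n_parts * n_words
```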
2.2.2 Combination of temporal information and Bag-of-Words model (T-BoW)
Similar to the spatial domain, we also split the video into smaller temporal parts. We divide the video into two or three segments, depending on its length. For each segment, we create a histogram vector that captures the features appearing during that specific time interval. The final descriptive vector is formed by concatenating the histogram vectors of all segments.
Figure 2.3. Illustration of the T-BoW model
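Analogously, a minimal T-BoW sketch under the same assumptions (visual-word indices already computed; equal-length temporal segments; illustrative names):

```python
import numpy as np

def t_bow_descriptor(word_ids, frame_ids, n_frames, n_segments, n_words=320):
    """Split the video into n_segments equal temporal parts and concatenate
    one normalized visual-word histogram per part (T-BoW).

    word_ids  : (n_points,) visual-word index of each STIP feature
    frame_ids : (n_points,) frame index (time) of each STIP feature
    """
    seg_len = n_frames / float(n_segments)
    parts = []
    for s in range(n_segments):
        in_seg = (frame_ids >= s * seg_len) & (frame_ids < (s + 1) * seg_len)
        hist = np.bincount(word_ids[in_seg], minlength=n_words).astype(float)
        if hist.sum() > 0:
            hist /= hist.sum()
        parts.append(hist)
    return np.concatenate(parts)                     # length = n_segments * n_words
```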
View shifting technique
During the testing phase, the individual may perform the action in a different orientation than in the training phase. For example, as illustrated in Fig. 2.4, the subject stands facing the first camera during training, where the training sample consists of the sequences captured by each camera.
During the testing phase, the subject performs the same action facing the second camera. The conventional Bag-of-Words (BoW) model, described in Section 1.2, passes this sequence through the classifiers; each classifier outputs the probability of the action belonging to a specific class, and the class with the highest average probability is selected. This approach yields good results when the subject's orientation during testing closely matches that of the training phase.
Figure 2.4. Illustration of view shifting in the testing phase
When the orientations of the test samples differ significantly, as illustrated in Fig. 2.4, classification results can be poor. To mitigate this issue, we systematically test all possible configurations of the test sequence by applying a cyclical view shift. Each test sample comprises a set of videos, one per view, and through this cyclical shifting we can identify the configuration that yields the best result:
$\tilde{x}^{(k)} = (x_{k}, x_{k+1}, \dots, x_{V}, x_{1}, \dots, x_{k-1}), \quad k = 1, \dots, V$   (2.1)

where $x_v$ is the video of view $v$ in the test sample and $k$ is the shift index.
When the cameras are symmetrically positioned, testing the various configurations can identify the orientation that best aligns with the subjects of the training set, leading to better results than relying solely on the initial testing configuration.
For each configuration, we identify the class label with the highest probability among the available classes. The final decision over all configurations is the class label that has the maximum probability.
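A sketch of the view-shifting test procedure, assuming a hypothetical `classifiers[v][a]` callable that returns the Platt-scaled probability of action a given the descriptor of view v:

```python
import numpy as np

def predict_with_view_shifting(test_descriptors, classifiers):
    """Try every cyclic shift of the per-view test descriptors (eq. 2.1) and
    keep the prediction with the highest class probability over all shifts.

    test_descriptors : list of V per-view descriptors
    classifiers      : classifiers[v][a] is a callable returning the
                       Platt-scaled probability of action a from view v
    """
    n_views = len(test_descriptors)
    n_actions = len(classifiers[0])
    best_prob, best_action = -1.0, None
    for k in range(n_views):
        shifted = test_descriptors[k:] + test_descriptors[:k]       # cyclic shift
        probs = np.zeros(n_actions)
        for v, descriptor in enumerate(shifted):
            for a in range(n_actions):
                probs[a] += classifiers[v][a](descriptor)
        probs /= n_views                                            # average over views
        if probs.max() > best_prob:
            best_prob, best_action = float(probs.max()), int(np.argmax(probs))
    return best_action, best_prob
```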
EXPERIMENTS
Setup
- Programming language: Python
- Libraries: numpy, scikit-learn, OpenCV
- Tools: STIP code by Ivan Laptev, ffmpeg
- Operating system: Ubuntu 16.04
- Device: 8 GB RAM, Intel Core i5 CPU 2.60 GHz
We extract STIP features for each video and save them to a corresponding file. We use the STIP code version 2.0 for Linux provided by Ivan Laptev (Appendix 1). Each extracted file contains the STIP feature vectors, one feature per line.
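A small loader we might use for these files, assuming '#'-prefixed comment lines and that the last 162 columns of each remaining line hold the concatenated HOG (72) and HOF (90) descriptor; the exact column layout should be checked against Appendix 1.

```python
import numpy as np

def load_stip_file(path, descriptor_dim=162):
    """Load one file produced by the STIP detector.

    Assumes '#'-prefixed comment lines and that the last `descriptor_dim`
    columns of every remaining line are the concatenated HOG (72) and
    HOF (90) descriptor; the leading columns hold point position/scale.
    """
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            rows.append(np.array(line.split(), dtype=float))
    data = np.vstack(rows)
    positions = data[:, :-descriptor_dim]     # point type, position, time, scales, ...
    descriptors = data[:, -descriptor_dim:]   # 72-bin HOG + 90-bin HOF
    return positions, descriptors
```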
Create the codebook using a Random Forest model
Since the source code of G. Burghouts et al. was not available, we implemented the method ourselves, using random selection of negative samples for the random forest algorithm.
We train one Random Forest model for each action class, using 1,000 positive STIP feature vectors (HOG/HOF descriptors) randomly selected from the STIP files of that action and 1,000 negative STIP feature vectors chosen from the STIP files of the other actions. While this model structure is effective for data clustering, excessively large training sets may slow down the algorithm.
Because of the inherent randomness of the algorithm and of the selection of positive and negative feature vectors, a new model is generated each time a Random Forest is trained. Consequently, we build three distinct codebooks: D1, D2, and D3. For each codebook, we systematically test the various configurations, which include incorporating temporal information, spatial information, and applying the view shifting technique. The final accuracy is obtained by averaging the results over the three codebooks D1, D2, and D3.
When training a Random Forest model with the scikit-learn library, several key parameters are set. The max_depth parameter, set to 5, limits each tree to at most 32 leaves, preventing overfitting. The n_estimators parameter is set to 10, specifying the number of trees in the forest. Finally, the max_features parameter is set to 'auto', letting the library choose the number of features considered when searching for the best split in each tree.
After each training session, the Random Forest models are saved, with each leaf representing a visual word (cluster). To quantize a video, its STIP features are passed through the trained forests and the leaf positions reached by these features are recorded. This process finally yields a 320-bin normalized histogram that characterizes the video.
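A sketch of the codebook training and video quantization with scikit-learn, following the stated parameters (10 trees, max_depth = 5, max_features = 'auto', which applies to older scikit-learn versions); the leaf-to-bin mapping shown is one simple choice and may differ from the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_codebook(pos_features, neg_features):
    """Train one Random Forest codebook for an action class from
    1,000 positive vs 1,000 negative 162-D STIP descriptors."""
    X = np.vstack([pos_features, neg_features])
    y = np.hstack([np.ones(len(pos_features)), np.zeros(len(neg_features))])
    forest = RandomForestClassifier(n_estimators=10, max_depth=5,
                                    max_features='auto')
    forest.fit(X, y)
    return forest

def video_histogram(forest, descriptors, leaves_per_tree=32):
    """Quantize a video's STIP descriptors into a 320-bin normalized
    histogram (10 trees x up to 32 leaves per tree)."""
    leaf_ids = forest.apply(descriptors)            # shape: (n_points, n_trees)
    hist = np.zeros(forest.n_estimators * leaves_per_tree)
    for t, tree in enumerate(forest.estimators_):
        # map this tree's leaf node ids to consecutive bins 0..leaves_per_tree-1
        leaf_nodes = np.where(tree.tree_.children_left == -1)[0]
        lookup = {node: i for i, node in enumerate(leaf_nodes)}
        for node in leaf_ids[:, t]:
            hist[t * leaves_per_tree + lookup[node]] += 1
    return hist / max(hist.sum(), 1.0)
```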
Classify using a Support Vector Machine
Using the codebook generated from a Random Forest model, we represent every training video by a visual-word histogram. We then use these descriptors to train binary Support Vector Machines (SVM) in a one-vs-all strategy for action classification. The SVM parameters are configured in the scikit-learn library with a chi-square (χ²) kernel, class weights set to 'balanced' to handle the imbalance between positive and negative samples, a penalty parameter C of 1 for regularization, and the probability option enabled to obtain posterior probabilities.
The output of each binary SVM corresponding to an action is the posterior probability of that action class.
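scikit-learn's SVC has no built-in χ² kernel, so one common workaround, used in the sketch below, is a precomputed kernel via sklearn.metrics.pairwise.chi2_kernel; this is an assumption about the implementation, not necessarily how the original code was written.

```python
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_action_classifier(train_hists, labels):
    """One-vs-all binary SVM with an exponential chi-square kernel,
    balanced class weights, C = 1 and Platt-scaled probabilities."""
    K_train = chi2_kernel(train_hists, train_hists)
    clf = SVC(kernel='precomputed', C=1.0, class_weight='balanced',
              probability=True)
    clf.fit(K_train, labels)              # labels: 1 = this action, 0 = the rest
    return clf

def posterior(clf, train_hists, test_hists):
    """Posterior probability of the positive class for each test video."""
    K_test = chi2_kernel(test_hists, train_hists)
    return clf.predict_proba(K_test)[:, list(clf.classes_).index(1)]
```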
To evaluate our proposed framework, we use the prediction accuracy and the confusion matrix.
Accuracy: the ratio of the number of correctly predicted samples to the total number of samples used for the assessment.
A confusion matrix is a specific table layout used to visualize the performance of a supervised learning algorithm; in unsupervised learning it is usually called a matching matrix. In this matrix, each row represents the instances of a predicted class and each column the instances of an actual class. The name "confusion matrix" reflects its ability to reveal whether the algorithm is mislabeling classes, making it easy to identify any confusion between them.
Figure 3.5. Illustration of a confusion matrix
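For these metrics we can rely on standard scikit-learn utilities; a minimal usage example with hypothetical labels (note that scikit-learn places true labels on the rows and predictions on the columns):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical ground-truth and predicted labels for four test samples
y_true = ['clap', 'wave1', 'kick', 'clap']
y_pred = ['clap', 'clap',  'kick', 'clap']

print(accuracy_score(y_true, y_pred))     # 0.75
print(confusion_matrix(y_true, y_pred, labels=['clap', 'wave1', 'kick']))
```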
3.4 Experiment results

Table 3.1 shows the performance on the WVU dataset: 68.36% prediction accuracy with the baseline method. After analyzing the WVU dataset, we found two main reasons why the accuracy was lower than expected. The first reason is the difference between training and testing samples that we mentioned in Section 3.3.1. In addition, the WVU dataset contains several actions that are extremely hard to discriminate because of the similarity of their extracted STIP features. We note that the STIP extractor captures points with a large intensity variation in both the spatial and temporal dimensions, i.e., points where the trajectory of the motion suddenly changes. For example, clapping, waving one hand and waving two hands have quite similar trajectories of the moving arm, which can lead to confusion between these action classes. The confusion matrix (Fig. 3.6a) shows two action groups that are hard to discriminate with the baseline BoW model: the first group contains clapping, waving one hand and waving two hands; the second group contains punching, kicking, throwing and bowling.
Table 3.1. Accuracy (%) of the basic BoW model on the WVU dataset
Table 3.2 presents the results achieved with the T-BoW model when the temporal domain is split into 2 and 3 bins, showing slight improvements of 3.09% and 1.39% in accuracy, respectively. When the action durations in the dataset are short, significant performance gains are not observed. Moreover, increasing the number of temporal segments can make the action descriptors less robust to intra-class variation, so the performance decreases if the number of temporal parts is too high.
Table 3.2. Accuracy (%) of the T-BoW model on the WVU dataset
Figure 3.6. Confusion matrices: a) basic BoW model with codebook D3, accuracy 70.83%; b) S-BoW model with 4 spatial parts, codebook D3, accuracy 82.41%
The S-BoW model was tested with parameters s = 3 and s = 4, resulting in a significant accuracy increase of 12.97%, from 68.36% to 81.33%, with s = 4. By dividing the bounding box into spatial bins along the height of the subject, this method effectively discriminates STIP features detected in different moving regions. As illustrated in Fig. 3.6b, the division into four bins improves the differentiation between actions such as clapping and waving, as well as kicking and bowling. However, confusion persists between classes such as waving one hand versus two hands, and between punching, throwing, and bowling. When the human body is divided spatially with s = 6, the accuracy does not meet expectations; while this division improves the distinction between the waving actions, it also introduces confusion among other action pairs because of the arbitrary views of the test samples.
Table 3.3. Accuracy (%) of the S-BoW model on the WVU dataset
Figure 3.7. Confusion matrices: a) S-BoW model with 6 spatial parts, codebook D3, accuracy 78.24%; b) S-BoW model with 6 spatial parts and view shifting, codebook D3, accuracy 96.67%
In our third experiment, we evaluated the view shifting technique, as summarized in Table 3.4 and illustrated by the confusion matrix in Fig. 3.7. This technique proved to be highly effective on the WVU dataset, as it successfully identifies the most similar orientations between testing and training samples. Notably, we observed significant performance improvements with the view shifting technique, achieving a 13.58% increase in accuracy over the baseline even without additional spatial information (s = 1). When spatial information was incorporated, the accuracy reached 92.28% with six spatial bins and 90.71% with four spatial bins.
Table 3.4. Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the WVU dataset
In the comparison of methods on the WVU dataset, we achieved an accuracy of 92.28% by employing spatial pooling with a 6-bin configuration and the view shifting technique during the testing phase. This shows that our proposed method surpasses other state-of-the-art techniques, including those based on convolutional neural networks. However, our approach relies on extracting STIP features, which may be less effective when the subject remains nearly static; for instance, the recognition of the standing-still class is based solely on the number of STIP features extracted.
In our experiments with the N-UCLA dataset, the baseline method achieved an average accuracy of 57.81% over the three codebooks, as shown in Table 3.6. Analyzing the confusion matrix (Fig. 3.8a) and reviewing the N-UCLA dataset revealed two main factors behind this low accuracy. Firstly, the dataset contains very similar action classes, such as picking up with one hand, picking up with two hands, and dropping trash. Secondly, the dataset suffers from inaccurate segmentation: in many action classes the individuals are walking throughout the video (e.g., picking up with one hand, picking up with two hands, dropping trash, carrying, and walking around), which reduces the robustness of both the local features and the Bag-of-Words (BoW) model.
Table 3.6. Accuracy (%) of the basic BoW model on the N-UCLA dataset
We then sequentially apply the T-BoW model, the S-BoW model, and the combination of the S-BoW model with the view shifting technique. As shown in Table 3.7, the results obtained with the T-BoW model, when dividing the temporal domain into 2 and 3 bins, indicate a slight performance improvement of approximately 1.5%.
Table 3.7. Accuracy (%) of the T-BoW model on the N-UCLA dataset
The S-BoW model shows a more significant improvement, achieving an accuracy of 63.24% when the bounding box of the subject is divided into six spatial parts, a 5.5% increase over the basic BoW model. Analysis of the confusion matrices indicates that, while the S-BoW model with six spatial divisions reduces some of the confusion among action classes, it does not completely eliminate the confusion for any specific pair.
Table 3.8. Accuracy (%) of the combination of the S-BoW model and view shifting on the N-UCLA dataset
Figure 3.8. Confusion matrices: a) basic BoW model, codebook D3, accuracy 59.57%; b) S-BoW model with 6 spatial parts, codebook D3, accuracy 63.40%