VII-O-6
HUMAN ACTIVITY RECOGNITION USING HKM AND
HYBRID DEEP NETWORK IN VIDEO
Vo Hoai Viet, Ly Quoc Ngoc, Tran Thai Son
Computer Vision and Robotics Department, University of Science, VNU-HCM
ABSTRACT
Recognizing human activity in video has many applications in computer vision and robotics, such as human-machine interaction, surveillance systems, data-driven automation and smart homes, and it has attracted much attention in recent years. In this paper, we present a novel segment-based approach for human activity recognition built on motion and appearance features. Instead of using one representation for the whole video, we divide a video into overlapping segments, and motion and appearance features are extracted from each segment. In the activity representation phase, we use HKM to build a Bag of Words over the segment descriptors, and a soft-weighting scheme is used to yield the histogram of word occurrences that represents the activity in the video. To obtain a good recognition rate, we propose a hybrid classification model based on a Sparse Auto-encoder and a Deep Neural Network. To demonstrate generalizability, our proposed method has been systematically evaluated on a variety of datasets and shown to be more effective and accurate for activity recognition than previous work. We obtain overall accuracies of 95.2%, 100%, and 84.5% with our proposed method on the KTH, Weizmann, and YouTube datasets, respectively.
Keywords: Activity Recognition, Segment-based Approach, Hierarchical K-means, Deep Network,
Sparse Auto-encoder
INTRODUCTION
Human activity recognition is a significant area of computer vision research today. It has a wide range of impactful applications in domains such as human-machine interaction, surveillance systems, data-driven automation, smart homes and robotics. The goal of human activity recognition is to automatically analyze ongoing activity from an unknown video (i.e., a sequence of image frames). Generally speaking, an activity recognition framework contains three main steps, namely feature extraction, activity representation (dimension reduction, ...) and pattern classification.
Though much progress has been made [12, 18], recognizing activity with high accuracy remains difficult due to the complexity and variability of activities: for example, variations in human pose configuration, the spontaneity of human activities, speed, the influence of background, the appearance of unexpected objects, illumination changes, partial occlusions, or different viewpoints [12, 18]. These effects can cause dramatic changes in the description of a certain activity, generating great intra-class variability. Deriving an effective activity representation from a sequence of images is therefore a vital step for successful activity recognition. There are two common approaches to extracting action features: local feature-based methods and global feature-based methods [18, 20].
In this research, we empirically study activity representation using a segment-based approach and holistic features. Specifically, the main contributions of this research are: i) first, we focus on developing effective activity features from shape and motion cues; the heart of our work is motivated by the success of temporal templates [3], which have been used for recognizing activity; ii) second, we propose an activity representation method based on segments and HKM to yield the histogram of word occurrences for each activity in a video; iii) third, we propose a hybrid deep network for activity classification. In this model, we apply a Sparse Auto-encoder to pre-train the Deep Neural Network, thereby improving the performance of the system. Through extensive experiments, we demonstrate that our approach is able to effectively reflect the visual cues in video, and thus outperforms the previous best published results on the KTH, Weizmann and YouTube datasets.
This paper is organized as follows: in Section 2, we review related work. In Section 3, we present feature extraction. In Section 4, we introduce the segment-based approach and activity representation. In Section 5, we present our approach for the activity classification phase. In Section 6, we present the experimental setup used to evaluate our approach. In Section 7, we show results from our experiments and discuss them. We conclude in Section 8.
RELATED WORK
The problem of human activity recognition has been studied extensively in the literature; comprehensive reviews of previous research can be found in [12, 18]. Our discussion in this section is restricted to a few influential parts of the literature, with a focus only on the most relevant works.
The first holistic-feature idea, temporal templates, was introduced by Bobick and Davis [3]. The authors presented a new approach for action representation: a binary motion energy image (MEI), which represents where motion has occurred in an image sequence, is generated, together with a motion history image (MHI), a scalar-valued image whose intensity is a function of the temporal history of motion. They use the two components (MEI and MHI) for the representation and recognition of human movement. Other approaches, moreover, are based on encoding the information of the region of interest (ROI) as a whole. The ROI is usually obtained through background subtraction or tracking, and common global representations are derived from silhouettes, edges or optical flow.
Besides, state-of-the-art approaches [2, 7, 9, 10, 11, 13, 14, 16, 17, 20] have reported good results on human activity datasets. Among these methods, local spatio-temporal features and bag-of-features (BoF) representations achieve remarkable performance for action recognition. Laptev et al. [10, 11] were the first to introduce space-time interest points, by extending the 2D Harris-Laplace detector. To produce denser space-time feature points, Dollar et al. [18] use a pair of 1D Gabor filters convolved with a spatial Gaussian to select local maximal cuboids. Moreover, many classical image features have been generalized to videos, e.g., 3D-SIFT [17], extended SURF [7], HOG3D [2], and local trinary patterns [14]. Among the local space-time features, dense trajectories [9] have been shown to perform best on a variety of datasets. In order to add context information to the representation, Nicolas Ballas [16] uses a static grid to capture context, while motion identifies regions with strong dynamics and light provides a coarse object segmentation for activity representation.
In this paper, our method falls into the holistic approach category. MEI and MHI features are extracted for each segment; moreover, HOG is extracted from the MEI and MHI images to capture more information about the structure of the movement. We also use HKM clustering to yield effective visual words in the activity representation phase.
FEATURE EXTRACTION
Feature extraction is an important step in an activity recognition system, since robust features increase the system's performance. Silhouettes are robust to appearance variations caused by internal texture and illumination, but they cannot represent the internal motion of an object. To capture both motion and appearance, we decompose motion-based recognition into first describing where there is motion (the spatial pattern) and then describing how the motion is moving. Shape-based and flow-based features each have fundamental limitations that can be overcome when the two feature types are combined, so activity features should contain both properties. In order to extract such features, we propose effective features: MEI and MHI fused with HOG.
Motion Energy Image
A motion energy image (MEI) [3] encodes where the motion occurred. Let $I(x, y, t)$ be an image sequence and let $D(x, y, t)$ be a binary image sequence indicating regions of motion. The MEI $E_\tau(x, y, t)$ at time $t$ and location $(x, y)$ is defined by:
$$E_\tau(x, y, t) = \bigcup_{i=1}^{\tau - 1} D(x, y, t - i)$$
Motion History Image
A motion history image (MHI) [3] encodes how the motion in the image is moving. The MHI is a weighted sum of past images in which the weights decay back through time; therefore, an MHI contains the past images within itself, with the most recent motion brighter than earlier motion. The MHI $H_\tau(x, y, t)$ at time $t$ and location $(x, y)$ is defined by:
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\ H_\tau(x, y, t - 1) - 1\bigr) & \text{otherwise} \end{cases}$$
where the motion mask $D(x, y, t)$ is a binary image obtained from frame subtraction, and $\tau$ is the maximum duration for which a motion is stored. In general, $\tau$ is chosen as the constant 25.
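To make the two templates concrete, the following is a minimal sketch of MEI and MHI computation with NumPy and OpenCV. The use of thresholded frame differencing for the motion mask $D(x, y, t)$, and the threshold value itself, are our assumptions for illustration; any background-subtraction method could supply $D$ instead.

```python
import cv2
import numpy as np

def compute_mei_mhi(frames, tau=25, motion_thresh=30):
    """Compute MEI and MHI over a list of grayscale frames (uint8 arrays).

    D(x, y, t) is obtained here by thresholded frame differencing, one
    plausible choice for the motion mask described in the text.
    """
    h, w = frames[0].shape
    mei = np.zeros((h, w), dtype=np.uint8)
    mhi = np.zeros((h, w), dtype=np.float32)
    for t in range(1, len(frames)):
        # Binary motion mask from the absolute difference of consecutive frames.
        diff = cv2.absdiff(frames[t], frames[t - 1])
        d = (diff > motion_thresh).astype(np.uint8)
        # MEI: union of the motion masks -> where motion occurred.
        mei = np.maximum(mei, d)
        # MHI: set to tau where motion occurs, otherwise decay by 1 per frame.
        mhi = np.where(d == 1, float(tau), np.maximum(0.0, mhi - 1.0))
    return mei, mhi
```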
Figure 1 Illustration of segment-based approach
Histogram of Oriented Gradient
HOG was proposed by Navneet Dalal in [15]. It describes the spatial appearance and shape of objects. The aggregation of histograms gives invariance to translation and rotation, and the use of an overlapping grid can partly overcome variations such as noise, partial occlusion, and changes in viewpoint. The HOG descriptor is extracted as follows:
Step 1: Normalize the image using histogram equalization to reduce illumination effects.
Step 2: Compute gradients (in the x and y directions) using the Sobel filter.
Step 3: Accumulate weighted votes for gradient orientation over spatial cells.
Step 4: Normalize contrast within overlapping blocks of cells.
Step 5: Concatenate the HOG descriptors from all blocks of the dense overlapping grid into a feature vector.
In our experiments, we adopted the HOG descriptor with a cell size of 20x20, a block size of 2x2 and 9 histogram bins.
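As an illustration of Steps 1-5 with these parameters, the sketch below applies scikit-image's hog function to the MEI and MHI images of a segment and concatenates the two descriptors. The L2-Hys block normalization is an assumption, since the norm is not named above, and the function names are ours.

```python
import numpy as np
from skimage import exposure
from skimage.feature import hog

def segment_descriptor(mei, mhi):
    """Describe one segment: HOG over its MEI and MHI images, concatenated."""
    descs = []
    for img in (mei.astype(np.float32), mhi.astype(np.float32)):
        # Step 1: histogram equalization to reduce illumination effects.
        img = exposure.equalize_hist(img)
        # Steps 2-5: gradients, cell histograms (9 bins, 20x20 cells),
        # block normalization (2x2 cells), and concatenation.
        descs.append(hog(img, orientations=9, pixels_per_cell=(20, 20),
                         cells_per_block=(2, 2), block_norm='L2-Hys'))
    return np.concatenate(descs)
```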
SEGMENT-BASED APPROACH AND ACTIVITY REPRESENTATION
Segment-based Approach
A segment-based approach divides a video into fixed-length segments. Segment-based approaches come in two types: non-overlapping and overlapping. With non-overlapping segments, a video is divided into contiguous, equal-length segments, which means information about the semantic boundaries of segments is not taken into account, even though this information is important for keeping the semantic meaning of each segment. This method has the advantage that the subsequent ranking algorithm does not have to deal with problems arising from length differences. A variant of this fixed-length method uses overlapping segments: the video is divided into overlapping, equal-length segments. This approach can be used to identify lexically and semantically coherent segments.
For all of these methods we have to determine the length of the segments, or the number of segments per video. For the activity recognition task described above, long segments have two clear disadvantages: first, longer segments have a higher risk of covering several subtopics and thus give a lower score on each of the included subtopics; second, long segments may include the relevant fragment while the beginning of the segment is nevertheless too far away from the jump-in point that should be found. Short segments, on the other hand, might get high rankings based on just a few words, and they also make the recognition process more costly. In our approach, we try segments of different lengths and select the optimal one by experiment. In this paper, we adopted a segment length of 15 frames with uniform segment sampling and 50% overlap, which roughly doubles the number of segments in each overlapping experiment.
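A minimal sketch of this uniform segment sampling (length 15, 50% overlap); the function name and the rounding of the stride are our choices:

```python
def split_into_segments(frames, seg_len=15, overlap=0.5):
    """Uniform sampling of fixed-length, overlapping segments of a video."""
    stride = max(1, int(seg_len * (1.0 - overlap)))  # 7 frames for 15 @ 50%
    segments = []
    for start in range(0, len(frames) - seg_len + 1, stride):
        segments.append(frames[start:start + seg_len])
    return segments
```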
Activity Representation
Bag of words (BOW) is a way of constructing a feature vector, based on the number of occurrences of each word, for classification. Each visual word is simply the feature vector of a patch. The major issue in BOW is the vector quantization algorithm used to create effective clusters. The original BOW used the k-means algorithm to quantize feature vectors; although k-means is widely used for clustering, its accuracy is not good in some cases. In 2006, D. Nister and H. Stewenius [5] proposed generating a vocabulary tree using a hierarchical k-means clustering scheme. Instead of solving one clustering problem with a large number of clusters, a tree-organized hierarchy of smaller clustering problems is solved. This helps visual words capture more information about the activity from different levels of the vocabulary.
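A sketch of the vocabulary-tree construction by recursive k-means, assuming scikit-learn's KMeans; the depth and minimum cluster size are illustrative parameters, while the branch factor of 10 matches the experimental setup described later.

```python
import numpy as np
from sklearn.cluster import KMeans

class HKMNode:
    """One node of the vocabulary tree: a centroid with child clusters."""
    def __init__(self, centroid, children):
        self.centroid = centroid
        self.children = children  # empty list at the leaves

def build_hkm(descriptors, branch=10, depth=3, min_points=2):
    """Recursively cluster descriptors into a hierarchical k-means tree."""
    def split(points, level):
        if level == depth or len(points) < max(branch, min_points):
            return []
        km = KMeans(n_clusters=branch, n_init=4).fit(points)
        children = []
        for c in range(branch):
            members = points[km.labels_ == c]
            children.append(HKMNode(km.cluster_centers_[c],
                                    split(members, level + 1)))
        return children
    return HKMNode(descriptors.mean(axis=0), split(descriptors, 0))
```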
In addition, binary weighting, which indicates the presence or absence of a visual word in the histogram of word occurrences with the values 1 and 0 respectively, has been used. Generally speaking, all such weighting schemes perform a nearest neighbor search in the vocabulary, in the sense that each interest point is mapped to the most similar visual word (i.e., the nearest cluster centroid). We argue that, for visual words, directly assigning an interest point to its nearest neighbor is not an optimal choice, given that two similar points may be clustered into different clusters as the size of the visual vocabulary increases. On the other hand, simply counting votes is not optimal either: two interest points assigned to the same visual word are not necessarily equally similar to that visual word, since their distances to the cluster centroid differ. Ignoring this similarity during weight assignment makes the contributions of the two interest points equal, and thus makes it more difficult to assess the importance of a visual word in a video.
Figure 2 Illustration of HKM clustering with branch factor of 3 [5]
Here we propose a straightforward soft-weighting approach to weight the significance of visual words. For each interest point descriptor, instead of searching only for the single nearest visual word, we select the top-K nearest visual words at each level of the hierarchy and weight them based on the percentage of the number of visual words at each cluster.
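Since the exact weight formula is described only informally, the sketch below shows one plausible reading: each descriptor votes for its top-K nearest words at every level of the tree, with weights proportional to inverse distance and normalized per level. The weighting rule and all names are our assumptions.

```python
import numpy as np

def soft_weight_histogram(descriptors, level_centroids, top_k=3, eps=1e-8):
    """Soft-weighted BOW histogram over all levels of the vocabulary tree.

    level_centroids: list with one (num_words, dim) centroid array per level.
    Each descriptor votes for its top_k nearest words at every level, with
    inverse-distance weights (an assumed stand-in for the percentage-based
    scheme described in the text).
    """
    hists = []
    for centroids in level_centroids:
        hist = np.zeros(len(centroids))
        for desc in descriptors:
            dists = np.linalg.norm(centroids - desc, axis=1)
            nearest = np.argsort(dists)[:top_k]     # top-K visual words
            weights = 1.0 / (dists[nearest] + eps)
            hist[nearest] += weights / weights.sum()  # votes sum to one
        hists.append(hist / max(1, len(descriptors)))
    return np.concatenate(hists)  # multi-level histogram of word occurrences
```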
ACTIVITY CLASSIFICATION
Classification is the final step of the activity recognition system. To perform reliable recognition, the first important requirement is that the features extracted from the training patterns be descriptive and distinctive. Besides this, we need a good classification model to discriminate between activities and reach an acceptable recognition rate. The state-of-the-art method for classification, the SVM [19], has been used in much research. However, deep learning is an emerging trend that has produced promising results in recent years. In this paper, we adopted a deep neural network, one kind of deep learning model: a Deep Neural Network is a neural network with three or more hidden layers. The traditional way to train a deep neural network is to pose an optimization problem by specifying a supervised cost function on the output layer with respect to the desired target, and to use a gradient-based optimization algorithm to adjust the weights and biases of the network so that its output has low cost on the samples in the training set. Unfortunately, deep networks trained in this manner have generally been found to perform worse than neural networks with one or two hidden layers [6, 8]. To overcome this problem, Dumitru Erhan et al. [4] answered the question "Why Does Unsupervised Pre-training Help Deep Learning?". Their research indicates that pre-training is a kind of regularization mechanism, minimizing variance and introducing a bias towards configurations of the parameter space that are useful for unsupervised learning [4, 8]. The greedy layer-wise unsupervised strategy provides an initialization procedure, after which the neural network is fine-tuned with respect to the global supervised objective. The deep network training algorithm is thus decomposed into two steps:
Step 1: greedily train subsets of the parameters of the network using a layer-wise, unsupervised learning criterion, repeating this procedure for each layer.
Step 2: fine-tune all the parameters of the network using back-propagation and stochastic gradient descent.
In this paper, we adopted the Sparse Auto-encoder [1] as the unsupervised learning criterion to build a deep neural network with 5 layers (3 hidden layers).
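For reference, the sparse auto-encoder objective of [1] adds a KL-divergence sparsity penalty on the average hidden activations to the usual reconstruction cost:

$$J_{\text{sparse}}(W, b) = J(W, b) + \beta \sum_{j=1}^{s} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j), \quad \text{where } \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_j}$$

Here $\hat{\rho}_j$ is the average activation of hidden unit $j$ over the training set, $\rho$ is the target sparsity level, and $\beta$ weights the sparsity penalty.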
Figure 3 Illustration of a deep neural network with 5 layers
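A minimal PyTorch sketch of the two-step procedure, assuming sigmoid units and plain SGD; the layer sizes follow the experimental setup in the next section, while the sparsity target rho, penalty weight beta and epoch count are illustrative values, not ours from the paper.

```python
import torch
import torch.nn as nn

def pretrain_layer(encoder, data, rho=0.05, beta=3.0, epochs=100, lr=0.2):
    """Step 1: train one layer as a sparse auto-encoder on its input `data`."""
    decoder = nn.Linear(encoder.out_features, encoder.in_features)
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        h = torch.sigmoid(encoder(data))           # hidden activations
        recon = torch.sigmoid(decoder(h))          # reconstruction of the input
        rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)  # average activation
        # KL-divergence sparsity penalty from the Sparse Auto-encoder of [1].
        kl = (rho * torch.log(rho / rho_hat) +
              (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
        loss = nn.functional.mse_loss(recon, data) + beta * kl
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(encoder(data)).detach()   # input for the next layer

# 1000 -> 500 -> 500 -> 500 -> num_classes, as in the experimental setup.
hidden = [nn.Linear(1000, 500), nn.Linear(500, 500), nn.Linear(500, 500)]
x = torch.rand(64, 1000)        # placeholder for BOW histograms of the videos
for layer in hidden:            # greedy, layer-wise unsupervised pre-training
    x = pretrain_layer(layer, x)

# Step 2: stack the pre-trained layers, append the output layer, and
# fine-tune end-to-end with back-propagation (e.g. nn.CrossEntropyLoss and
# torch.optim.SGD(model.parameters(), lr=0.2)) on the labelled training set.
num_classes = 6                 # e.g. the six KTH action classes
model = nn.Sequential(hidden[0], nn.Sigmoid(), hidden[1], nn.Sigmoid(),
                      hidden[2], nn.Sigmoid(), nn.Linear(500, num_classes))
```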
EXPERIMENTAL SETUP
Data sets
We evaluate our approach on three benchmark activity datasets (KTH, Weizmann, and YouTube), gathered from the authors' websites.
The KTH dataset consists of six action classes: walking, jogging, running, boxing, waving, and clapping. Four different scenarios are used: outdoors, outdoors with zooming, outdoors with different clothing, and indoors. There is considerable variation in performance and duration, and some variation in viewpoint. The backgrounds are relatively static, and apart from the zooming scenario there is only slight camera movement. We use a leave-one-subject-out setup and test on each original sequence while training on all other sequences together with their flipped versions.
Figure 4 Some samples from KTH dataset
The Weizmann dataset consists of ten action classes: bending downwards, running, walking, skipping, jumping-jack, jumping forward, jumping in place, galloping sideways, waving with two hands, and waving with one hand. The backgrounds are static, foreground silhouettes are included in the dataset, and the viewpoint is static. In addition to this dataset, two separate sets of sequences were recorded for robustness evaluation: one set shows walking movement viewed from different angles, and the second set shows front-parallel walking actions with slight variations (carrying objects, different clothing, and different styles). Leave-one-out evaluation is used to measure the performance of our approach. We train a multi-class classifier and report the average accuracy over all classes.
Figure 5 Some samples from Weizmann dataset
The YouTube dataset contains 11 activity categories: basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog. This dataset is challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. Similar to the KTH actions dataset, we train a multi-class classifier and report the average accuracy over all classes. We use a leave-one-subject-out setup and test on each original sequence while training on all other sequences together with their flipped versions.
Figure 6 Some samples from YouTube dataset
Framework Evaluation
In our experiments, the videos were down-sampled to a resolution of 160x120. After extracting the MEI and MHI for each segment of fixed length 15, we use HOG to create descriptors, and the BOW is constructed by HKM clustering of the segment descriptors. We set the number of visual words K at each level (the branch factor) to 10 for HKM. To limit complexity, we randomly selected 50,000 features from the training set for clustering. Features are assigned to their closest vocabulary words using Euclidean distance, with the top-N nearest visual words (N = 3) selected at each level. The resulting histograms of visual word occurrences are used as the video sequence representations.
For classification, we use a deep neural network with 5 layers (an input layer, 3 hidden layers, and an output layer) with the following parameters: the input layer has 1000 nodes, each hidden layer has 500 nodes, the size of the output layer is the number of activity classes in the dataset, the learning rate is 0.2, and the number of training iterations is 1000. In order to improve performance, we adopted the Sparse Auto-encoder to pre-train the deep neural network, which performs better than traditional training without pre-training.
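For convenience, the settings stated above can be collected into a single configuration; this is only a restatement of the reported numbers, with the dictionary keys chosen by us for illustration:

```python
# Summary of the experimental configuration reported above.
CONFIG = {
    "frame_size": (160, 120),      # down-sampled resolution
    "segment_length": 15,          # frames per segment
    "segment_overlap": 0.5,        # uniform sampling, 50% overlap
    "hog": {"cell": (20, 20), "block": (2, 2), "bins": 9},
    "hkm": {"branch_factor": 10, "sample_size": 50_000, "top_k": 3},
    "dnn": {"layers": [1000, 500, 500, 500], "learning_rate": 0.2,
            "iterations": 1000},   # output size = number of classes
}
```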
EXPERIMENTAL RESULTS
Table 3. Comparison with state-of-the-art results on the KTH dataset

Method                                                                      Accuracy (%)
Xinghua Sun [2009] (SIFT + ZNK)                                             94
Hand [2009] (different local features + grid layouts + object detectors)    94.1
Alexander Klaser [2010] (feature trajectories + HOG-HOF-MBH)                94.2
Gilbert [2009] (hierarchical data ...)
Nicolas Ballas [2013]
Our method                                                                  95.2
Table 4. Comparison with state-of-the-art results on the Weizmann dataset

Method                                                                      Accuracy (%)
Alexander Klaser [2010] (Harris3D + HOG3D)                                  90.7
Xinghua Sun [2009] (SIFT + ...)
Weinland and Boyer [2008] (exemplar-based embedding + silhouettes)          100
Fathi and Mori [2008] (smoothed optical flow + silhouettes + human tracks + AdaBoost)  100
Our method                                                                  100
Table 5. Comparison with state-of-the-art results on the YouTube dataset

Method                                                                      Accuracy (%)
Alexander Klaser [2010] (LKT ...)
Heng Wang [2013] (dense trajectories + HOG + HOF + MBH)                     84.1
Our method                                                                  84.5
In this paper, we formulate activity representation by dividing the video into segments and using BOW with HKM to create a histogram of word occurrences. Each segment descriptor mixes two properties: 1) the motion of the object; 2) the appearance of the object's movement. The relative importance of these elements depends on the nature of the activities that we aim to recognize. From previous experimental results [2, 20] and our own, we argue that no single category of feature can deal with all kinds of activity datasets equally well, so it is necessary and useful to use features that capture different properties of activity to improve recognition performance. In this paper, we use the segment-based approach to capture more information for action representation: MEI encodes where the motion occurred and MHI encodes how the motion in the image is moving, then HOG is applied to capture the appearance and structure of the motion, and HKM is used to improve the BOW over segment descriptors. Moreover, we use a deep neural network to improve the recognition rate. Tables 3, 4 and 5 compare our results with state-of-the-art results on the KTH, Weizmann and YouTube datasets, respectively. On KTH, our recognition rate is 95.2%, above the current best rate by 0.6%. Due to the relatively small amount of data, performance on the Weizmann dataset tops out at 100%, which our method matches. Likewise, our recognition rate on YouTube is 84.5%, better than the current best rate by 0.4%. Although the improvements in recognition rate are modest, they show that our approach is stable across datasets with the same configuration. In addition, our approach extracts activity features using algorithms that are fast to implement and easy to understand compared to existing techniques.
CONCLUSION
In this paper, we present an efficient approach for activity recognition. A video is divided into overlapping segments; from each segment, motion and appearance features are extracted with MEI and MHI and described by HOG. To represent activity, we improve the BOW model by using HKM and a soft-weighting scheme, in order to increase clustering accuracy and create a more robust activity representation. In the classification phase, we use a deep network to identify the most likely class for an input video; in addition, a hybrid deep network with a Sparse Auto-encoder is used to pre-train the classifier for recognizing human activity across activity datasets. Our approach is systematically evaluated on several benchmark datasets: KTH, Weizmann, and YouTube. The experimental results show strong performance compared to state-of-the-art methods.
In the future, we will investigate new features to improve the appearance, motion and context properties of the activity representation. We will also integrate activity detection into classification.
REFERENCES
[1] Andrew Ng, "CS294A Lecture Notes: Sparse Autoencoder".
[2] Alexander Klaser, Learning Human Actions in Videos, PhD thesis, INRIA Grenoble, 2010.
[3] A. Bobick and J. Davis: The Recognition of Human Movement Using Temporal Templates. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2001.
[4] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol and Pascal Vincent, Why Does Unsupervised Pre-training Help Deep Learning?, Journal of Machine Learning Research, 2010.
[5] D. Nister and H. Stewenius: Scalable Recognition with a Vocabulary Tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[6] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent (2009). The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-training. AISTATS 2009, pp. 153–160.
[7] G. Willems, T. Tuytelaars, and L. Van Gool. An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector. In ECCV, 2008.
[8] Hugo Larochelle, Yoshua Bengio, Jerome Louradour and Pascal Lamblin, Exploring Strategies for Training Deep Neural Networks, Journal of Machine Learning Research, 2009.
[9] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu: Dense Trajectories and Motion Boundary Descriptors for Action Recognition. International Journal of Computer Vision, Mar. 2013.
[10] I. Laptev: On Space-Time Interest Points. Int. J. Comput. Vision, vol. 64, no. 2-3, pp. 107–123, Sep. 2005.
[11] Ivan Laptev, Learning Realistic Human Actions from Movies, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[12] J. Aggarwal and M. Ryoo, "Human Activity Analysis: A Review," ACM Comput. Surv., Apr. 2011.
[13] J. Liu, J. Luo, and M. Shah. Recognizing Realistic Actions from Videos "in the Wild". In CVPR, 2009.
[14] L. Yeffet and L. Wolf. Local Trinary Patterns for Human Action Recognition. In ICCV, 2009.
[15] Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR, 2005.
[16] Nicolas Ballas, Yi Yang and Zhen-zhong Lan. Space-Time Robust Video Representation for Action Recognition. ICCV, 2013.
[17] P. Scovanner, S. Ali, and M. Shah. A 3-Dimensional SIFT Descriptor and Its Application to Action Recognition. In ACM Conference on Multimedia, 2007.
[18] Ronald Poppe, A Survey on Vision-based Human Action Recognition, Image and Vision Computing 28, 976–990, 2010.
[19] V. Vapnik: Statistical Learning Theory. John Wiley and Sons, New York, 1998.
[20] Xinghua Sun, Mingyu Chen and Alexander Hauptmann, Action Recognition via Local Descriptors and Holistic Features, Computer Vision and Pattern Recognition Workshop, IEEE, 2009.