Full-text Reports: Proceedings of the 9th Scientific Conference, University of Science, VNU-HCM
VII-O-6
HUMAN ACTIVITY RECOGNITION USING HKM AND
HYBRID DEEP NETWORK IN VIDEO
Vo Hoai Viet, Ly Quoc Ngoc, Tran Thai Son
Computer Vision and Robotics Department, University of Science, VNU-HCM
ABSTRACT
Recognizing human activity in video has many applications in computer vision and robotics, such
as human-machine interaction, surveillance systems, data-driven automation, and smart homes, and it has
attracted much attention in recent years. In this paper, we present a novel method for human
activity recognition using a segment-based approach for motion and appearance features. Instead of using a
single representation for the whole video, we divide a video into overlapping segments, and
motion and appearance features are extracted from each segment. In the activity representation phase, we use HKM to
build a Bag of Words vocabulary from the segment descriptors, and a soft-weighting scheme is used to yield the histogram
of word occurrences that represents the activity in a video. To achieve a high recognition
rate, we propose a hybrid classification model based on a Sparse Auto-encoder and a Deep Neural
Network. To demonstrate generalizability, our proposed method has been systematically evaluated on
a variety of datasets and shown to be more effective and accurate for activity recognition than
previous work. We obtain overall accuracies of 95.2%, 100%, and 84.5% with our proposed method
on the KTH, Weizmann, and YouTube datasets, respectively.
Keywords: Activity Recognition, Segment-based Approach, Hierarchical K-means, Deep Network,
Sparse Auto-encoder.
INTRODUCTION
Human activity recognition is a significant area of computer vision research today. It has a wide range of
applications in domains such as human-machine interaction, surveillance systems, data-driven automation,
smart homes, and robotics. The goal of human activity recognition is to automatically analyze the ongoing activity
in an unknown video (i.e., a sequence of image frames). Generally speaking, an activity recognition framework
contains three main steps: feature extraction, activity representation (including dimensionality reduction), and pattern
classification.
Though much progress has been made [12, 18], recognizing activity with high accuracy remains difficult
due to the complexity and variability of activities: variations in human pose configuration, the
spontaneity of human actions, speed, background influence, the appearance of unexpected objects, illumination
changes, partial occlusions, differing viewpoints, etc. [12, 18]. These effects can cause dramatic changes in
the description of a given activity, generating great intra-class variability. Deriving an effective activity
representation from a sequence of images is therefore a vital step for successful activity recognition. There are two common
approaches to extracting action features: local feature-based methods and global feature-based methods [18, 20].
In this research, we empirically study activity representation using a segment-based approach and holistic
features. Specifically, the main contributions of this research are: i) we develop effective
features for activity from shape and motion cues; the heart of our work is motivated by the success of
temporal templates [3] for recognizing activity; ii) we propose an activity representation
method based on segments and HKM that yields a histogram of word occurrences for each activity in a video; iii)
we propose a hybrid deep network for activity classification, in which a Sparse Auto-encoder pre-trains the Deep Neural
Network to improve the performance of the system. Through extensive
experiments, we demonstrate that our approach effectively reflects the visual cues in video, and thus
outperforms the previous best published results on the KTH, Weizmann, and YouTube datasets.
This paper is organized as follows: in section 2, we review related work. In section 3, we present feature
extraction. In section 4, we introduce the segment-based approach and activity representation. In section 5, we
present our approach to the activity classification phase. In section 6, we present the experimental setup used to evaluate
our approach. In section 7, we show results from our experiments and discuss them. We conclude in section 8.
RELATED WORKS
The problem of human activity recognition has been studied extensively in the literature; comprehensive
reviews of previous research can be found in [12, 18]. Our discussion in this section is restricted to the most
influential and relevant works.
The first holistic features were the temporal templates introduced by Bobick and Davis [3], who
presented a new approach to action representation. A binary motion energy image (MEI), which represents
where motion has occurred in the image sequence, is generated, together with a motion history image (MHI), a scalar-valued
image whose intensity is a function of the temporal history of motion at each pixel. The two components
(MEI and MHI) are then used for the representation and recognition of human movement. Other holistic approaches
encode the information of the region of interest (ROI) as a whole, where the ROI is usually obtained through
background subtraction or tracking. Common global representations are derived from silhouettes, edges, or
optical flow.
Besides these, state-of-the-art approaches [2, 7, 9, 10, 11, 13, 14, 16, 17, 20] have reported good results on human
activity datasets. Among them, local spatio-temporal features with bag-of-features (BoF) representations
achieve remarkable performance for action recognition. Laptev et al. [10, 11] were the first to introduce space-time
interest points, obtained by extending the 2D Harris-Laplace detector. To produce denser space-time feature points, Dollar et
al. [18] convolve a pair of 1D Gabor filters with a spatial Gaussian and select local maxima as cuboids.
Many classical image features have also been generalized to videos, e.g., 3D-SIFT [17], extended SURF
[7], HOG3D [2], and local trinary patterns [14]. Among the local space-time features, dense trajectories [9] have
been shown to perform best on a variety of datasets. To add context information to the representation,
Nicolas Ballas et al. [16] use a static grid to capture context, motion to identify regions with strong dynamics, and
light to provide a coarse object segmentation for activity representation.
In this paper, our method falls into the holistic category. MEI and MHI features are extracted
for each segment, and HOG is then extracted from the MEI and MHI images to capture more
information about the structure of the movement. We also use HKM clustering to yield effective visual words in the
activity representation phase.
FEATURE EXTRACTION
Feature extraction is an important step in an activity recognition system: robust features help the system
increase its performance. Silhouettes are robust to appearance variations due to internal texture and illumination,
but they are unable to represent the internal motion of an object. To capture both motion and appearance, we
decompose motion-based recognition into first describing where there is motion (the spatial pattern) and then
describing how the motion is moving. Shape-based and flow-based features each have fundamental limitations
that can be overcome when the two feature types are combined; good activity features should therefore
capture both properties. To extract activity features, we propose effective features that fuse MEI and
MHI with HOG.
Motion Energy Image
A motion energy image (MEI) [3] encodes where the motion occurred. Let I(x, y, t) be an image sequence
and let D(x, y, t) be a binary image sequence indicating regions of motion. The MEI E_τ(x, y, t) at time t and at
location (x, y) is defined by:
$$E_\tau(x, y, t) = \bigcup_{i=0}^{\tau-1} D(x, y, t - i)$$
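The following is a minimal NumPy sketch of this union, assuming the motion mask D is obtained by thresholded differencing of consecutive frames; the function name and threshold value are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def motion_energy_image(frames, tau=25, threshold=30):
    """MEI E_tau at the last frame: the union of the binary motion
    masks D(x, y, t - i) over the most recent tau frames.

    frames: sequence of grayscale frames (2-D uint8 arrays), oldest first.
    """
    recent = frames[-(tau + 1):]                 # tau differences need tau+1 frames
    mei = np.zeros(recent[0].shape, dtype=bool)
    for prev, curr in zip(recent[:-1], recent[1:]):
        diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
        mei |= diff > threshold                  # accumulate where motion occurred
    return mei.astype(np.uint8)
```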
Motion History Image
A motion history image (MHI) [3] encodes how the motion in the image is moving. The MHI is a weighted
sum of past images, with weights that decay back through time; it therefore contains the past images
within itself, where the most recent motion is brighter than earlier motion. The MHI H_τ(x, y, t) at time t and at
location (x, y) is defined by:
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
where the motion mask D(x, y, t) is a binary image obtained from the subtraction of consecutive frames, and τ is the
maximum duration for which a motion is stored. In general, τ is chosen as the constant 25.
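A matching sketch of the MHI recurrence, again with hypothetical helper names; it consumes the same binary masks D and applies the set/decay rule above frame by frame:

```python
import numpy as np

def motion_history_image(motion_masks, tau=25):
    """MHI H_tau: pixels with motion are set to tau, all others decay
    by 1 per frame, so recent motion appears brighter than old motion.

    motion_masks: sequence of binary arrays D(x, y, t), oldest first.
    """
    h = np.zeros(motion_masks[0].shape, dtype=np.float32)
    for d in motion_masks:
        h = np.where(d == 1, float(tau), np.maximum(0.0, h - 1.0))
    return h / tau   # scaled to [0, 1], convenient as input to HOG
```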
Figure 1. Illustration of the segment-based approach
Histogram of Oriented Gradient
HOG was proposed by Dalal and Triggs [15]. It describes the spatial appearance and shape of objects.
Aggregating histograms gives invariance to small translations and rotations. In addition, an
overlapping grid can partly overcome variations such as noise, partial occlusion, and changes in viewpoint.
The HOG descriptor is extracted as follows:
Step 1: Normalize the image using Histogram Equalization to reduce the illumination effect.
Step 2: Compute gradients (x and y direction) using Sobel filter.
Step 3: Accumulate weighted votes for orientation gradient over spatial cells.
Step 4: Normalize contrast within overlapping blocks of cells.
Step 5: Concatenate the HOG descriptors from all blocks of the dense overlapping grid into a feature vector.
In our experiments, we adopted the HOG descriptor with a cell size of 20x20 pixels, a block size of 2x2 cells, and 9
histogram bins.
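As an illustration, the same descriptor can be computed with scikit-image's hog function using the parameters above; applying equalization first and feeding an MEI or MHI image are our assumptions about how the pieces fit together:

```python
from skimage.exposure import equalize_hist
from skimage.feature import hog

def hog_descriptor(image):
    """HOG with the settings used in this paper: 9 orientation bins,
    20x20-pixel cells, 2x2-cell blocks with overlapping normalization."""
    normalized = equalize_hist(image)            # Step 1: reduce illumination effects
    return hog(normalized,
               orientations=9,                   # Step 3: 9-bin orientation histograms
               pixels_per_cell=(20, 20),
               cells_per_block=(2, 2),           # Step 4: contrast-normalized blocks
               block_norm='L2-Hys')              # Step 5: concatenated feature vector
```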
Segment-based Approach and Activity Representation
Segment-based Approach
A segment-based approach divides a video into fixed-length segments. Segment-based
approaches can be divided into two types: non-overlapping and overlapping segments. With non-overlapping
segments, a video is divided into contiguous, equal-length segments. This means information about the
semantic boundaries of a segment is not taken into account, even though this information is important for
preserving the semantic meaning of each segment. This method has the advantage that the subsequent
recognition algorithm does not have to deal with problems arising from length differences. A variant of this
fixed-length method uses overlapping segments: the video is divided into overlapping, equal-length segments,
which better captures semantically coherent segments.
For all of these methods we have to determine the segment length or the number of segments for a
video. For the activity recognition task described above, long segments have two clear disadvantages: they
have a higher risk of covering several sub-activities, and thus score lower on each of the included
sub-activities; and they risk including the relevant fragment while starting too far away from it. Short
segments, on the other hand, might receive high scores based on just a few frames. Furthermore, short segments
make the recognition process more costly. In our approach, we test different segment lengths and select the
optimal one by experiment. In this paper, we adopt a segment length of 15 frames and uniform segment
sampling with 50% overlap, which roughly doubles the number of segments compared to non-overlapping
sampling.
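A small sketch of this uniform overlapping sampling; the index arithmetic is our own, since the paper fixes only the segment length and the overlap:

```python
def split_into_segments(num_frames, seg_len=15, overlap=0.5):
    """Return (start, end) frame indices of fixed-length segments
    sampled uniformly with the given fractional overlap."""
    stride = max(1, int(seg_len * (1.0 - overlap)))  # 15 frames, 50% -> stride 7
    return [(s, s + seg_len)
            for s in range(0, num_frames - seg_len + 1, stride)]

# For a 60-frame video: [(0, 15), (7, 22), (14, 29), ...] -- roughly
# twice as many segments as non-overlapping sampling would produce.
```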
Activity Representation
Bag of words (BOW) constructs a feature vector based on the number of occurrences of each visual
word, for use in classification. Each visual word is simply the feature vector of a patch. The major issue in BOW is the vector
quantization algorithm used to create effective clusters. The original BOW used the k-means algorithm to quantize
feature vectors; although k-means is widely used for clustering, its accuracy is poor in some cases. In 2006,
D. Nister and H. Stewenius [5] proposed generating a vocabulary tree using a hierarchical k-means clustering
scheme: instead of solving one clustering problem with a large number of clusters, a tree-organized hierarchy of
smaller clustering problems is solved. This helps the visual words capture information about the activity at
different levels of the vocabulary.
In addition, binary weighting of the histogram of word occurrences, which indicates the presence or absence
of a visual word with the values 1 and 0 respectively, has been used. Generally speaking, all such weighting schemes
perform a nearest-neighbor search in the vocabulary, in the sense that each interest point is mapped to the most
similar visual word (i.e., the nearest cluster centroid). We argue that, for visual words, directly assigning an
interest point to its nearest neighbor is not an optimal choice, given that two similar points may be
clustered into different clusters as the visual vocabulary grows. On the other hand, simply
counting the votes is not optimal either: two interest points assigned to the same visual word are
not necessarily equally similar to that visual word, since their distances to the cluster centroid
differ. Ignoring this similarity during weight assignment makes the contributions of the two
interest points equal, and thus makes it more difficult to assess the importance of a visual word in a video.
Figure 2. Illustration of HKM clustering with a branching factor of 3 [5]
Here we propose a straightforward soft-weighting approach to weight the significance of visual words. For
each interest point descriptor, instead of searching only for the nearest visual word, we select the top-K nearest
visual words at each level of the hierarchy and weight them based on the proportion of descriptors assigned to
each cluster.
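Below is a sketch of both ideas using scikit-learn's KMeans: a vocabulary tree built by recursive clustering, and a soft-weighted histogram in which each descriptor votes for its top-K nearest words. The 1/2^k decay is a common soft-assignment choice standing in for the paper's occupancy-based weights; the function names are our own:

```python
import numpy as np
from sklearn.cluster import KMeans

def hkm_centers(descriptors, branch=10, depth=2):
    """Hierarchical k-means in the spirit of [5]: cluster into `branch`
    groups, then recursively cluster each group. Centroids from every
    level are kept, so the vocabulary mixes coarse and fine words."""
    km = KMeans(n_clusters=branch, n_init=4).fit(descriptors)
    centers = [km.cluster_centers_]
    if depth > 1:
        for c in range(branch):
            subset = descriptors[km.labels_ == c]
            if len(subset) >= branch:
                centers.extend(hkm_centers(subset, branch, depth - 1))
    return centers                     # list of (branch, dim) arrays

def soft_weighted_histogram(segment_descriptors, vocabulary, top_k=3):
    """Each descriptor votes for its top_k nearest visual words; the
    k-th nearest receives weight 1/2**k instead of a single hard vote."""
    hist = np.zeros(len(vocabulary))
    for d in segment_descriptors:
        dists = np.linalg.norm(vocabulary - d, axis=1)
        for k, idx in enumerate(np.argsort(dists)[:top_k]):
            hist[idx] += 1.0 / 2 ** k
    return hist / max(hist.sum(), 1e-9)   # L1-normalized representation

# vocabulary = np.vstack(hkm_centers(training_descriptors))
```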
ACTIVITY CLASSIFICATION
Classification is the final step of the activity recognition system. To perform reliable recognition, the
features extracted from the training patterns must first be descriptive and distinctive. We also need a good
classification model to achieve an acceptable recognition rate. The established method for classification is the
SVM [19], which has been used in much research; however, deep learning is an emerging trend that has been
applied in many studies with promising results in recent years. In this paper, we adopt a deep neural network, a
form of deep learning with three or more hidden layers. The traditional way to train a deep neural network is to
treat it as an optimization problem, specifying a supervised cost function on the output layer with respect to the
desired target and using a gradient-based optimization algorithm to adjust the weights and biases of the network
so that its output has low cost on the training samples. Unfortunately, deep networks trained in this manner have
generally been found to perform worse than neural networks with one or two hidden layers [6, 8]. To address this
problem, Dumitru Erhan et al. [4] answer the question "Why Does Unsupervised Pre-training Help Deep Learning?":
their research indicates that pre-training is a kind of regularization mechanism, minimizing variance and
introducing a bias towards configurations of the parameter space that are useful for unsupervised learning [4, 8].
The greedy layer-wise unsupervised strategy provides an initialization procedure, after which the neural
network is fine-tuned on the global supervised objective. The deep network training algorithm is
decomposed into two steps:
Step 1: greedily train subsets of the network's parameters using a layer-wise, unsupervised learning
criterion, repeating the procedure for each layer.
Step 2: fine-tune all the parameters of the network using back-propagation and stochastic gradient descent.
In this paper, we adopt the Sparse Auto-encoder [1] as the unsupervised learning criterion to build a deep neural
network with 5 layers (3 hidden layers), as sketched below.
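The sketch below illustrates this two-step procedure: a minimal sparse auto-encoder following Ng's notes [1] (squared reconstruction error plus weight decay and a KL sparsity penalty), stacked greedily to initialize the hidden layers. The hyperparameters and batch-gradient loop are illustrative assumptions, not the authors' implementation; the supervised fine-tuning of Step 2 is standard back-propagation and is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sparse_autoencoder(X, n_hidden, epochs=200, lr=0.5,
                             rho=0.05, beta=3.0, lam=1e-4):
    """One sparse auto-encoder layer in the spirit of [1]: squared
    reconstruction error + L2 weight decay + KL sparsity penalty that
    pushes the mean hidden activation towards rho."""
    n_in = X.shape[1]
    rng = np.random.default_rng(0)
    W1 = rng.normal(0.0, 0.01, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.01, (n_hidden, n_in)); b2 = np.zeros(n_in)
    m = float(len(X))
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)                      # encode
        Xr = sigmoid(H @ W2 + b2)                     # decode / reconstruct
        rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)
        d_out = (Xr - X) * Xr * (1.0 - Xr)            # output-layer delta
        sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
        d_hid = (d_out @ W2.T + sparsity) * H * (1.0 - H)
        W2 -= lr * (H.T @ d_out / m + lam * W2); b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * (X.T @ d_hid / m + lam * W1); b1 -= lr * d_hid.mean(axis=0)
    return W1, b1

def greedy_pretrain(X, layer_sizes=(500, 500, 500)):
    """Step 1: stack sparse auto-encoders, feeding each layer's hidden
    activations to the next; the returned (W, b) pairs initialize the
    deep network before supervised fine-tuning (Step 2, not shown)."""
    weights, A = [], X
    for n_hidden in layer_sizes:
        W, b = train_sparse_autoencoder(A, n_hidden)
        weights.append((W, b))
        A = sigmoid(A @ W + b)
    return weights
```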
Figure 3. Illustration of a deep neural network with 5 layers.
EXPERIMENTAL SETUP
Data sets
We evaluate our approach on three benchmark activity datasets (KTH, Weizmann, and YouTube), gathered
from the authors' websites.
The KTH dataset consists of six action classes: walking, jogging, running, boxing,
waving, and clapping. Four different scenarios are used: outdoors, outdoors with zooming, outdoors with
different clothing, and indoors. There is considerable variation in performance and duration, and some variation in
viewpoint. The backgrounds are relatively static and, apart from the zooming scenario, there is only slight
camera movement. We use a leave-one-subject-out setup and test on each original sequence while training on all
other sequences together with their flipped versions; a sketch of this protocol follows.
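This protocol corresponds to scikit-learn's LeaveOneGroupOut splitter; the function and variable names below are hypothetical (X holds the video representations, y the activity labels, subjects the performer of each video):

```python
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_subject_out_accuracy(clf, X, y, subjects):
    """Mean accuracy under the leave-one-subject-out protocol: each fold
    holds out every video of one subject and trains on the rest."""
    scores = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=subjects):
        clf.fit(X[tr], y[tr])
        scores.append(clf.score(X[te], y[te]))
    return sum(scores) / len(scores)
```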
Figure 4. Some samples from KTH dataset
The Weizmann dataset consists of ten action classes: bending downwards, running,
walking, skipping, jumping-jack, jumping forward, jumping in place, galloping sideways, waving with two
hands, and waving with one hand. The backgrounds and viewpoint are static, and foreground silhouettes are
included in the dataset. In addition, two separate sets of sequences were recorded for
robustness evaluation: one shows walking viewed from different angles; the other shows
fronto-parallel walking with slight variations (carrying objects, different clothing, and different styles).
Leave-one-out evaluation is used to measure the performance of our approach: we train a multi-class classifier
and report the average accuracy over all classes.
Figure 5. Some samples from Weizmann dataset
The YouTube dataset contains 11 activity categories: basketball shooting, biking/cycling, diving, golf
swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking,
and walking with a dog. This dataset is challenging due to large variations in camera motion, object appearance
and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. As for the KTH
dataset, we train a multi-class classifier and report the average accuracy over all classes, using a
leave-one-subject-out setup in which we test on each original sequence while training on all other sequences
together with their flipped versions.
Figure 6. Some samples from YouTube dataset
Framework Evaluation
In our experiments, the videos were down-sampled to a resolution of 160x120. After extracting the MEI and
MHI for each fixed-length segment of 15 frames, we use HOG to create the segment descriptors, and the BOW is
constructed from them with HKM clustering. We set the number of visual words per level, K, to 10 for HKM. To
limit complexity, we randomly select 50,000 descriptors from the training features for clustering. Features are
assigned to their closest vocabulary words using the Euclidean distance, with the top-N nearest visual words set
to 3 at each level. The resulting histograms of visual word occurrences are used as the video sequence representations.
For classification, we use a deep neural network with 5 layers (an input layer, 3 hidden layers, and an output layer):
the input layer has 1000 nodes, each hidden layer has 500 nodes, the number of output nodes equals the number of
activity classes in the dataset, the learning rate is 0.2, and the number of training iterations is 1000. To improve
performance, we adopt the Sparse Auto-encoder to pre-train the deep neural network, which performs better than
traditional training without pre-training.
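For reference, the supervised fine-tuning stage with these settings can be mirrored in scikit-learn; MLPClassifier offers no built-in sparse-auto-encoder pre-training, so this sketch reproduces only the architecture and Step 2:

```python
from sklearn.neural_network import MLPClassifier

# 1000-d input (the soft-weighted BOW histogram), three hidden layers of
# 500 units each; the output size is inferred from the class labels.
clf = MLPClassifier(hidden_layer_sizes=(500, 500, 500),
                    activation='logistic',
                    solver='sgd',
                    learning_rate_init=0.2,
                    max_iter=1000)
# clf.fit(train_histograms, train_labels)      # histograms: shape (n, 1000)
# accuracy = clf.score(test_histograms, test_labels)
```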
EXPERIMENTAL RESULTS
Table 1. Comparison with state-of-the-art results on the KTH dataset

Methods                                                                     Accuracy (%)
Xinghua Sun [2009] (SIFT + ZNK)                                             94.0
Hand [2009] (different local features + grid layouts + object detectors)   94.1
Alexander Klaser [2010] (feature trajectories + HOG-HOF-MBH)                94.2
Gilbert [2009] (hierarchical data mining)                                   94.5
Nicolas Ballas [2013] (Space-Time + Context)                                94.6
Our Approach                                                                95.2
Table 2. Comparison with state-of-the-art results on the Weizmann dataset

Methods                                                                               Accuracy (%)
Alexander Klaser [2010] (Harris3D + HOG3D)                                            90.7
Xinghua Sun [2009] (SIFT + ZNK)                                                       97.8
Weinland and Boyer [2008] (exemplar-based embedding + silhouettes)                    100
Fathi and Mori [2008] (smoothed optical flow + silhouettes + human tracks + AdaBoost) 100
Our Approach                                                                          100
Table 3. Comparison with state-of-the-art results on the YouTube dataset

Methods                                                     Accuracy (%)
Liu [2009]                                                  71.2
Alexander Klaser [2010] (KLT trajectories + HOG-HOF-MBH)    79.8
Heng Wang [2013] (dense trajectories + HOG + HOF + MBH)     84.1
Our Approach                                                84.5
In this paper, we formulate activity representation by dividing the video into segments and using BOW with
HKM to create a histogram of word occurrences. Each segment descriptor mixes two properties: 1) the motion
of the object; 2) the appearance of the object's movement. The relative importance of these elements depends on
the nature of the activities we aim to recognize. From previous experimental results [2, 20] and our own, we
argue that no single category of feature can handle all kinds of activity datasets equally well, so it is necessary
and useful to use features that capture different properties of an activity to improve recognition performance.
In this paper, we use a segment-based approach to capture more information for action representation: MEI
encodes where the motion occurred and MHI encodes how the motion is moving; HOG is then applied to capture
the appearance and structure of the motion; and HKM is used to improve the BOW over segment descriptors.
Moreover, we use a deep neural network to improve the recognition rate.
Tables 1, 2, and 3 compare our results with state-of-the-art results on the KTH, Weizmann, and
YouTube datasets, respectively. On KTH, our recognition rate is 95.2%, exceeding the previous best rate by 0.6%.
Due to the relatively small amount of data, performance on the Weizmann dataset tops out at 100%, which our
method matches. On YouTube, our recognition rate is 84.5%, better than the previous best rate by 0.4%. Although
the improvements are modest, they show that our approach is stable across datasets with the same configuration.
In addition, our approach extracts activity features using algorithms that are quick to implement and easy to
understand compared to existing techniques.
CONCLUSION
In this paper, we presented an efficient approach to activity recognition. A video is divided into overlapping
segments, and motion and appearance features are extracted from each segment as MHI and MEI images
described by HOG. To represent an activity, we improve the BOW model by using HKM and a soft-weighting
scheme, increasing clustering accuracy and creating a more robust activity representation. In the classification
phase, we use a deep network to identify the most likely class for the input video; a hybrid deep network with a
Sparse Auto-encoder is used to pre-train the classifier for recognizing human activity across activity datasets.
Our approach is systematically evaluated on several benchmark datasets (KTH, Weizmann, and YouTube), and
the experimental results show strong performance compared to state-of-the-art methods.
In future work, we will investigate new features to improve the appearance, motion, and context properties
of the activity representation. We will also integrate activity detection into classification.
REFERENCES
[1] Andrew Ng, "CS294A Lecture notes: Sparse Autoencoder".
[2] Alexander Klaser, Learning human actions in videos, PhD thesis, INRIA Grenoble, 2010.
[3] A. Bobick and J. Davis, The Recognition of Human Movement Using Temporal Templates, IEEE
Trans. on Pattern Analysis and Machine Intelligence, 2001.
[4] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol and Pascal Vincent, Why
Does Unsupervised Pre-Training Help Deep Learning?, Journal of Machine Learning Research, 2010.
[5] D. Nister and H. Stewenius, Scalable recognition with a vocabulary tree, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2006.
[6] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, The difficulty of training deep
architectures and the effect of unsupervised pre-training, in AISTATS 2009, pp. 153-160.
[7] G. Willems, T. Tuytelaars, and L. Van Gool, An efficient dense and scale-invariant spatio-temporal
interest point detector, in ECCV, 2008.
[8] Hugo Larochelle, Yoshua Bengio, Jerome Louradour and Pascal Lamblin, Exploring Strategies for
Training Deep Neural Networks, Journal of Machine Learning Research, 2009.
[9] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, Dense trajectories and motion boundary descriptors
for action recognition, International Journal of Computer Vision, Mar. 2013.
[10] I. Laptev, On space-time interest points, Int. J. Comput. Vision, vol. 64, no. 2-3, pp. 107-123, Sep. 2005.
[11] Ivan Laptev, Learning realistic human actions from movies, in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2008.
[12] J. Aggarwal and M. Ryoo, Human activity analysis: A review, ACM Comput. Surv., Apr. 2011.
[13] J. Liu, J. Luo, and M. Shah, Recognizing realistic actions from videos in the wild, in CVPR, 2009.
[14] L. Yeffet and L. Wolf, Local trinary patterns for human action recognition, in ICCV, 2009.
[15] Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, in CVPR, 2005.
[16] Nicolas Ballas, Yi Yang and Zhen-zhong Lan, Space-Time Robust Video Representation for Action
Recognition, in ICCV, 2013.
[17] P. Scovanner, S. Ali, and M. Shah, A 3-dimensional SIFT descriptor and its application to action
recognition, in ACM Conference on Multimedia, 2007.
[18] Ronald Poppe, A survey on vision-based human action recognition, Image and Vision Computing 28,
pp. 976-990, 2010.
[19] V. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, 1998.
[20] Xinghua Sun, Mingyu Chen, Alexander Hauptmann, Action Recognition via Local Descriptors and
Holistic Features, in CVPR Workshops, IEEE, 2009.