PhD: Tran Thi Thanh Hai
HANOI – 2018
Independence – Freedom – Happiness
Full name of the thesis author: Khổng Văn Minh
Thesis title: Combining appearance and motion features in human action representation using convolutional neural networks
Major: Information Systems
Student ID: CBC17021
The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting dated… ………… with the following contents:
Date (day, month, year)
In this thesis, I focus on solving the action recognition problem in a video or a stack of consecutive frames. This problem plays an important role in surveillance systems, which are very popular nowadays. There are two main approaches to this problem: using hand-crafted features, or using features learned by deep learning. Both approaches have pros and cons, and the approach that I study belongs to the second category. Recently, advanced techniques relying on convolutional neural networks have produced impressive improvements compared to traditional techniques based on hand-crafted features. Besides, the literature also shows that using different streams of data helps to increase recognition performance. This thesis proposes a method that exploits both RGB and optical flow for human action recognition. Specifically, we deploy a two-stream convolutional neural network that takes RGB and optical flow computed from the RGB stream as inputs. Each stream has the architecture of an existing 3D convolutional neural network (C3D), which has been shown to be compact but efficient for the task of action recognition from video. Each stream works independently; the streams are then combined by early fusion or late fusion to output the recognition results. We show that the proposed two-stream 3D convolutional neural network (two-stream C3D) outperforms the one-stream C3D on three datasets: UCF101 (from 82.79% to 89.11%), HMDB51 (from 45.71% to 60.87%), and CMDFALL (from 65.35% to 71.77%).
Firstly, I would like to express my deep gratitude to my supervisor, PhD Tran Thi Thanh Hai, for supporting my research direction, which allowed me to explore new ideas in the field of computer vision and machine learning. I would like to thank her for her supervision, encouragement, motivation, and support; her guidance helped me throughout the research work and the writing of the thesis.
I would like to acknowledge the International Research Institute MICA, HUST for providing me with a great research environment.
I wish to express my gratitude to the teachers in the Computer Vision department, MICA, for giving me the opportunity to work and acquire great research experience.
I would like to acknowledge the School of Information and Communication Technology for providing me with the knowledge and the opportunity to study.
I would like to thank my friends for supporting me in my study.
Last but not least, I would like to convey my deepest gratitude to my family for their support and sacrifices during my studies.
1 Introduction to Human Action Recognition
1.1 Human Action Recognition problem
1.2 Overview of human action recognition approach
1.2.1 Hand crafted feature based methods
1.2.2 Deep learning based methods
1.2.3 Purpose of thesis
2 State-of-the-art on HAR using CNN
2.1 Introduction to Convolutional Neural Networks
2.2 2D Convolutional Neural Networks
2.3 3D Convolutional Neural Networks
2.4 Multistream Convolutional Neural Networks
3 Proposed method for HAR using multistream C3D
3.1 General framework
3.2 RGB stream
3.3 Optical Flow Stream
3.4 Fusion of multistream 3D CNN
3.4.1 Early fusion
3.4.2 Late fusion
4 Experimental Results
4.1 Datasets
4.1.1 UCF101 dataset
4.1.2 HMDB51 dataset
4.1.3 CMDFALL dataset
4.2 Experiment setup
4.3 Single stream
4.4 Multiple stream
5 Conclusion
5.1 Pros and Cons
5.2 Discussion
List of Figures
1-1 Human Action Recognition Problem
1-2 Human Action Recognition phases
1-3 Hand-crafted feature based method for Human Action Recognition
1-4 Deep learning method for Human Action Recognition problem
2-1 Main layers in Convolutional Neural Networks
2-2 Fusion techniques used in [1]
2-3 3D convolution operator
2-4 Two stream architecture for Human Action Recognition in [2]
3-1 General framework for human action recognition
3-2 Early fusion by concatenating two L2-normalized feature vectors
3-3 Late fusion by averaging class scores
4-1 The class labels in UCF101 dataset
4-2 The class labels in HMDB51 dataset
4-3 Experiment steps for each dataset
4-4 The steps using C3D for experiment
4-5 C3D clip and video prediction
4-6 Confusion matrix of two stream on UCF101
4-7 Confusion matrix of two stream on HMDB51
4-8 Confusion matrix of two stream on CMDFALL
4-9 In HMDB51, the most confused action in the RGB stream is swing baseball; 60% of its videos are confused with throw
4-10 Most benefit classes in UCF101 when combining compared to RGB stream
4-11 Most benefit classes in HMDB51 when combining compared to RGB stream
4-12 Most benefit classes in HMDB51 when combining compared to RGB stream
4-13 Classes of UCF101 in which RGB stream performs better
4-14 Classes of UCF101 in which Flow stream performs better
4-15 Classes of HMDB51 in which RGB stream performs better
4-16 Classes of HMDB51 in which Flow stream performs better
4-17 Classes of CMDFALL in which RGB stream performs better
4-18 Classes of CMDFALL in which Flow stream performs better
3DCNN 3D Convolutional Neural Networks
CNN Convolutional Neural Networks
HAR Human Action Recognition
HOG Histogram of Oriented Gradients
MBH Motion boundary histograms
SIFT Scale-invariant feature transform
List of Tables
2.1 Results of fusion techniques on the 200,000 videos of the Sport1M test set. Hit@k indicates the fraction of test samples that contained at least one of the ground truth labels in the top k predictions [1]
2.2 C3D results on different tasks
2.3 Two-stream architecture mean accuracy (%) on the UCF101 and HMDB51 datasets
4.1 Class tree of CMDFALL dataset
4.2 Accuracy of action recognition on single and multiple streams C3D (%)
4.3 Comparison results on two popular benchmark datasets (%)
Chapter 1
Introduction to Human Action Recognition
1.1 Human Action Recognition problem
Human action recognition is an important topic in the computer vision domain. It has many applications, such as surveillance systems in hospitals, abnormal activity detection in buildings (banks, airports, hotels), and human-machine interaction. There are various types of human activities. Depending on their complexity, we can categorize human activities into four levels: gestures, actions, interactions, and group activities.
∙ Gestures are elementary movements of a person's body part, and are the atomic components describing the meaningful motion of a person. Examples: "stretching an arm", "raising a leg".
∙ Actions are single-person activities that may be composed of multiple gestures organized temporally, such as "walking", "waving", and "punching".
∙ Interactions are human activities that involve two or more persons and/or objects. For example, "two persons fighting" is an interaction between two humans, and "drinking water" is an interaction between a human and an object.
∙ Group activities are activities performed by conceptual groups composed of multiple persons and/or objects. Example: "a group of persons marching".
Figure 1-1: Human Action Recognition Problem
In this thesis, we focus on human action recognition. The problem of human action recognition can be defined as follows.
∙ Input: A video or a sequence of consecutive frames that contains a human action.
∙ Output: The label of the action, which belongs to one of the predefined classes.
Human action recognition is challenging for researchers in the computer vision domain because of noisy backgrounds, viewpoint changes, and the variety in how each person performs an action. Figure 1-1 illustrates the human action recognition problem.
Key components of a visual recognition system
Figure 1-2 illustrates the two phases of a recognition system.
∙ Training: Learn from the training dataset to obtain the parameters of the recognition model.
∙ Recognition: Use the model learned in the training phase to recognize new data.
Figure 1-2: Human Action Recognition phases
Each phase in the system has the main components below:
∙ Preprocessing data: Convert the data to a form that is compatible with the model.
∙ Feature extraction: From the preprocessed data, extract suitable features for representing the human action. The features can be obtained by hand-crafted or deep learning techniques.
∙ Classification: Use the features extracted in the previous step as the input for training the classifier or for prediction.
∙ Recognition: New data passes through the preprocessing and feature extraction steps, and then the trained classifier predicts the label.
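The two phases and their shared components can be sketched in a few lines of Python. This is only an illustration of the data flow, not the method used in this thesis: the helper names (`preprocess`, `extract_features`, `NearestCentroidClassifier`, `train`, `recognize`) and the toy features (mean color plus mean absolute frame difference) are assumptions made for the example.

```python
import numpy as np

def preprocess(video):
    """Convert raw uint8 frames of shape (T, H, W, 3) to float32 in [0, 1]."""
    return video.astype(np.float32) / 255.0

def extract_features(video):
    """Toy feature vector (illustrative only): mean color of the clip
    and mean absolute difference between consecutive frames (needs T >= 2)."""
    appearance = video.mean(axis=(0, 1, 2))                       # shape (3,)
    motion = np.abs(np.diff(video, axis=0)).mean(axis=(0, 1, 2))  # shape (3,)
    return np.concatenate([appearance, motion])

class NearestCentroidClassifier:
    """Stand-in for a generic trainable classifier."""
    def fit(self, features, labels):
        self.classes_ = np.unique(labels)
        self.centroids_ = np.stack(
            [features[labels == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, features):
        dist = np.linalg.norm(
            features[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dist.argmin(axis=1)]

def train(videos, labels):
    """Training phase: preprocess, extract features, fit the classifier."""
    feats = np.stack([extract_features(preprocess(v)) for v in videos])
    return NearestCentroidClassifier().fit(feats, np.asarray(labels))

def recognize(model, video):
    """Recognition phase: same preprocessing and features, then predict."""
    feat = extract_features(preprocess(video))[None, :]
    return model.predict(feat)[0]
```

The point of the sketch is that both phases share the preprocessing and feature extraction steps; only the last step differs (fitting versus predicting).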
Figure 1-3: Hand-crafted feature based method for Human Action Recognition
1.2 Overview of human action recognition approach
1.2.1 Hand crafted feature based methods
In this approach, human actions are represented by features that are manually designed by highly experienced researchers. Once features are extracted, they are input to a generic trainable classifier for action recognition. The building blocks of the hand-crafted feature based approach are illustrated in Figure 1-3:
∙ Feature extraction: Takes the image or video pixels as input and outputs the features for that image or video.
∙ Classification: A classifier that takes the features as input and outputs the class label.
There are many types of hand-crafted features designed by experts to solve the human action recognition problem. Many classical image features have been generalized to videos, e.g. 3D-SIFT and HOG3D. Among local space-time features, dense trajectories have been shown to perform best on a variety of datasets. The main idea is to densely sample feature points in each frame and track them in the video based on optical flow. Multiple descriptors are computed along the trajectories of the feature points to capture shape, appearance, and motion information. Motion boundary histograms (MBH) give the best results among these descriptors. The idea of dense trajectories has been extended by the work of Wang and Schmid [3], which improves performance by accounting for camera motion and achieves the state of the art among hand-crafted features. Despite its good performance, this method is computationally intensive.
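To make the MBH idea concrete, the sketch below computes a simplified motion boundary histogram for a single optical flow field: the spatial derivatives of each flow component are quantized into a magnitude-weighted orientation histogram. It is a minimal sketch under stated assumptions, not the dense trajectory implementation: the function names and the choice of 8 bins are illustrative, and the full descriptor is aggregated over space-time cells along each trajectory, which is omitted here.

```python
import numpy as np

def orientation_histogram(gx, gy, n_bins=8):
    """Histogram of gradient orientations, weighted by gradient magnitude
    and L2-normalized (left as zeros if there is no gradient at all)."""
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)               # in [0, 2*pi)
    bins = np.minimum((angle / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def mbh_descriptor(flow, n_bins=8):
    """flow: (H, W, 2) array holding horizontal and vertical displacement.
    Returns the concatenated MBHx and MBHy histograms, length 2 * n_bins."""
    hists = []
    for c in range(2):
        # np.gradient returns derivatives along rows (y), then columns (x).
        gy, gx = np.gradient(flow[..., c])
        hists.append(orientation_histogram(gx, gy, n_bins))
    return np.concatenate(hists)
```

Because the descriptor is built from derivatives of the flow, a constant flow field (for example, a pure camera translation) produces an all-zero descriptor, which is why MBH is robust to that kind of camera motion.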
Figure 1-4: Deep learning method for Human Action Recognition problem
1.2.2 Deep learning based methods
On the other hand, a learning-based representation approach, specifically deep learning, uses computational models with multiple processing layers to learn multiple levels of abstraction from data. This learning encompasses a set of methods that enable the machine to process the data in raw form and automatically transform it into a suitable representation needed for classification. This is what we call trainable feature extractors. This transformation process is handled at different layers. These layers are learned from raw data using a general-purpose learning procedure, which does not need to be designed manually by experts. The performance of human action recognition methods mainly depends on an appropriate and efficient representation of the data.
Recently, deep learning has achieved very good results on image-based tasks [4]. These results inspired researchers to extend it to video classification, especially to solve the human action recognition problem. To deal with video input, the authors in [1] use a 2D CNN on individual frames and explore the temporal information by fusing information over the temporal dimension through the network. In [5], [6], the authors use the 3D convolution operator to learn temporal information. In [2], the authors decompose video into spatial and temporal parts. Deep learning methods require a large amount of training data to achieve good results. In [1], the authors construct a large-scale dataset named Sport1M, which consists of 1 million videos downloaded from YouTube and annotated with 487 classes. Features learned from this dataset generalize well to other datasets such as UCF101 [7].
variation. Then, instead of using only one RGB stream, we deploy both streams (RGB and optical flow). Each stream goes through an independent C3D network, and the streams are then combined at the fully-connected or score level. We evaluate the proposed method on popular, challenging benchmark datasets (UCF101 and HMDB51) and on a dataset built by MICA (CMDFALL), and show that the two-stream C3D outperforms the original one-stream C3D.
The thesis is organized as follows. In chapter 2, we present the state of the art on Human Action Recognition using CNN. In chapter 3, we describe our proposed method using a 3D convolutional neural network for action recognition with a two-stream architecture. In chapter 4, we report the results on UCF101, HMDB51, and CMDFALL and analyse them. Chapter 5 concludes and gives ideas for future work.
Chapter 2
State-of-the-art on HAR using CNN
2.1 Introduction to Convolutional Neural Networks
Convolutional Neural Networks (CNN) are biologically inspired variants of Multilayer Perceptrons. They have been very effective in areas such as image recognition and classification. There are four main types of layers used to build ConvNet architectures: Convolutional Layer, Non-Linearity Layer, Pooling Layer, and Fully-Connected Layer. We stack these layers to form a full ConvNet architecture.
Figure 2-1: Main layers in Convolutional Neural Networks
Convolutional layer
The CONV layer is the core building block of a Convolutional Network. The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on the first layer of a ConvNet might have size 5x5x3 (5 pixels in width and height, and 3 channels for an RGB image). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at each position. As we slide the filter over the width and height of the input volume, we produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature, such as an edge of some orientation or a blotch of some color on the first layer. Each CONV layer has an entire set of filters, and each of them produces a separate 2-dimensional activation map. We stack these activation maps along the depth dimension to produce the output volume.
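The forward pass described above can be written directly as nested loops. The sketch below is a naive illustration, not an efficient implementation: the function name `conv_layer_forward` is hypothetical, padding and bias terms are left out for brevity, and the operation is the cross-correlation commonly used in CNN implementations.

```python
import numpy as np

def conv_layer_forward(x, filters, stride=1):
    """x: input volume (H, W, C); filters: (K, F, F, C), one F x F x C filter
    per output channel. No padding. Returns the output volume (H_out, W_out, K)."""
    H, W, C = x.shape
    K, F, _, C_f = filters.shape
    assert C == C_f, "each filter must span the full depth of the input"
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):                       # one activation map per filter
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i * stride:i * stride + F,
                          j * stride:j * stride + F, :]
                out[i, j, k] = np.sum(patch * filters[k])   # dot product
    return out
```

For a 32x32x3 input and six 5x5x3 filters, this yields six 28x28 activation maps stacked into a 28x28x6 output volume.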
Non-Linearity layer (ReLU)
An additional operation called ReLU is applied after every convolution operation. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by Output = Max(0, Input). ReLU is an element-wise operation (applied per pixel) that replaces all negative values in the feature map with zero. The purpose of ReLU is to introduce non-linearity into our ConvNet, since most of the real-world data we would want our ConvNet to learn is non-linear. Other non-linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.
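As a small illustration of the element-wise definition Output = Max(0, Input), the following snippet applies ReLU to a toy feature map (the array values are made up for the example):

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit: negative values become zero."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 0.5],
                        [3.0, -0.1]])
activated = relu(feature_map)   # negative entries replaced by 0
```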
Pooling layer
The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation in this case takes a max over 4 numbers (a 2x2 region in some depth slice). The depth dimension remains unchanged.
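The 2x2, stride-2 case can be sketched with a reshape, which groups every depth slice into non-overlapping 2x2 blocks. The function name `max_pool_2x2` is an illustrative choice, and the sketch assumes even width and height.

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W, D) with even H and W. Returns (H/2, W/2, D)."""
    H, W, D = x.shape
    assert H % 2 == 0 and W % 2 == 0
    # blocks[i, a, j, b, d] == x[2*i + a, 2*j + b, d]
    blocks = x.reshape(H // 2, 2, W // 2, 2, D)
    return blocks.max(axis=(1, 3))           # max over each 2x2 block
```

The output holds one value for every four inputs, which is the 75% reduction mentioned above, while the depth D is untouched.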
Fully-Connected layer
Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.
In this thesis, we focus on presenting some related works for action recognition using CNN techniques. We categorize them into three groups: methods based on 2D convolutional neural networks; methods based on 3D convolutional neural networks; and methods that use multiple streams.
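Returning briefly to the fully-connected layer described above, its computation (a matrix multiplication followed by a bias offset) can be sketched as follows. The function name and the shapes in the docstring are assumptions for the example.

```python
import numpy as np

def fully_connected(x, weights, bias):
    """x: (N, D_in) flattened activations of the previous layer;
    weights: (D_in, D_out); bias: (D_out,). Returns (N, D_out)."""
    return x @ weights + bias

# Activations coming from a conv/pool layer are flattened first,
# e.g. a batch of (N, H, W, C) volumes becomes (N, H * W * C).
volume = np.ones((2, 4, 4, 3))
flat = volume.reshape(volume.shape[0], -1)    # shape (2, 48)
```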
2.2 2D Convolutional Neural Networks
Figure 2-2: Fusion techniques used in [1]
Recently, 2D ConvNets have obtained very good results on image-based tasks [4]. Encouraged by these results, the authors in [1] study multiple approaches for extending CNNs to video input. As a baseline, they use a 2D CNN model operating on a single frame to evaluate the contribution of static appearance to the classification accuracy. To learn the information that lies in the temporal domain and study how it influences performance, they use the fusion techniques shown in Figure 2-2:
∙ Early fusion: from the sequence of frames, they take T consecutive frames as the input to a CNN whose first-layer filters are extended to size 11×11×3×T. In this paper, they use T=10, which is approximately a third of a second. This technique combines the information immediately at the pixel level and allows the network to learn the local motion information of the video.
∙ Late fusion: they use 2 separate CNN towers, each taking a single frame as input. The two frames are chosen at a distance of 15 frames in the video. The temporal information is then combined at the first fully-connected layer, which is a high-level abstraction. This model can learn the global motion information in the video.
∙ Slow fusion: This model is a balanced mix of the two approaches, slowly combining the temporal information across the network. The lower layers process local temporal information, while the higher layers have access to more global temporal information. In this paper, the first convolutional layer applies filters of temporal extent T=4 with stride 2 to an input clip of 10 frames. The second and third layers process their inputs with temporal extent T=2 and stride 2. Thus, the third convolutional layer has access to information across all 10 frames.
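The temporal arithmetic of slow fusion can be checked with the usual output-length formula for a convolution without padding. The helper name `temporal_output_length` is an assumption for this sketch; the extents and strides are the ones quoted above.

```python
def temporal_output_length(n_in, extent, stride):
    """Number of temporal positions produced by a convolution without
    padding over n_in time steps."""
    return (n_in - extent) // stride + 1

n = 10                       # input clip of 10 frames
lengths = [n]
for extent, stride in [(4, 2), (2, 2), (2, 2)]:   # conv layers 1, 2, 3
    n = temporal_output_length(n, extent, stride)
    lengths.append(n)
# lengths is [10, 4, 2, 1]
```

The temporal dimension shrinks from 10 to 4, then 2, then 1, so each response of the third layer depends on all 10 input frames.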
They conducted experiments on the large-scale Sport1M dataset. This dataset consists of 1 million videos downloaded from YouTube and annotated with 487 classes. The results in Table 2.1 show that the slow fusion model performs best.
Table 2.1: Results of fusion techniques on the 200,000 videos of the Sport1M test set. Hit@k indicates the fraction of test samples that contained at least one of the ground truth labels in the top k predictions [1].
Model Clip Hit@1 Video Hit@1 Video Hit@5
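The Hit@k metric defined in the caption of Table 2.1 can be computed as in the sketch below. The function name `hit_at_k` and the input layout (a score matrix plus one set of ground truth labels per sample) are assumptions for the example, not the evaluation code of [1].

```python
import numpy as np

def hit_at_k(scores, true_labels, k):
    """scores: (N, C) class scores; true_labels: list of N sets of ground
    truth class indices (a sample may carry several labels).
    Returns the fraction of samples whose top-k predictions contain at
    least one ground truth label."""
    top_k = np.argsort(-scores, axis=1)[:, :k]       # k highest-scoring classes
    hits = [len(set(row.tolist()) & labels) > 0
            for row, labels in zip(top_k, true_labels)]
    return float(np.mean(hits))
```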
2.3 3D Convolutional Neural Networks
In [5], [6], the authors extend the 2D convolution operator to the temporal dimension for video analysis tasks. They propose to perform 3D convolutions in the convolution steps of CNNs to compute features from both the spatial and temporal dimensions. The 3D convolution is achieved by convolving a 3D kernel with the fixed-size cube formed by stacking multiple contiguous frames together, as shown in Figure 2-3. By this construction, the feature maps in the convolution layer are connected to multiple contiguous frames in the previous layer, thereby capturing motion information.
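A naive single-channel version of the 3D convolution operator is sketched below to show how one output value depends on several contiguous frames. The function name `conv3d_single` is hypothetical; the sketch uses stride 1, no padding, and the cross-correlation form, and it ignores multiple channels and filters.

```python
import numpy as np

def conv3d_single(volume, kernel):
    """volume: (T, H, W) cube of stacked frames; kernel: (t, h, w).
    Returns a (T - t + 1, H - h + 1, W - w + 1) feature volume."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for a in range(out.shape[0]):            # temporal position
        for i in range(out.shape[1]):        # vertical position
            for j in range(out.shape[2]):    # horizontal position
                cube = volume[a:a + t, i:i + h, j:j + w]
                out[a, i, j] = np.sum(cube * kernel)
    return out
```

A kernel that subtracts the first frame of its window from the last one gives zero response on a static clip and a non-zero response when pixel values change over time, which is the sense in which a 3D kernel can capture motion.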
In [5], the experiments are performed on the TRECVID 2008 data and the KTH data. The TRECVID 2008 data set consists of 49 hours of real-world video captured at London Gatwick Airport. The KTH dataset consists of 6 action classes performed by 25 subjects. The input in the experiment with TRECVID 2008 is a 7-frame cube, while with the KTH dataset it is a 9-frame cube. The results show that the 3D convolutional networks outperform the 2D CNN by a noticeable margin.
Figure 2-3: 3D convolution operator
In [6], the authors propose a 3D convolutional network called C3D to learn spatio-temporal features on the large-scale dataset Sport1M. They show that C3D has great learning capacity, captures the information well, and can process a large number of videos. They trained C3D on large-scale datasets: I380K and Sport1M. The trained model can be used as a feature extractor on other datasets. They show that the 3D CNN architecture effectively learns features from video by conducting experiments on different tasks: Activity Recognition, Action Similarity Labeling, and Scene and Object Recognition. Table 2.2 shows the results of C3D on these tasks. C3D outperforms most of the previous methods by a noticeable margin. Thus, C3D is very generic in capturing appearance and motion information in videos.
Table 2.2: C3D results on different tasks
Dataset   Sport1M   UCF101   ASLAN   YUPENN   UMD    Object
Method    [8]       [9]      [10]    [11]     [11]   [12]
2.4 Multistream Convolutional Neural Networks
In [2], the authors decompose videos into spatial and temporal components by using RGB frames and optical flow. These components are then fed into separate ConvNets to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream performs video recognition on its own, and for the final classification the softmax scores are combined by late fusion.
∙ The spatial stream operates on individual video frames and performs action recognition from still images.
∙ The temporal stream operates on the motion information of the video, in the form of stacked optical flow displacement fields between several consecutive frames.
For the spatial stream, the input to the network is a randomly selected frame from the video. A 224×224 sub-image is randomly cropped from the selected frame; it then undergoes random horizontal flipping and RGB jittering. For the temporal stream, they study several techniques to form the input:
∙ Optical flow stacking: The optical flow is computed by Brox's method. By stacking the horizontal and vertical components of the flow of L consecutive frames, they create an input volume of size 224×224×2L for the network.
∙ Trajectory stacking: An alternative motion representation, inspired by trajectory-based descriptors, replaces the optical flow sampled at the same locations across several frames with the flow sampled along the motion trajectories.
∙ Bi-directional optical flow: The optical flow in the above techniques is forward flow. In the bi-directional method, they stack L/2 forward flows computed from the L/2 frames that follow the current frame and L/2 backward flows from the L/2 frames before the current frame.
∙ Mean flow subtraction: To zero-center the network input, they subtract from each displacement field its mean vector.
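Two of the input techniques listed above, optical flow stacking and mean flow subtraction, reduce to simple array manipulations once the flow fields are available. The sketch below assumes the L flow fields are already computed and stored in an array of shape (L, H, W, 2); the function names and the channel ordering are illustrative choices, not taken from [2].

```python
import numpy as np

def stack_flows(flows):
    """flows: (L, H, W, 2) horizontal/vertical displacement fields.
    Returns an (H, W, 2L) input volume with channels ordered
    u_1, v_1, u_2, v_2, ..., u_L, v_L."""
    L, H, W, _ = flows.shape
    return flows.transpose(1, 2, 0, 3).reshape(H, W, 2 * L)

def subtract_mean_flow(flows):
    """Subtract from each displacement field its own mean vector."""
    mean = flows.mean(axis=(1, 2), keepdims=True)    # shape (L, 1, 1, 2)
    return flows - mean
```

With L = 10 and 224×224 crops, `stack_flows` gives a 224×224×20 volume. A flow field that is constant over the image, as produced by a global translation, becomes all zeros after `subtract_mean_flow`.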
Figure 2-4: Two stream architecture for Human Action Recognition in [ ]2
They report that mean flow subtraction is helpful, as it reduces the effect of global motion between the frames. The bi-directional optical flow input performs best for the temporal stream. However, for ConvNet fusion, the uni-directional optical flow with multi-task learning is the most beneficial. The results show that combining multiple streams of information improves performance significantly (6% over the temporal and 14% over the spatial nets). This means that the information in the RGB and optical flow images is complementary.
Method                                    UCF101   HMDB51
Two-stream model (fusion by averaging)    86.9     58.0
Two-stream model (fusion by SVM)          88.0     59.4
Table 2.3: Two-stream architecture mean accuracy (%) on the UCF101 and HMDB51 datasets
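The "fusion by averaging" entry of Table 2.3 corresponds to late fusion of the two streams' softmax scores. A minimal sketch is given below; the function names and the toy logits in the usage note are assumptions for the example, and the SVM-based variant is not shown.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion_average(spatial_logits, temporal_logits):
    """Average the class scores of the two streams.
    Returns (predicted class index, fused scores)."""
    scores = (softmax(spatial_logits) + softmax(temporal_logits)) / 2.0
    return scores.argmax(axis=-1), scores
```

For example, with spatial logits `[2.0, 1.0, 0.0]` and temporal logits `[0.0, 3.0, 0.0]`, the spatial stream alone would pick class 0, but the more confident temporal stream moves the fused decision to class 1.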
In the work of [13], the authors build upon the architecture of [2] and study fusion methods for the two networks in both the spatial and temporal dimensions.
They study the spatial fusion techniques below:
∙ Sum fusion: Compute the sum of two feature maps at the same spatial location.
∙ Max fusion: Take the maximum over the two feature maps