MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
KHONG VAN MINH
COMBINATION OF APPEARANCE AND MOTION INFORMATION IN HUMAN ACTION REPRESENTATION USING CONVOLUTIONAL NEURAL NETWORK
FIELD OF STUDY: INFORMATION SYSTEM
MASTER’S THESIS
IN INFORMATION SYSTEM
SUPERVISOR:
Dr. Tran Thi Thanh Hai
HANOI – 2018
SĐH.QT9.BM11
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Full name of the thesis author: Khổng Văn Minh
Thesis title: Combination of appearance and motion features in human action representation using convolutional neural networks
Major: Information Systems
Student ID: CBC17021
The author, the scientific supervisor and the thesis examination committee confirm that the author has corrected and supplemented the thesis according to the minutes of the committee meeting held on … ………… with the following contents:
………
Date: ………
CHAIRMAN OF THE COMMITTEE
Abstract

In this thesis, I focus on solving the action recognition problem in video, i.e., in a stack of consecutive frames. This problem plays an important role in surveillance systems, which are very popular nowadays. There are two main solutions to this problem: using hand-crafted features or using features learned with deep learning. Both solutions have pros and cons, and the approach I study belongs to the second category. Recently, advanced techniques relying on convolutional neural networks have produced impressive improvements compared to traditional techniques based on hand-crafted features. Besides, the literature also shows that using different streams of data helps to increase recognition performance. This thesis proposes a method that exploits both RGB and optical flow for human action recognition. Specifically, we deploy a two-stream convolutional neural network that takes RGB and the optical flow computed from the RGB stream as inputs. Each stream has the architecture of an existing 3D convolutional neural network (C3D), which has been shown to be compact but efficient for the task of action recognition from video. Each stream works independently; the two are then combined by early fusion or late fusion to output the recognition results. We show that the proposed two-stream 3D convolutional neural network (2stream C3D) outperforms the one-stream C3D on three benchmark datasets: UCF101 (from 82.79% to 89.11%), HMDB51 (from 45.71% to 60.87%) and CMDFALL (from 65.35% to 71.77%).
Acknowledgements

Firstly, I would like to express my deep gratitude to my supervisor, Dr. Tran Thi Thanh Hai, for supporting my research direction, which allowed me to explore new ideas in the field of computer vision and machine learning. I thank her for her supervision, encouragement, motivation, and support; her guidance helped me throughout the research work and the writing of this thesis.

I would like to acknowledge the International Research Institute MICA, HUST for providing me a great research environment.

I wish to express my gratitude to the teachers in the Computer Vision department, MICA, for giving me the opportunity to work and acquire great research experience.

I would like to acknowledge the School of Information and Communication Technology for providing me the knowledge and the opportunity to study.

I would like to thank my friends for supporting me in my studies.

Last but not least, I would like to convey my deepest gratitude to my family for their support and sacrifices during my studies.
Contents

1 Introduction to Human Action Recognition
1.1 Human Action Recognition problem
1.2 Overview of human action recognition approaches
1.2.1 Hand-crafted feature based methods
1.2.2 Deep learning based methods
1.2.3 Purpose of thesis
2 State-of-the-art on HAR using CNN
2.1 Introduction to Convolutional Neural Networks
2.2 2D Convolutional Neural Networks
2.3 3D Convolutional Neural Networks
2.4 Multistream Convolutional Neural Networks
3 Proposed method for HAR using multistream C3D
3.1 General framework
3.2 RGB stream
3.3 Optical Flow Stream
3.4 Fusion of multistream 3D CNN
3.4.1 Early fusion
3.4.2 Late fusion
4 Experimental Results
4.1 Datasets
4.1.1 UCF101 dataset
4.1.2 HMDB51 dataset
4.1.3 CMDFALL dataset
4.2 Experiment setup
4.3 Single stream
4.4 Multiple stream
5 Conclusion
5.1 Pros and Cons
5.2 Discussion
List of Figures

1-1 Human Action Recognition Problem
1-2 Human Action Recognition phases
1-3 Hand-crafted feature based method for Human Action Recognition
1-4 Deep learning method for Human Action Recognition problem
2-1 Main layers in Convolutional Neural Networks
2-2 Fusion techniques used in [1]
2-3 3D convolution operator
2-4 Two-stream architecture for Human Action Recognition in [2]
3-1 General framework for human action recognition
3-2 Early fusion method by concatenating two L2-normalized feature vectors
3-3 Late fusion by averaging class scores
4-1 The class labels in the UCF101 dataset
4-2 The class labels in the HMDB51 dataset
4-3 Experiment steps for each dataset
4-4 The steps using C3D for the experiments
4-5 C3D clip and video prediction
4-6 Confusion matrix of the two-stream network on UCF101
4-7 Confusion matrix of the two-stream network on HMDB51
4-8 Confusion matrix of the two-stream network on CMDFALL
4-9 In HMDB51, the most confused action in the RGB stream is swing baseball; 60% of its videos are confused with throw
4-10 Classes in UCF101 that benefit most from the combination, compared to the RGB stream
4-11 Classes in HMDB51 that benefit most from the combination, compared to the RGB stream
4-12 Classes in CMDFALL that benefit most from the combination, compared to the RGB stream
4-13 Classes of UCF101 in which the RGB stream performs better
4-14 Classes of UCF101 in which the Flow stream performs better
4-15 Classes of HMDB51 in which the RGB stream performs better
4-16 Classes of HMDB51 in which the Flow stream performs better
4-17 Classes of CMDFALL in which the RGB stream performs better
4-18 Classes of CMDFALL in which the Flow stream performs better
List of Abbreviations

3DCNN  3D Convolutional Neural Networks
CNN    Convolutional Neural Networks
HAR    Human Action Recognition
HOG    Histogram of Oriented Gradients
MBH    Motion Boundary Histograms
SIFT   Scale-Invariant Feature Transform
List of Tables

2.1 Results of fusion techniques on the 200,000 videos of the Sport1M test set. Hit@k indicates the fraction of test samples that contain at least one of the ground-truth labels in the top k predictions [1]
2.2 C3D results on different tasks
2.3 Two-stream architecture mean accuracy (%) on the UCF101 and HMDB51 datasets
4.1 Class tree of the CMDFALL dataset
4.2 Accuracy of action recognition on single and multiple stream C3D (%)
4.3 Comparison results on two popular benchmark datasets (%)
Chapter 1

Introduction to Human Action Recognition
1.1 Human Action Recognition problem
Human action recognition is an important topic in the computer vision domain. It has many applications, such as surveillance systems in hospitals, abnormal activity detection in buildings (banks, airports, hotels), or human machine interaction. There are various types of human activities. Depending on their complexity, we can categorize human activities into four different levels: gestures, actions, interactions, and group activities.
∙ Gestures are elementary movements of a person's body parts, and are the atomic components describing the meaningful motion of a person. Examples: "stretching an arm", "raising a leg".

∙ Actions are single-person activities that may be composed of multiple gestures organized temporally, such as "walking", "waving", and "punching".

∙ Interactions are human activities that involve two or more persons and/or objects. For example, "two persons fighting" is an interaction between two humans, while "drinking water" is an interaction between a human and an object.

∙ Group activities are the activities performed by conceptual groups composed of multiple persons and/or objects. Example: "a group of persons marching".
Figure 1-1: Human Action Recognition Problem
In this thesis, we focus on human action recognition. The problem of human action recognition can be defined as follows:
∙ Input: A video or a sequence of consecutive frames that contain a human action.
∙ Output: Label of the action, which belongs to one of the predefined classes.
Human action recognition is challenging for researchers in the computer vision domain because of noisy backgrounds, viewpoint changes, and the variety in how each person performs an action. Figure 1-1 illustrates the human action recognition problem.
Key components of a visual recognition system
Figure 1-2 illustrates the two phases of a recognition system:
∙ Training: Learning from the training dataset to obtain the parameters of the recognition model.

∙ Recognition: Using the model learned in the training phase to recognize new data.
Each phase of the system comprises the following main components:
∙ Data preprocessing: Convert the data to a form that is compatible with the model.

Figure 1-2: Human Action Recognition phases
∙ Feature extraction: From the preprocessed data, extract suitable features for representing the human action. The features can be obtained by hand-crafted or deep learning techniques.

∙ Classification: Use the features extracted in the previous step as input for training or prediction.

∙ Recognition: New data goes through the preprocessing and feature extraction steps, and the trained classifier then predicts its label (a minimal sketch of this pipeline follows the list).
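To make these components concrete, the following is a minimal sketch of the two-phase pipeline. The names preprocess() and extract_features() are placeholders, and scikit-learn's SVC stands in for a generic trainable classifier; none of these come from the thesis itself.

```python
# Hypothetical sketch of the two-phase recognition pipeline described above.
import numpy as np
from sklearn.svm import SVC

def preprocess(video: np.ndarray) -> np.ndarray:
    # e.g., normalize pixel values to [0, 1]
    return video.astype(np.float32) / 255.0

def extract_features(video: np.ndarray) -> np.ndarray:
    # stand-in for a hand-crafted or learned descriptor
    return video.mean(axis=(0, 1, 2))  # toy per-channel statistic

def train(videos, labels):
    # Training phase: learn classifier parameters from labeled videos.
    X = np.stack([extract_features(preprocess(v)) for v in videos])
    return SVC().fit(X, labels)

def recognize(clf, video):
    # Recognition phase: same preprocessing/feature steps, then predict.
    return clf.predict(extract_features(preprocess(video))[None, :])[0]
```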
Figure 1-3: Hand-crafted feature based method for Human Action Recognition
1.2 Overview of human action recognition approaches
1.2.1 Hand-crafted feature based methods
In this approach, human actions are represented by features that are manually designed by highly experienced researchers. Once features are extracted, they are input to a generic trainable classifier for action recognition. The building blocks of the hand-crafted feature-based approach are illustrated in Figure 1-3:
∙ Feature extraction: Takes an image or video pixels as input and outputs the features for that image or video.

∙ Classification: A classifier that takes the features as input and outputs a class label.
There are many types of hand-crafted features designed by experts to solve the human action recognition problem. Many classical image features have been generalized to videos, e.g., 3D-SIFT and HOG3D. Among local space-time features, dense trajectories have been shown to perform best on a variety of datasets. The main idea is to densely sample feature points in each frame and track them through the video based on optical flow. Multiple descriptors are computed along the trajectories of the feature points to capture shape, appearance and motion information. Motion boundary histograms (MBH) give the best results among these descriptors. The idea of dense trajectories was extended by the work of Wang and Schmid [3] to account for camera motion, achieving state-of-the-art performance among hand-crafted features. Despite its good performance, this method is computationally intensive.
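As a rough illustration of the dense-trajectory idea (dense sampling plus flow-based tracking), the sketch below propagates grid-sampled points with dense optical flow. OpenCV's Farneback flow stands in for the flow algorithm of the original work, and the descriptor computation along the tracks (HOG/HOF/MBH) is omitted.

```python
# Minimal sketch of dense point tracking via optical flow (not the full method).
import cv2
import numpy as np

def track_dense_points(frames, step=8):
    """frames: list of grayscale uint8 images; returns trajectories (T, N, 2)."""
    h, w = frames[0].shape
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    points = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    trajectories = [points.copy()]
    for prev, curr in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # move each point by the flow vector at its (rounded) location
        xi = np.clip(points[:, 0].round().astype(int), 0, w - 1)
        yi = np.clip(points[:, 1].round().astype(int), 0, h - 1)
        points = points + flow[yi, xi]
        trajectories.append(points.copy())
    return np.stack(trajectories)
```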
Figure 1-4: Deep learning method for Human Action Recognition problem
1.2.2 Deep learning based methods
On the other hand, a learning-based representation approach, specifically deep learning, uses computational models with multiple processing layers to learn multiple levels of abstraction from data. This approach encompasses a set of methods that enable the machine to process data in raw form and automatically transform it into a representation suitable for classification. This is what we call trainable feature extractors. The transformation process is handled at different layers. These layers are learned from raw data using a general-purpose learning procedure and do not need to be designed manually by experts. The performance of human action recognition methods mainly depends on an appropriate and efficient representation of the data.
Recently, deep learning has achieved very good results on image-based tasks [4]. These results inspire researchers to extend it to video classification, especially to solve the human action recognition problem. To deal with video input, the authors in [1] use a 2D CNN on individual frames and explore the temporal information by fusing information over the temporal dimension through the network. In [5], [6], the authors use the 3D convolution operator to learn the temporal information. In [2], the authors decompose video into spatial and temporal parts. Deep learning methods require a large amount of training data to achieve good results. In [1], the authors construct a large-scale dataset named Sport1M, which consists of 1 million videos downloaded from YouTube annotated with 487 classes. Features learned from this dataset are generic enough to transfer to other datasets such as UCF101 [7].
1.2.3 Purpose of thesis

Instead of using only one RGB stream, we deploy both streams (RGB and optical flow). Each stream goes through an independent C3D network; the streams are then combined at the fully-connected or score level. We evaluate the proposed method on popular challenging benchmark datasets (UCF101 and HMDB51) and on a dataset built by MICA (CMDFALL), and show how the two-stream C3D outperforms the original one-stream C3D.

The thesis is organized as follows. In Chapter 2, we present the state of the art on Human Action Recognition using CNNs. In Chapter 3, we describe our proposed method using a 3D convolutional neural network for action recognition with a two-stream architecture. In Chapter 4, we report the results on UCF101, HMDB51 and CMDFALL, and analyse them. Chapter 5 concludes and gives ideas for future work.
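Based on the figure captions (Figures 3-2 and 3-3), the two fusion options can be sketched as follows. The variable names are illustrative assumptions: early fusion concatenates L2-normalized fully-connected features, and late fusion averages per-class scores.

```python
# A sketch of the two fusion options, assuming each C3D stream exposes a
# fully-connected feature vector and per-class softmax scores.
import numpy as np

def early_fusion(rgb_fc: np.ndarray, flow_fc: np.ndarray) -> np.ndarray:
    """Concatenate the two L2-normalized fc feature vectors (cf. Figure 3-2);
    the result is fed to a classifier such as a linear SVM."""
    rgb_fc = rgb_fc / np.linalg.norm(rgb_fc)
    flow_fc = flow_fc / np.linalg.norm(flow_fc)
    return np.concatenate([rgb_fc, flow_fc])

def late_fusion(rgb_scores: np.ndarray, flow_scores: np.ndarray) -> int:
    """Average the class scores of the two streams (cf. Figure 3-3)
    and predict the top class."""
    return int(np.argmax((rgb_scores + flow_scores) / 2.0))
```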
Chapter 2
State-of-the-art on HAR using CNN
2.1 Introduction to Convolutional Neural Networks
Convolutional Neural Networks (CNN) are biologically-inspired variants of Multilayer Perceptrons. They have been very effective in areas such as image recognition and classification. There are four main types of layers used to build ConvNet architectures: the Convolutional layer, the Non-Linearity layer, the Pooling layer, and the Fully-Connected layer. We stack these layers to form a full ConvNet architecture.
Figure 2-1: Main layers in Convolutional Neural Networks
Convolutional layer
The Conv layer is the core building block of a Convolutional Network. The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter in the first layer of a ConvNet might have size 5x5x3 (5 pixels in width and height, and 3 is the number of channels of an RGB image). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at every position. As we slide the filter over the width and height of the input volume, we produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network learns filters that activate when they see some type of visual feature, such as an edge of some orientation or a blotch of some color on the first layer. Each CONV layer has an entire set of filters, and each of them produces a separate 2-dimensional activation map. We stack these activation maps along the depth dimension to produce the output volume.
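A minimal PyTorch illustration of this behaviour, with an assumed filter count of 16: each 5×5×3 filter slides over a 3-channel image and produces its own 2-dimensional activation map, and the maps are stacked along the depth dimension.

```python
# 16 learnable 5x5x3 filters over a 3-channel image (filter count assumed).
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)
image = torch.randn(1, 3, 32, 32)   # batch of one 32x32 RGB image
activations = conv(image)
print(activations.shape)            # torch.Size([1, 16, 28, 28])
```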
Non-Linearity layer (ReLU)
An additional operation called ReLU is used after every convolution operation. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by Output = max(0, Input). ReLU is an element-wise operation (applied per pixel) that replaces all negative values in the feature map by zero. The purpose of ReLU is to introduce non-linearity into our ConvNet, since most of the real-world data we would want our ConvNet to learn is non-linear. Other non-linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.
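A one-line check of the element-wise definition above:

```python
# ReLU applied element-wise: negative values become zero.
import torch

x = torch.tensor([-1.5, 0.0, 2.3])
print(torch.relu(x))  # tensor([0.0000, 0.0000, 2.3000])
```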
Pooling layer
The Pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice of the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation in this case takes a max over 4 numbers (a little 2x2 region in some depth slice). The depth dimension remains unchanged.
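The shape arithmetic can be verified directly: a 2×2, stride-2 max pooling halves width and height while leaving the depth unchanged.

```python
# 2x2 / stride-2 max pooling: spatial size halves, depth stays the same.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 16, 28, 28)
print(pool(x).shape)  # torch.Size([1, 16, 14, 14]) -- depth (16) unchanged
```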
Fully-Connected layer
Neurons in a fully-connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.
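Stacking the four layer types of this section gives a complete (toy) ConvNet; the layer sizes below are illustrative assumptions, not an architecture from the thesis.

```python
# Minimal ConvNet combining the four layer types described above.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),   # convolutional layer
    nn.ReLU(),                         # non-linearity layer
    nn.MaxPool2d(2, 2),                # pooling layer
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),       # fully-connected layer: Wx + b
)
logits = model(torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```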
In this thesis, we focus on presenting related work on action recognition using CNN techniques. We categorize the methods into three groups: methods based on 2D convolutional neural networks, methods based on 3D convolutional neural networks, and methods using multiple streams.
2.2 2D Convolutional Neural Networks
Figure 2-2: Fusion techniques used in [1]
Recently, 2D convnets have obtained very good results on image-based tasks [4]. Encouraged by these results, the authors in [1] study multiple approaches for extending CNNs to video input. As a baseline, they use a 2D CNN model operating on single frames to evaluate the contribution of static appearance to the classification accuracy. To learn the information that lies in the temporal domain, and to study how it influences the performance, they use the fusion techniques shown in Figure 2-2:
∙ Early fusion: from the sequence of frames, they take T consecutive frames as input and extend the first-layer filters to size 11×11×3×T (see the sketch after this list). In this paper they use T=10, which is approximately a third of a second. This technique combines the information immediately at the pixel level and allows the network to learn the local motion information of the video.

∙ Late fusion: they use 2 separate CNN towers, each taking a single frame as input. The two frames are chosen 15 frames apart in the video. The temporal information is then combined at the first fully-connected layer, which is a high-level abstraction. This model can learn the global motion information in the video.

∙ Slow fusion: this model is a balanced mix between the two approaches, slowly combining the temporal information across the network. The lower layers process local temporal information while higher layers have access to global temporal information. In this paper, the first convolutional layer applies to every T=4 consecutive frames of an input clip of 10 frames with stride 2. The second and third layers above process with temporal extent T=2 and stride 2. Thus, the third convolutional layer has access to information across all 10 frames.
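A sketch of the early-fusion variant referenced in the first item above: treating the T stacked RGB frames as 3T input channels is one simple way to realize 11×11×3×T first-layer filters. The filter count (96), input resolution (170×170) and stride (3) follow [1] but should be read as assumptions here.

```python
# Early fusion: T consecutive frames stacked channel-wise, so the first
# convolution sees motion directly at the pixel level.
import torch
import torch.nn as nn

T = 10                                  # ~ a third of a second at 30 fps
clip = torch.randn(1, 3 * T, 170, 170)  # T RGB frames stacked channel-wise
early_conv = nn.Conv2d(in_channels=3 * T, out_channels=96,
                       kernel_size=11, stride=3)
print(early_conv(clip).shape)           # torch.Size([1, 96, 54, 54])
```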
They have conducted experiments on the large-scale Sport1M dataset, which consists of 1 million videos downloaded from YouTube annotated with 487 classes. The results in Table 2.1 show that the slow fusion model performs best.
Table 2.1: Results of fusion techniques on the 200,000 videos of the Sport1M test set. Hit@k indicates the fraction of test samples that contain at least one of the ground-truth labels in the top k predictions [1].

Model   Clip Hit@1   Video Hit@1   Video Hit@5
2.3 3D Convolutional Neural Networks
In [5], [6], the authors extend the 2D convolution operator in the temporal dimension for video analysis tasks. They propose to perform 3D convolutions in the convolution steps of CNNs to compute features from both the spatial and the temporal dimension. The 3D convolution is achieved by convolving a 3D kernel with the fixed-size cube formed by stacking multiple contiguous frames together, as shown in Figure 2-3. By this construction, the feature maps in the convolution layer are connected to multiple contiguous frames in the previous layer, thereby capturing motion information.
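A minimal example of the operator in Figure 2-3; the 3×3×3 kernel and the 16-frame 112×112 clip size follow the usual C3D setting, though they are incidental to the illustration.

```python
# A 3D convolution over a stack of contiguous frames: the kernel spans the
# temporal dimension too, so each output voxel sees several frames.
import torch
import torch.nn as nn

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)
clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, H, W)
print(conv3d(clip).shape)                 # torch.Size([1, 64, 16, 112, 112])
```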
In [5], the experiments are performed on the TRECVID 2008 data and the KTH data. The TRECVID 2008 dataset consists of 49 hours of real-world video data captured at London Gatwick Airport. The KTH dataset consists of 6 action classes performed by 25 subjects. The input in the TRECVID 2008 experiments is a 7-frame cube, while for the KTH dataset it is a 9-frame cube. The results show that the 3D convolutional networks outperform the 2D CNN by a noticeable margin.
Figure 2-3: 3D convolution operator
In [6], the authors proposed a 3D convolutional network called C3D to learn spatio-temporal features on the large-scale Sport1M dataset. They show that C3D has great learning capacity, captures the information well, and can process a large number of videos. They have trained C3D on the large-scale I380K and Sport1M datasets. The trained model can then be used as a feature extractor on other datasets. They prove that the 3D CNN architecture effectively learns features from video by conducting experiments on different tasks: Activity Recognition, Action Similarity Labeling, and Scene and Object Recognition. Table 2.2 shows the results of using C3D on these tasks. C3D outperforms most previous methods by a noticeable margin. Thus, C3D is very generic in capturing appearance and motion information in videos.
Table 2.2: C3D results on different tasks

Dataset   Sport1M   UCF101   ASLAN   YUPENN   UMD    Object
Method    [8]       [9]      [10]    [11]     [11]   [12]
2.4 Multistream Convolutional Neural Networks
In [2], the authors decompose videos into spatial and temporal components by using RGB and optical flow. These components are then fed into separate ConvNets to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream performs video recognition on its own, and for the final classification the softmax scores are combined by late fusion.
∙ The spatial stream operates on individual video frames, performing action recognition from still images.

∙ The temporal stream operates on the motion information of the video, in the form of stacked optical flow displacements between several consecutive frames.
For the spatial stream, the input to the network is a randomly selected frame from the video. A 224×224 sub-image is randomly cropped from the selected frame; it then undergoes random horizontal flipping and RGB jittering. For the temporal stream, they study several techniques to form the input:
∙ Optical flow stacking: the optical flow is computed by Brox's method. By stacking the horizontal and vertical flow components of L consecutive frames, they create an input volume of size 224×224×2L for the network (see the sketch after this list).

∙ Trajectory stacking: an alternative motion representation, inspired by trajectory-based descriptors, replaces the optical flow.

∙ Bi-directional optical flow: the optical flow in the above techniques is forward flow. In the bi-directional method, they stack L/2 forward flows computed from the L/2 frames following the current frame and L/2 backward flows from the L/2 frames before the current frame.

∙ Mean flow subtraction: to zero-center the input to the network, they subtract from each displacement field its mean vector.
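A sketch of optical flow stacking with mean flow subtraction, as referenced in the first item of the list. OpenCV's Farneback flow is used as a stand-in for Brox's method, and the function itself is an illustrative assumption.

```python
# Stack L consecutive flow fields channel-wise into a 2L-channel input.
import cv2
import numpy as np

def flow_stack(frames, L=10):
    """frames: list of L+1 grayscale uint8 images of size 224x224.
    Returns an array of shape (224, 224, 2*L)."""
    channels = []
    for prev, curr in zip(frames[:L], frames[1:L + 1]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        channels.append(flow[..., 0])  # horizontal displacement
        channels.append(flow[..., 1])  # vertical displacement
    stack = np.stack(channels, axis=-1)
    # mean flow subtraction: zero-center each displacement field
    return stack - stack.mean(axis=(0, 1), keepdims=True)
```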
Figure 2-4: Two-stream architecture for Human Action Recognition in [2]
They report that mean flow subtraction is helpful, as it reduces the effect of global motion between frames. The bi-directional optical flow input performs best for the temporal stream alone. However, for ConvNet fusion, uni-directional optical flow with multi-task learning is the most beneficial. The results show that combining multiple streams of information significantly improves performance (by 6% over the temporal net and 14% over the spatial net). This means that the information in the RGB and optical flow images is complementary.
Model                                    UCF101   HMDB51
Two-stream model (fusion by averaging)   86.9     58.0
Two-stream model (fusion by SVM)         88.0     59.4

Table 2.3: Two-stream architecture mean accuracy (%) on the UCF101 and HMDB51 datasets
In the work of [13], they build upon the architecture of [2] and study fusion methods for the two networks in both the spatial and the temporal dimension.

They study the following spatial fusion techniques:
∙ Sum fusion: compute the sum of the two feature maps at the same spatial location (both operations are sketched below).

∙ Max fusion: take the maximum over the two feature maps.
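The two operations can be stated in a few lines; the feature map shapes below are assumed for illustration.

```python
# Sum and max fusion of feature maps from the two streams at the same layer.
import torch

rgb_map = torch.randn(1, 64, 14, 14)    # feature map from the spatial stream
flow_map = torch.randn(1, 64, 14, 14)   # feature map from the temporal stream

sum_fused = rgb_map + flow_map                 # sum fusion
max_fused = torch.maximum(rgb_map, flow_map)   # max fusion
print(sum_fused.shape, max_fused.shape)
```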