Combination of Appearance and Motion Information in Human Action Representation Using Convolutional Neural Networks


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Khong Van Minh (Khổng Văn Minh)

COMBINATION OF APPEARANCE AND MOTION INFORMATION IN HUMAN ACTION REPRESENTATION USING CONVOLUTIONAL NEURAL NETWORK

Field of study: Information Systems
Master's thesis of science in Information Systems
Supervisor: Dr. Tran Thi Thanh Hai
Hanoi, 2018

SĐH.QT9.BM11
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Author: Khong Van Minh
Thesis title: Combination of appearance and motion information in human action representation using convolutional neural networks (Vietnamese: Kết hợp đặc trưng diện mạo và chuyển động trong biểu diễn hoạt động của người sử dụng mạng nơ ron tích chập)
Field of study: Information Systems
Student ID: CBC17021

The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting held on … , with the following contents:
…

Date: … / … / …
Supervisor          CHAIR OF THE COMMITTEE          Author

Abstract

In this thesis, I focus on solving the action recognition problem in video, i.e., in a stack of consecutive frames. This problem plays an important role in surveillance systems, which are very popular nowadays. There are two main approaches to this problem: using hand-crafted features or using features learned with deep learning. Both approaches have pros and cons, and the solution that I study belongs to the second category. Recently, advanced techniques relying on convolutional neural networks have produced impressive improvements compared to traditional techniques based on hand-crafted features. Besides, the literature also shows that using different streams of data helps to increase recognition performance. This thesis proposes a method that exploits both RGB and optical flow for human action recognition. Specifically, we deploy a two-stream convolutional neural network that takes as inputs RGB frames and optical flow computed from the RGB stream. Each stream has the architecture of an existing 3D convolutional neural network (C3D), which has been shown to be compact but efficient for the task of action recognition from video. Each stream works independently; the two streams are then combined by early fusion or late fusion to output the recognition result. We show that the proposed two-stream 3D convolutional neural network (two-stream C3D) outperforms the one-stream C3D on three benchmark datasets: UCF101 (from 82.79% to 89.11%), HMDB51 (from 45.71% to 60.87%), and CMDFALL (from 65.35% to 71.77%).

Acknowledgments

Firstly, I would like to express my deep gratitude to my supervisor, Dr. Tran Thi Thanh Hai, for supporting my research direction, which allowed me to explore new ideas in the field of computer vision and machine learning. I would like to thank her for her supervision, encouragement, motivation, and support; her guidance helped me throughout the research work and the writing of this thesis. I would like to acknowledge the International Research Institute MICA, HUST, for providing me with a great research environment. I wish to express my gratitude to the teachers of the Computer Vision department, MICA, for giving me the opportunity to work and acquire great research experience. I would like to acknowledge the School of Information and Communication Technology for providing me with knowledge and the opportunity to study. I would like to thank my friends for supporting me in my studies. Last but not least, I would like to convey my deepest gratitude to my family for their support and sacrifices during my studies.
Contents

1 Introduction to Human Action Recognition
  1.1 Human Action Recognition problem
  1.2 Overview of human action recognition approaches
    1.2.1 Hand-crafted feature based methods
    1.2.2 Deep learning based methods
    1.2.3 Purpose of the thesis
2 State of the art on HAR using CNN
  2.1 Introduction to Convolutional Neural Networks
  2.2 2D Convolutional Neural Networks
  2.3 3D Convolutional Neural Networks
  2.4 Multi-stream Convolutional Neural Networks
3 Proposed method for HAR using multi-stream C3D
  3.1 General framework
  3.2 RGB stream
  3.3 Optical Flow stream
  3.4 Fusion of multi-stream 3D CNN
    3.4.1 Early fusion
    3.4.2 Late fusion
4 Experimental Results
  4.1 Datasets
    4.1.1 UCF101 dataset
    4.1.2 HMDB51 dataset
    4.1.3 CMDFALL dataset
  4.2 Experiment setup
  4.3 Single stream
  4.4 Multiple streams
5 Conclusion
  5.1 Pros and Cons
  5.2 Discussion

List of Figures

1-1 Human Action Recognition problem
1-2 Human Action Recognition phases
1-3 Hand-crafted feature based method for Human Action Recognition
1-4 Deep learning method for the Human Action Recognition problem
2-1 Main layers in Convolutional Neural Networks
2-2 Fusion techniques used in [1]
2-3 3D convolution operator
2-4 Two-stream architecture for Human Action Recognition in [2]
3-1 General framework for human action recognition
3-2 Early fusion by concatenating two L2-normalized feature vectors
3-3 Late fusion by averaging class scores
4-1 The class labels in the UCF101 dataset
4-2 The class labels in the HMDB51 dataset
4-3 Experiment steps for each dataset
4-4 The steps of using C3D for the experiments
4-5 C3D clip and video prediction
4-6 Confusion matrices of the two streams on UCF101
4-7 Confusion matrices of the two streams on HMDB51
4-8 Confusion matrices of the two streams on CMDFALL
4-9 In HMDB51, the most confused action in the RGB stream is swing baseball; 60% of its videos are confused with throw
4-10 Classes of UCF101 that benefit most from combining, compared to the RGB stream
4-11 Classes of HMDB51 that benefit most from combining, compared to the RGB stream
4-12 Classes of HMDB51 that benefit most from combining, compared to the RGB stream
4-13 Classes of UCF101 in which the RGB stream performs better
4-14 Classes of UCF101 in which the Flow stream performs better
4-15 Classes of HMDB51 in which the RGB stream performs better
4-16 Classes of HMDB51 in which the Flow stream performs better
4-17 Classes of CMDFALL in which the RGB stream performs better
4-18 Classes of CMDFALL in which the Flow stream performs better

Acronyms

3DCNN  3D Convolutional Neural Network
CNN    Convolutional Neural Network
HAR    Human Action Recognition
HOG    Histograms of Oriented Gradients
MBH    Motion Boundary Histograms
SIFT   Scale-Invariant Feature Transform

Figure 4-4: The steps of using C3D for the experiments.

Prepare the setting files. There are two setting files to prepare: the input list and the output prefix. In the provided example they are input_list_frm.txt, input_list_video.txt, and output_list_prefix.txt in C3D_HOME/examples/c3d_feature_extraction/prototxt. The input list file is a text file in which each line contains the information of one clip that is fed into C3D for feature extraction. Each line has the following format:

    <string_path> <starting_frame> <label>

where <label> is only used for training, testing, or fine-tuning, but NOT for extracting features, so it can be ignored here (in the provided example the labels are filled with 0s). <string_path> is either the full path and filename of the video (e.g., input/avi/v_ApplyEyeMakeup_g01_c01.avi) or the full path of the folder containing the frames of the video (e.g., input/frm/v_ApplyEyeMakeup_g01_c01/). <starting_frame> specifies the starting frame of the clip. The output prefix file specifies the locations where the extracted features are saved; each of its lines is simply an <output_prefix>. Each line in the prefix file corresponds to the line with the same index in the input list file (e.g., line i of the prefix file is the output prefix for the clip on line i of the list file). C3D saves the features as output_prefix.[feature_name] (e.g., prefix.fc6). The output prefix file is used only for feature extraction.
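For concreteness, the two setting files can be generated with a short script such as the sketch below. The directory layout, clip length, and output locations are assumptions made for this example, not values prescribed by C3D.

```python
import os

# Hypothetical inputs: one folder of extracted frames per video; the label column is unused (0).
videos = ["input/frm/v_ApplyEyeMakeup_g01_c01", "input/frm/v_Archery_g01_c01"]
clip_len = 16            # length of each clip fed to C3D
feat_dir = "output/c3d"  # where the extracted features should be written

with open("input_list_frm.txt", "w") as f_list, \
     open("output_list_prefix.txt", "w") as f_prefix:
    for video in videos:
        n_frames = len(os.listdir(video))
        # one clip every clip_len frames; starting frames assumed 1-based here
        for start in range(1, n_frames - clip_len + 2, clip_len):
            f_list.write("%s %d %d\n" % (video, start, 0))
            prefix = os.path.join(feat_dir, os.path.basename(video), "%06d" % start)
            os.makedirs(os.path.dirname(prefix), exist_ok=True)
            f_prefix.write(prefix + "\n")
```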
Training and fine-tuning. After preparing the input and setting files, the next step for training or fine-tuning is computing the volume mean from the list. This is done with the tools compute_volume_mean_from_list.bin (for frame input) or compute_volume_mean_from_list_videos.bin (for video input) in C3D:

    Usage: GLOG_logtostderr=1 compute_volume_mean_from_list input_chunk_list length height width sampling_rate output_file [dropping_rate]

Arguments:
- input_chunk_list: the same list file as used in feature extraction.
- length: the length of the clip used in training (e.g., 16).
- height, width: the frame size, e.g., 128, 171.
- sampling_rate: adjusts the frame rate of your clip (e.g., with clip length = 16 and sampling = 1, your clip is a chunk of 16 consecutive frames; with clip length = 16 and sampling rate = 2, your clip spans 32 frames, but you sample one of every 2 frames).
- output_file: the output mean file.
- dropping_rate: if your dataset is too large (e.g., 1M clips), you may want to compute the mean from a subset of your clips. Setting this to n means the dropping rate is 1:n, i.e., one clip among every n clips is used for computing the mean.

We train or fine-tune the network with train_net.bin or finetune_net.bin:

    train_net.bin solver_proto_file [resume_point_file]
    finetune_net.bin solver_proto_file pretrained_net

Feature extraction. Use the extract_image_features tool to extract features. Its arguments are as follows:

    extract_image_features.bin <feature_extraction_prototxt> <c3d_pretrained_model> <gpu_id> <mini_batch_size> <number_of_mini_batches> <output_prefix_file> <feature_name1> <feature_name2> ...

In which:
- <feature_extraction_prototxt>: the prototxt file (provided in the example) that points to your input list file.
- <c3d_pretrained_model>: the C3D pre-trained model that you downloaded.
- <gpu_id>: the GPU ID you would like to run on (starting from 0); if this is set to -1, the CPU is used instead.
- <mini_batch_size>: your mini-batch size. The default is 50, but you can modify this number depending on your GPU memory.
- <number_of_mini_batches>: the number of mini-batches from which you want to extract features. For example, if you have 100 clips to extract features from and you are using a mini-batch size of 50, this parameter should be set to 2; however, if you have 101 clips, it should be set to 3.
- <output_prefix_file>: your output prefix file.
- <feature_name1>, <feature_name2>, ...: you can list as many feature names as you want, as long as they match the names of the output blobs of the network (see the prototxt file for all layers; they look like fc6-1, fc7-1, fc8-1, pool5, conv5b, prob, ...).

For example: extract_image_features.bin model.prototxt sport1m_iter_1.9M 50 output_list_prefix.txt fc7-1 fc6-1 prob
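To make the bookkeeping concrete, the sketch below derives the number of mini-batches from the input list (the ceiling of the clip count divided by the batch size, as in the 100-clip and 101-clip examples above) and assembles the extraction command. The helper itself and the assumption that the binary is on the PATH are illustrative, not part of the C3D distribution; the file names are taken from the example above.

```python
import math
import subprocess

def build_extract_command(prototxt, model, gpu_id, batch_size,
                          list_file, prefix_file, feature_names):
    # number of mini-batches = ceil(number of clips / mini-batch size),
    # e.g. 100 clips with batch size 50 -> 2, 101 clips -> 3
    with open(list_file) as f:
        n_clips = sum(1 for line in f if line.strip())
    n_batches = math.ceil(n_clips / batch_size)
    return ["extract_image_features.bin", prototxt, model,
            str(gpu_id), str(batch_size), str(n_batches),
            prefix_file] + list(feature_names)

# Example usage with the (assumed) file names used in the text above.
cmd = build_extract_command("model.prototxt", "sport1m_iter_1.9M", 0, 50,
                            "input_list_frm.txt", "output_list_prefix.txt",
                            ["fc6-1", "fc7-1", "prob"])
subprocess.run(cmd, check=True)
```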
4.3 Single stream

For evaluation we measure clip accuracy and video accuracy. The clip score is obtained from the prob layer of the network; for each clip we take the index of the maximum score as the predicted class label. The video score is obtained by averaging the clip scores, as shown in Figure 4-5, and the class with the maximum score value is taken as the video label. We also experiment with an SVM classifier that takes the fc6 layer features as input. The two-stream score is obtained by early fusion applied to the fc6 layer or by late fusion applied to the prob layer, as discussed above.
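The scoring and fusion steps described above can be summarized in the following sketch. It assumes that clip-level fc6 features and prob scores have already been extracted into NumPy arrays; the array shapes, variable names, and the use of scikit-learn's LinearSVC are illustrative assumptions, not the exact implementation used in the thesis.

```python
import numpy as np
from sklearn.svm import LinearSVC

def video_prediction(clip_probs):
    """Average the clip-level prob scores of one video, then take the argmax (Figure 4-5)."""
    video_score = clip_probs.mean(axis=0)           # clip_probs: (num_clips, num_classes)
    return int(video_score.argmax()), video_score

def late_fusion(rgb_probs, flow_probs):
    """Late fusion: average the class scores of the two streams (Figure 3-3)."""
    fused = (rgb_probs + flow_probs) / 2.0           # (num_videos, num_classes)
    return fused.argmax(axis=1)

def early_fusion_features(rgb_fc6, flow_fc6):
    """Early fusion: L2-normalize each stream's fc6 vector and concatenate (Figure 3-2)."""
    rgb = rgb_fc6 / (np.linalg.norm(rgb_fc6, axis=1, keepdims=True) + 1e-12)
    flow = flow_fc6 / (np.linalg.norm(flow_fc6, axis=1, keepdims=True) + 1e-12)
    return np.concatenate([rgb, flow], axis=1)

# Hypothetical usage with placeholder data: a linear SVM on the early-fused features.
# In practice rgb_fc6 / flow_fc6 would hold the 4096-dimensional fc6 features per video.
rgb_fc6, flow_fc6 = np.random.rand(20, 4096), np.random.rand(20, 4096)
labels = np.random.randint(0, 5, size=20)
svm = LinearSVC()
svm.fit(early_fusion_features(rgb_fc6, flow_fc6), labels)
```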
Table 4.2: Accuracy of action recognition with single-stream and multi-stream C3D (%).

                           UCF101          HMDB51          CMDFALL
                           RGB     Flow    RGB     Flow    RGB     Flow
    Clip accuracy          80.55   62.52   49.48   38.76   64.79   55.57
    Video accuracy         82.79   75.22   45.71   47.24   65.35   59.35
    Linear SVM on fc6      82.97   77.24   50.31   47.81   69.80   65.48
    Early fusion           89.11           60.87           71.77
    Late fusion            86.30           55.42           66.83

From Table 4.2 we see that the RGB stream performs better than the Flow stream for our method in terms of clip accuracy, video accuracy, and the linear SVM on the fc6 layer. The performance of the RGB stream is close to the result reported in [6], so this result can be used for further research. The performance of the optical flow stream (O-C3D) is not as high as expected; we will try other options for the architecture and the learning techniques to boost the result in future work.

Figure 4-5: C3D clip and video prediction.

4.4 Multiple streams

The results are shown in Tables 4.2 and 4.3. After combining the two streams, the performance increases with both fusion methods. On UCF101, late fusion outperforms the RGB stream by 4.62% and the Flow stream by 13.66%, while early fusion outperforms the RGB stream by 6.14% and the Flow stream by 11.87%. On HMDB51, late fusion outperforms the RGB stream by 9.71% and the Flow stream by 8.18%, while early fusion outperforms the RGB stream by 10.56% and the Flow stream by 13.06%. On CMDFALL, late fusion outperforms the RGB stream by 1.48% and the Flow stream by 7.48%, while early fusion outperforms the RGB stream by 1.97% and the Flow stream by 6.29%.

We show the confusion matrices of the experiments in Figure 4-6 for UCF101, Figure 4-7 for HMDB51, and Figure 4-8 for CMDFALL. We observe that the distributions of confused classes are different between the two streams: for the RGB stream the confusion concentrates in a few regions, indicating that some pairs of classes are strongly confused with each other, while in the Flow stream the value for each pair is not too large.

Figure 4-6: Confusion matrices of the two streams on UCF101.
Figure 4-7: Confusion matrices of the two streams on HMDB51.
Figure 4-8: Confusion matrices of the two streams on CMDFALL.

For further detail, from the video scores in the prob layer of the networks we compute the F1-score of each class in each dataset. We use this measurement to find out which classes benefit from the two-stream architecture. In Figures 4-13, 4-14, 4-15, 4-16, 4-17, and 4-18 we show the cases where each stream performs better than the other.
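A minimal sketch of this per-class measurement is given below (the thesis does not show the exact script). It uses scikit-learn's f1_score on video-level predictions, and the data here is placeholder data standing in for the real prob-layer scores.

```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder data: y_true holds ground-truth video labels,
# *_scores are (num_videos, num_classes) video-level class scores.
num_videos, num_classes = 200, 51
y_true = np.random.randint(0, num_classes, size=num_videos)
rgb_scores = np.random.rand(num_videos, num_classes)
fused_scores = np.random.rand(num_videos, num_classes)

labels = list(range(num_classes))
f1_rgb = f1_score(y_true, rgb_scores.argmax(axis=1), labels=labels, average=None)
f1_fused = f1_score(y_true, fused_scores.argmax(axis=1), labels=labels, average=None)

# Rank classes by how much they gain from the two-stream combination compared to
# the RGB stream (the ordering behind Figures 4-10 to 4-12).
gain = f1_fused - f1_rgb
for cls in np.argsort(gain)[::-1][:10]:
    print("class %d: F1 %.3f -> %.3f" % (cls, f1_rgb[cls], f1_fused[cls]))
```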
In UCF101, most of the classes benefit from the two-stream architecture. The classes that benefit most from combining, compared with the RGB stream, are shown in Figure 4-10; BandMarching, BodyWeightSquats, and HandstandWalking are examples of high-, medium-, and low-performing classes that benefit from the two-stream architecture. In the RGB stream, the most confused pairs are BoxingPunchingBag with BoxingSpeedBag (30.6%), BrushingTeeth with ApplyEyeMakeup (30.6%), and MilitaryParade with BandMarching (30.3%): BoxingPunchingBag and BoxingSpeedBag belong to the same type of sport and are performed in a gym, BrushingTeeth and ApplyEyeMakeup take place in a room of a house, and MilitaryParade and BandMarching both show many people walking in procession. In the Flow stream, the largest confusion values fall on the pairs BrushingTeeth and ShavingBeard (47.22%) and Kayaking and Rafting (33.33%): BrushingTeeth and ShavingBeard involve only hand motion, while Kayaking and Rafting are similar types of sport.

In HMDB51, the classes in which the RGB stream performs better fall mostly into a few action groups, while the Flow stream performs better in others, such as groups 1 and 2. The classes that benefit most from combining, compared with the RGB stream, are shown in Figure 4-12. The RGB stream mostly fails on pairs of classes that are confused with each other at a high rate, such as swing baseball and throw (60%), cartwheel and flic flac (50%), and sword exercise and draw sword (50%). We found that these pairs share similar scenes: swing baseball and throw mostly come from the same baseball match, sword exercise and draw sword are performed by the same person in the same scene, and cartwheel and flic flac are two movements in gymnastics. In the temporal stream, the confusion value for each of these pairs is small: swing baseball and throw (0%), cartwheel and flic flac (17%), sword exercise and draw sword (0%). However, the temporal stream confuses them with a number of other classes, each with a small number of videos: swing baseball with fencing, golf, kick, and kick ball; cartwheel with handstand, sword, and push; sword exercise with sword, shoot bow, brush hair, and chew. In the temporal stream, the most confused pair is catch with shoot ball (33.33%), which share many football scenes with a shooter and a goalkeeper.

In CMDFALL, the classes that both streams fail to distinguish fall into the following groups: (left hand pick up, right hand pick up) with 36.5% confusion in the RGB stream and 45.9% in the Flow stream; (front fall, back fall, right fall, left fall) with 50.8% in the RGB stream and 53.7% in the Flow stream; and (sit on chair then fall left, sit on chair then fall right) with 32% in the RGB stream and 15.7% in the Flow stream. They are the same activities performed in different directions or with different hands. From Figures 4-17 and 4-18 we can see that the RGB stream performs better than the Flow stream on the fall classes, while the Flow stream performs better on the non-fall classes.

Table 4.3 shows the comparison of several methods on the two popular benchmark datasets UCF101 and HMDB51. Although our method outperforms IDT [3] (hand-crafted) on UCF101 and obtains a comparable result on HMDB51, and also outperforms [1] (2D CNN), the original two-stream network [2], LRCN [15] (multiple streams), and the recent 3D CNN architecture in [14], the two-stream VGG of [13] still has higher performance than our method.

Table 4.3: Comparison results on two popular benchmark datasets (%).

                         Method                     UCF101   HMDB51
    Hand-crafted         [3] IDT                    85.9     57.2
    2D CNN               [1] Slow fusion            65.4     –
    3D CNN               [14] Res3D (fine-tuned)    85.8     54.9
                         [15] LRCN (RGB)            68.19    –
                         [15] LRCN (Flow)           77.46    –
                         [15] LRCN (fusion)         82.66    –
    Multi-stream 2D CNN  [2] Spatial stream         73       40.5
                         [2] Temporal stream        83.7     54.6
                         [2] Two-stream (avg.)      86.9     58.0
                         [2] Two-stream (SVM)       88.0     59.4
                         [13] Two-stream VGG        92.5     65.4
    Ours                 Early fusion               89.11    60.87
                         Late fusion                86.30    55.42

Figure 4-9: In HMDB51, the most confused action in the RGB stream is swing baseball; 60% of its videos are confused with throw.
Figure 4-10: Classes of UCF101 that benefit most from combining, compared to the RGB stream.
Figure 4-11: Classes of HMDB51 that benefit most from combining, compared to the RGB stream.
Figure 4-12: Classes of HMDB51 that benefit most from combining, compared to the RGB stream.
Figure 4-13: Classes of UCF101 in which the RGB stream performs better.
Figure 4-14: Classes of UCF101 in which the Flow stream performs better.
Figure 4-15: Classes of HMDB51 in which the RGB stream performs better.
Figure 4-16: Classes of HMDB51 in which the Flow stream performs better.
Figure 4-17: Classes of CMDFALL in which the RGB stream performs better.
Figure 4-18: Classes of CMDFALL in which the Flow stream performs better.

Chapter 5. Conclusion

5.1 Pros and Cons

This thesis presented a solution to improve the performance of human action recognition. The proposed framework combines different information from the video stream (RGB, optical flow) in a two-stream 3D CNN architecture (C3D). The experiments on three benchmark datasets, UCF101, HMDB51, and CMDFALL, showed that the proposed method outperforms the original C3D (which uses only the RGB stream). In our framework, both late fusion and early fusion have been studied. In both cases, the two-stream C3D achieved better performance than the one-stream C3D, which confirms the advantage of combining the optical flow stream with the RGB stream. The performance of the early fusion method is higher than that of the late fusion method on all three datasets. The two-stream network outperformed [3], [1], [15], and [2]. Although the results are better, the training time is longer because the two stream networks need to be trained independently.

5.2 Discussion

There are some points that can be considered to improve C3D. In [16], the authors explore the effect of performing 3D convolutions over longer temporal durations at the input layer. In [17], an additional stream based on the dynamic image representation of [18] is used. For future work, we observe that averaging the clip scores to form the video score is somewhat unsatisfactory. We noted this point and conducted an experiment to improve C3D by using rank pooling [19] instead of averaging, but the result was not as good as expected, so more research is needed on this point. Furthermore, this architecture has a large number of parameters, an issue addressed by [14], which introduces Res3D, a new 3D architecture built upon ResNet that is smaller and faster than C3D.
Bibliography

[1] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[3] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[5] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
[6] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[7] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CoRR, abs/1212.0402, 2012.
[8] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '11, pages 3361–3368, Washington, DC, USA, 2011. IEEE Computer Society.
[9] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015.
[10] Xiaojiang Peng, Limin Wang, Xingxing Wang, and Yu Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. CoRR, abs/1405.4506, 2014.
[11] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Bags of spacetime energies for dynamic scene recognition. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2681–2688, June 2014.
[12] Xiaofeng Ren and M. Philipose. Egocentric recognition of handled objects: Benchmark and analysis. June 2009.
[13] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941, 2016.
[14] Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. ConvNet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.
[15] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.
[16] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. CoRR, abs/1604.04494, 2016.
[17] L. Jing, Y. Ye, X. Yang, and Y. Tian. 3D convolutional neural network with multi-model framework for action recognition. In 2017 IEEE International Conference on Image Processing (ICIP), pages 1837–1841, September 2017.
[18] Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. Dynamic image networks for action recognition. In CVPR, 2016.
[19] Basura Fernando, Efstratios Gavves, Jose Oramas, Amir Ghodrati, and Tinne Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
