
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Tuan Dung LE

IMPROVING MULTI-VIEW HUMAN ACTION RECOGNITION WITH SPATIAL-TEMPORAL POOLING AND VIEW SHIFTING TECHNIQUES

Speciality: Information System

MASTER OF SCIENCE THESIS IN INFORMATION SYSTEM
2017-2018

SUPERVISOR: Dr. Thi Oanh NGUYEN

Hanoi – 2018

ACKNOWLEDGEMENT

First of all, I sincerely thank the teachers of the School of Information and Communication Technology, as well as all the teachers at the Hanoi University of Science and Technology, who have taught me valuable knowledge and experience during the past years. I would like to thank my two supervisors, Dr. Nguyen Thi Oanh, lecturer at the School of Information and Communication Technology, Hanoi University of Science and Technology, and Dr. Tran Thi Thanh Hai, MICA Research Institute, who have guided me in completing this master thesis. I have learned a lot from them, not only knowledge in the field of computer vision but also working and studying skills such as writing papers, preparing slides and presenting to an audience. Finally, I would like to send my thanks to my family, my friends and the people who have always supported me during the study and research for this thesis.

Hanoi, March 2018
Master student
Tuan Dung LE

TABLE OF CONTENT

ACKNOWLEDGEMENT
TABLE OF CONTENT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS ......... 9
INTRODUCTION ......... 10
CHAPTER 1. HUMAN ACTION RECOGNITION APPROACHES ......... 12
1.1 Overview ......... 12
1.2 Baseline method: combination of multiple 2D views in the Bag-of-Words model ......... 20
CHAPTER 2. PROPOSED FRAMEWORK ......... 24
2.1 General framework ......... 24
2.2 Combination of spatial/temporal information and Bag-of-Words model ......... 25
2.2.1 Combination of spatial information and Bag-of-Words model (S-BoW) ......... 25
2.2.2 Combination of temporal information and Bag-of-Words model (T-BoW) ......... 26
2.3 View shifting technique ......... 27
CHAPTER 3. EXPERIMENTS ......... 30
3.1 Setup environment ......... 30
3.2 Setup ......... 30
3.3 Datasets ......... 30
3.3.1 Western Virginia University Multi-view Action Recognition Dataset (WVU) ......... 30
3.3.2 Northwestern-UCLA Multiview Action 3D (N-UCLA) ......... 32
3.4 Performance measurement ......... 33
3.5 Experiment results ......... 35
3.5.1 WVU dataset ......... 35
3.5.2 N-UCLA dataset ......... 40
CONCLUSION & FUTURE WORK ......... 43
REFERENCES ......... 44
APPENDIX ......... 47

LIST OF FIGURES

Figure 1.1 a) human body in frame, b) binary silhouettes, c) 3D human pose (visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human body model, g) cylindrical/ellipsoid human body model [1] ......... 14
Figure 1.2 Construction of the HOG-HOF descriptive vector based on the SSM matrix [6] ......... 16
Figure 1.3 a) Original video of a walking action with viewpoints 0° and 45°, their volumes and silhouettes, b) epipolar geometry in the case of extracted actor body silhouettes, c) epipolar geometry in the case of a dynamic scene with a dynamic actor and a static background, without extracting silhouettes [9] ......... 16
Figure 1.4 MHI (middle row) and MEI (last row) templates [15] ......... 18
Figure 1.5 Illustration of spatio-temporal interest points detected in a video of a person clapping [16] ......... 21
Figure 1.6 Three ways to combine multiple 2D views information in the BoW model [11] ......... 21
Figure 2.1 Proposed framework ......... 24
Figure 2.2 Dividing the space domain based on bounding box and centroid ......... 26
Figure 2.3 Illustration of the T-BoW model ......... 27
Figure 2.4 Illustration of view shifting in the testing phase ......... 28
Figure 3.1 Illustration of the 12 action classes in the WVU Multi-view actions dataset ......... 31
Figure 3.2 Cameras setup for capturing the WVU dataset ......... 31
Figure 3.3 Illustration of the 10 action classes in the N-UCLA Multiview Action 3D dataset ......... 32
Figure 3.4 Cameras setup for capturing the N-UCLA dataset ......... 33
Figure 3.5 Illustration of a confusion matrix ......... 35
Figure 3.6 Confusion matrices: a) basic BoW model with codebook D3, accuracy 70.83%; b) S-BoW model with 4 spatial parts, codebook D3, accuracy 82.41% ......... 37
Figure 3.7 Confusion matrices: a) S-BoW model with 6 spatial parts, codebook D3, accuracy 78.24%; b) S-BoW model with 6 spatial parts and view shifting, codebook D3, accuracy 96.67% ......... 38
Figure 3.8 Confusion matrices: a) basic BoW model, codebook D3, accuracy 59.57%; b) S-BoW model with 6 spatial parts, codebook D3, accuracy 63.40% ......... 41
Figure 3.9 Illustration of view shifting on the N-UCLA dataset ......... 42

LIST OF TABLES

Table 3.1 Accuracy (%) of the basic BoW model on the WVU dataset ......... 36
Table 3.2 Accuracy (%) of the T-BoW model on the WVU dataset ......... 36
Table 3.3 Accuracy (%) of the S-BoW model on the WVU dataset ......... 38
Table 3.4 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the WVU dataset ......... 39
Table 3.5 Comparison with other methods on the WVU dataset ......... 39
Table 3.6 Accuracy (%) of the basic BoW model on the N-UCLA dataset ......... 40
Table 3.7 Accuracy (%) of the T-BoW model on the N-UCLA dataset ......... 40
Table 3.8 Accuracy (%) of the S-BoW model on the N-UCLA dataset ......... 41
Table 3.9 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the N-UCLA dataset ......... 42

LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS

1. MHI: Motion History Image
2. MEI: Motion Energy Image
3. LMEI: Localized Motion Energy Image
4. STIP: Spatio-Temporal Interest Point
5. SSM: Self-Similarities Matrix
6. HOG: Histogram of Oriented Gradients
7. HOF: Histogram of Optical Flow
8. IXMAS: INRIA Xmas Motion Acquisition Sequences
9. BoW: Bag-of-Words
10. ROIs: Regions of Interest

INTRODUCTION

In the transition from the 3.0 era (automation of information technology and electronic production) to the new 4.0 era (a new convergence of technologies such as the Internet of Things, collaborative robots, 3D printing and cloud computing, and the emergence of new business models), the automatic collection and processing of information by computers has become essential. This leads to higher demands on the interaction between humans and machines, both in precision and in speed. Thus, problems such as object recognition, motion recognition and speech recognition are attracting a lot of interest from scientists and companies around the world.

Nowadays, video data is easily generated by devices such as digital cameras, laptops and mobile phones, and by video-sharing websites. Human action recognition in video contributes to the automated exploitation of this rich data source. Typical applications of human action recognition include the following.

Security and surveillance: traditional monitoring systems consist of networks of cameras that are monitored by humans. As the number of cameras increases and these systems are deployed in multiple locations, it becomes difficult for a supervisor to cover the entire system with the required efficiency and accuracy. The task of computer vision is to find a solution that can replace or assist the supervisor; automatic recognition of abnormalities from surveillance systems is a topic that attracts a lot of research.

Human-machine interaction: enhancing the interaction between humans and machines is still challenging, and visual cues are the most important means of non-verbal communication. Effectively exploiting gesture-based communication will create a more accurate and natural human-computer interaction. A typical application in this field is the "smart home", which responds intelligently to the gestures and actions of the user. However, these applications are still incomplete and continue to attract research.

In addition, human action recognition is also applied in a number of other areas, such as robotics, content-based video analysis, content-based video compression and retrieval, video indexing, and virtual reality games.
Figure 3.4 Cameras setup for capturing the N-UCLA dataset

3.2 Setup Environment and Parameters

o Programming language: Python
o Libraries: numpy, scikit-learn, OpenCV
o Tools: STIP code by Ivan Laptev, ffmpeg
o Operating system: Ubuntu 16.04
o Device: 8 GB RAM, Intel Core i5 CPU 2.60 GHz

Extract STIP features

We extract the STIP features of each video and save them to a corresponding file. We use the STIP code version 2.0 for Linux provided by Ivan Laptev (Appendix 1). An extracted file consists of STIP feature vectors, one feature per line.

Create codebook by Random Forest model

Because G. Burghouts et al. [5] did not provide their source code, we implemented the proposed method with our own code, using a random strategy for selecting the negative samples of each random forest. We train one Random Forest model for each action class i. The training data are:

o 1000 positive STIP feature vectors (HOG/HOF descriptors), picked randomly from the STIP files of action i;
o 1000 negative STIP feature vectors, picked randomly from the STIP files of the other actions.

The number 1000 was chosen after analyzing the number of local STIP feature vectors of particular videos and actions. If we cannot collect 1000 feature vectors for an action, we take the maximum available number of feature vectors. The more positive and negative vectors are used, the more distinct the structure of the Random Forest model becomes, which helps when clustering the data; however, if this number is too large, it may hurt the performance of the algorithm. Each training run produces a different Random Forest model because of the randomness of the Random Forest algorithm and the random choice of positive/negative feature vectors. Therefore, we train three different codebooks D1, D2 and D3. With each codebook, we sequentially experiment with the test configurations: adding temporal information, adding spatial information and applying the view shifting technique. The final accuracy is obtained by combining the results on the three codebooks D1, D2 and D3.

The Random Forest models are trained with the following parameters of the scikit-learn library:

o max_depth = 5: at most 2^5 = 32 leaves per tree
o n_estimators = 10: number of trees in the forest
o max_features = 'auto': maximum number of features used to find the best split of a node

After each training run, the Random Forest models are saved. Each leaf of a Random Forest model represents a visual word (a cluster). For each training video, we pass the STIP features of the video through the learned forests and obtain the leaf positions that the features fall into. Finally, we obtain a 320-bin normalized histogram describing the video.
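The codebook construction and video encoding described above can be sketched with scikit-learn as follows. This is only a minimal illustration, not the thesis code: the function names, the data loading and the per-tree leaf re-indexing are assumptions, while the forest parameters (10 trees of depth 5, hence at most 10 x 32 = 320 leaves) follow the settings listed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_codebook(pos_descriptors, neg_descriptors):
    """Train the Random Forest codebook of one action class from (up to)
    1000 positive and 1000 negative HOG/HOF STIP descriptors."""
    X = np.vstack([pos_descriptors, neg_descriptors])
    y = np.hstack([np.ones(len(pos_descriptors)), np.zeros(len(neg_descriptors))])
    forest = RandomForestClassifier(n_estimators=10, max_depth=5,
                                    max_features="sqrt")  # 'auto' in the thesis, i.e. sqrt
    forest.fit(X, y)
    return forest

def encode_video(forest, video_descriptors, leaves_per_tree=32):
    """Drop every STIP descriptor of a video down the 10 trees, count which
    leaf it reaches, and return a 10 x 32 = 320-bin normalized histogram."""
    node_ids = forest.apply(video_descriptors)        # shape (n_points, n_trees)
    hist = np.zeros(forest.n_estimators * leaves_per_tree)
    for t, tree in enumerate(forest.estimators_):
        # apply() returns raw node ids; map them to 0..n_leaves-1 within each tree
        leaf_nodes = np.where(tree.tree_.children_left == -1)[0]
        leaf_index = {node: k for k, node in enumerate(leaf_nodes)}
        for node in node_ids[:, t]:
            hist[t * leaves_per_tree + leaf_index[node]] += 1
    return hist / max(hist.sum(), 1.0)
```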
Classify by using Support Vector Machine

Using the codebook generated from the Random Forest model of action i, all training videos are represented by their visual-word histograms. We then use these descriptors to train a binary SVM for action i (one-vs-all strategy). The SVMs are trained with the following parameters of the scikit-learn library:

o kernel = chi2_kernel: use the chi-square (χ2) kernel
o class_weight = 'balanced': this option is used because the data is unbalanced between positive and negative samples
o C = 1: penalty parameter of the regularization term
o probability = True: allow computing the posterior probability

The output of the binary SVM corresponding to action i is the posterior probability of action class i.
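A minimal sketch of this one-vs-all stage is given below. Only the kernel, C, class_weight and probability settings come from the list above; the function names and the assumption that the test video is encoded once per action-specific codebook are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_action_svm(train_histograms, labels, action_id):
    """Binary one-vs-all SVM for one action class, trained on the 320-bin
    BoW histograms produced with the codebook of that action."""
    y = (labels == action_id).astype(int)             # 1 = action i, 0 = all other actions
    svm = SVC(kernel=chi2_kernel, C=1.0,
              class_weight="balanced", probability=True)
    svm.fit(train_histograms, y)
    return svm

def predict_action(svms, per_codebook_histograms):
    """Each action has its own codebook, so the test video is encoded once per
    codebook; score each encoding with the matching SVM and keep the class
    with the highest posterior probability."""
    probs = [svm.predict_proba(h.reshape(1, -1))[0, 1]
             for svm, h in zip(svms, per_codebook_histograms)]
    return int(np.argmax(probs)), probs
```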
3.3 Performance measurement

In order to evaluate our proposed framework, we use the prediction accuracy and the confusion matrix.

Accuracy: this value is calculated as the ratio of the number of samples which are correctly predicted to the number of all samples used for the assessment.

Confusion matrix: a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class. The name stems from the fact that it makes it easy to see whether the system is confusing two classes (i.e. commonly mislabelling one as another).

Figure 3.5 Illustration of a confusion matrix

3.4 Experiment results

3.4.1 WVU dataset

First of all, we conducted experiments with the baseline method [1]; the accuracy obtained with the three codebooks is shown in Table 3.1. Unexpectedly, we obtain a poor performance on the WVU dataset, at 68.36% prediction accuracy with the baseline method. After analyzing the WVU dataset, we found two main reasons why the accuracy was lower than expected. The first reason is the difference between training and testing samples that we have mentioned in Section 3.3.1. In addition, the WVU dataset has several actions that are extremely hard to discriminate because of the similarity of their extracted STIP features. We note that the STIP extractor captures the points which have a large intensity variation in both dimensions (spatial and temporal); at these points, the trajectory of the motion changes suddenly. For example, clapping, waving one hand and waving two hands have quite similar trajectories of the moving arm, which can lead to confusion between these action classes. The confusion matrix (Fig. 3.6a) shows two action groups which are hard to discriminate with the baseline BoW model: the first group contains clapping, waving one hand and waving two hands; the second group contains punching, kicking, throwing and bowling.

Codebook      D1       D2       D3       Average
Accuracy      68.05    66.20    70.83    68.36

Table 3.1 Accuracy (%) of the basic BoW model on the WVU dataset

Table 3.2 shows the results obtained with the T-BoW model when dividing the temporal domain into 2 and 3 bins. We observe slight improvements of performance, by 3.09% and 1.39% in accuracy, respectively. In our opinion, when the action durations in the dataset are short, we cannot expect a big performance improvement. Moreover, when using more parts in the time domain, the action descriptor can become less robust due to intra-class variation. This causes a decrease of performance if the number of temporal parts is too high.

T-BoW             D1       D2       D3       Average
t = 1 (baseline)  68.05    66.20    70.83    68.36
t = 2             71.76    68.98    73.61    71.45
t = 3             68.52    68.05    72.68    69.75

Table 3.2 Accuracy (%) of the T-BoW model on the WVU dataset

Figure 3.6 Confusion matrices: a) basic BoW model with codebook D3, accuracy 70.83%; b) S-BoW model with 4 spatial parts, codebook D3, accuracy 82.41%

For the S-BoW model, we first test the S-BoW with two parameters: s = 3 and s = 4. The results in Table 3.3 show a significant increase in accuracy of 12.97%, from 68.36% to 81.33% (with s = 4). The bounding box is only divided into spatial bins along the height of the subject. This method helps discriminate the STIP features that are detected at different moving regions. Fig. 3.7a demonstrates that dividing the bounding box into spatial bins gives a good discrimination between clapping and waving hands, and between kicking and bowling. The confusion problem still exists between the classes waving one hand and waving two hands, and between punching, throwing and bowling. When we divide the human body spatially with s = 6 (Fig. 3.7b), the accuracy is not as good as we expected: this division helps discriminate waving one hand from waving two hands, but it also causes confusion between other pairs of actions because of the arbitrary views of the test samples.

S-BoW             D1       D2       D3       Average
s = 1 (baseline)  68.05    66.20    70.83    68.36
s = 3             76.39    77.78    79.16    77.78
s = 4             81.02    80.56    82.41    81.33
s = 6             78.71    78.71    78.24    78.55

Table 3.3 Accuracy (%) of the S-BoW model on the WVU dataset
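The spatial pooling of the S-BoW model can be illustrated with the sketch below. It is a schematic reconstruction of the description above (the subject's bounding box is split into s bins along its height and one BoW histogram is built per bin); the function signature and the way STIP coordinates and bounding boxes are stored are assumptions, not the thesis code.

```python
import numpy as np

def sbow_descriptor(forest_encode, stip_points, stip_descriptors, bbox, s=4):
    """Split the subject's bounding box into s bins along its height and
    concatenate one 320-bin BoW histogram per spatial bin (S-BoW).

    stip_points      : (n, 2) array of (x, y) positions of the STIP features
    stip_descriptors : (n, d) array of the matching HOG/HOF descriptors
    bbox             : (x_min, y_min, x_max, y_max) of the person
    forest_encode    : maps a set of descriptors to a 320-bin histogram
    """
    _, y_min, _, y_max = bbox
    height = max(y_max - y_min, 1e-6)
    # Spatial bin of each STIP point, from its relative height inside the box
    bin_ids = np.clip(((stip_points[:, 1] - y_min) / height * s).astype(int), 0, s - 1)
    parts = []
    for b in range(s):
        in_bin = stip_descriptors[bin_ids == b]
        if len(in_bin) == 0:
            parts.append(np.zeros(320))       # empty bin, empty histogram
        else:
            parts.append(forest_encode(in_bin))
    return np.concatenate(parts)              # length s * 320
```

With the encode_video sketch given earlier, forest_encode could be, for instance, lambda d: encode_video(forest, d).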
Figure 3.7 Confusion matrices: a) S-BoW model with 6 spatial parts, codebook D3, accuracy 78.24%; b) S-BoW model with 6 spatial parts and view shifting, codebook D3, accuracy 96.67%

The view shifting technique is evaluated in our third experiment. The results are summarized in Table 3.4, with the corresponding confusion matrices in Fig. 3.7. We found that this technique is extremely appropriate for the WVU dataset because it finds the most similar orientation between the testing sample and the training samples. We can see significant improvements of performance when using the view shifting technique, even when no spatial information is taken into consideration. Without additional spatial information (s = 1), we obtain an increase of 13.58% in comparison with the baseline (Table 3.1). With additional spatial information, the accuracy reaches 92.28% with 6 spatial bins and 90.71% with 4 spatial bins.

Codebook    D1 (w/o)   D1 (w)   D2 (w/o)   D2 (w)   D3 (w/o)   D3 (w)   Average (w/o)   Average (w)
s = 1       68.05      82.87    66.20      81.02    70.83      81.94    68.36           81.94
s = 4       81.02      90.91    80.56      91.41    82.41      89.82    81.33           90.71
s = 6       78.71      90.74    78.71      92.59    78.24      93.52    78.55           92.28

Table 3.4 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the WVU dataset

We now compare our method with state-of-the-art methods on the WVU dataset (Table 3.5). In the best case, we achieve 92.28% accuracy when using spatial pooling with 6 bins and applying the view shifting technique in the testing phase. This result shows that our proposed method outperforms the other state-of-the-art methods (including a method using an advanced convolutional neural network).

(*) Note that because our method depends on extracting STIP features, it cannot perform effectively when the subject is almost static. The standing-still class is therefore recognized based on the number of extracted STIP features.

Methods             Accuracy (%)
HOG-SVM [20]        82.00
LMEI [20]           90.00
ConvNet LSTM [21]   90.00
Baseline* [11]      68.36
Our*                92.28

Table 3.5 Comparison with other methods on the WVU dataset

3.4.2 N-UCLA dataset

Similar to the WVU dataset, we first conducted experiments with the baseline method and obtained the results displayed in Table 3.6. The average accuracy over the three codebooks is 57.81%. After analyzing the confusion matrix (Fig. 3.8a) and reviewing the N-UCLA dataset, we found two main reasons why the accuracy was quite low. The first reason is that this dataset contains a group of similar actions: pick up with one hand, pick up with two hands and drop trash. The second reason is that this dataset is not accurately segmented, so a large set of action classes contain samples in which people are walking for almost the whole duration of the video (e.g. pick up with one hand, pick up with two hands, drop trash, carry and walk around). In this case, both the local features and the BoW model become less robust.

Codebook      D1       D2       D3       Average
Accuracy      59.09    54.78    59.57    57.81

Table 3.6 Accuracy (%) of the basic BoW model on the N-UCLA dataset

We sequentially apply the T-BoW model, the S-BoW model, and the combination of the S-BoW model with the view shifting technique. Table 3.7 shows the results obtained with the T-BoW model when dividing the temporal domain into 2 and 3 bins; we observe slight improvements of performance of approximately 1.5%.

Codebook          D1       D2       D3       Average
t = 1 (baseline)  59.09    54.78    59.57    57.81
t = 2             59.09    56.22    59.09    58.13
t = 3             60.77    56.70    60.29    59.25

Table 3.7 Accuracy (%) of the T-BoW model on the N-UCLA dataset

For the S-BoW model, Table 3.8 shows that the best case is obtained by dividing the bounding box of the subject into 6 spatial parts, reaching 63.24% accuracy (about 5.5% improvement compared to the basic BoW model). After comparing the two confusion matrices (Fig. 3.8a and 3.8b), we see that the S-BoW model with 6 spatial parts helps reduce some confusion cases between action classes but does not completely solve any particular pair.

Codebook          D1       D2       D3       Average
s = 1 (baseline)  59.09    54.78    59.57    57.81
s = 3             61.48    62.68    63.16    62.44
s = 4             61.72    62.20    63.88    62.60
s = 6             63.40    62.91    63.40    63.24

Table 3.8 Accuracy (%) of the S-BoW model on the N-UCLA dataset

Figure 3.8 Confusion matrices: a) basic BoW model, codebook D3, accuracy 59.57%; b) S-BoW model with 6 spatial parts, codebook D3, accuracy 63.40%

The final experiment on the N-UCLA dataset is the combination of the S-BoW model and the view shifting technique. The results in Table 3.9 show that the accuracy is lower when combining the two techniques than when using only the S-BoW model. This reduction can be explained by a characteristic of the N-UCLA dataset: it provides only three viewpoints, covering a total angle of 90° around the subject's performance, so shifting views causes more confusion. To demonstrate this claim, assume that we have three sequences from three fixed viewpoints S1, S2, S3 (Fig. 3.9) and that these sequences are classified by the corresponding classifiers SVM1, SVM2, SVM3. After shifting the views by one step, sequences S2 and S3 are classified by the classifiers SVM1 and SVM2, respectively. We would need a new sequence S4, from a new viewpoint, to be classified by classifier SVM3, but such a sequence is not available in this dataset. In this case, view shifting uses sequence S1 instead of the non-existent sequence S4, which may lead to inaccurate predictions.

Codebook    D1 (w/o)   D1 (w)   D2 (w/o)   D2 (w)   D3 (w/o)   D3 (w)   Average (w/o)   Average (w)
s = 1       59.09      53.11    54.78      53.34    59.57      58.37    57.81           54.94
s = 4       61.72      59.09    62.20      58.13    63.88      58.37    62.60           58.53
s = 6       63.40      62.68    62.91      61.48    63.40      57.66    63.24           60.61

Table 3.9 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the N-UCLA dataset

Figure 3.9 Illustration of view shifting on the N-UCLA dataset
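As a rough illustration of the view shifting mechanism discussed above, the sketch below tries every cyclic assignment of the test views to the per-view classifiers and keeps the most confident configuration. It is not the thesis code: fusing the per-view scores by averaging and selecting the shift with the highest posterior are assumptions, and the modulo wrap-around corresponds exactly to the substitution of S1 for the missing S4 described in the text.

```python
import numpy as np

def predict_with_view_shifting(view_classifiers, view_descriptors, score_fn):
    """Try all cyclic shifts of the test views against the per-view classifiers.

    view_classifiers : per-view classifiers [SVM_1, ..., SVM_V]
    view_descriptors : per-view descriptors [S_1, ..., S_V] of one test action
    score_fn         : score_fn(clf, descriptor) -> vector of class probabilities
    """
    n_views = len(view_classifiers)
    best_label, best_confidence = None, -np.inf
    for shift in range(n_views):
        # After shifting by `shift`, view v is scored by the classifier of view (v - shift)
        probs = [score_fn(view_classifiers[(v - shift) % n_views], view_descriptors[v])
                 for v in range(n_views)]
        fused = np.mean(probs, axis=0)        # assumed fusion: average over views
        if fused.max() > best_confidence:
            best_confidence = float(fused.max())
            best_label = int(np.argmax(fused))
    return best_label
```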
CONCLUSION & FUTURE WORK

In this thesis, we have proposed two improvements for multi-view human action recognition. The first improvement was a spatial-temporal pooling technique to distinguish small differences between two actions in both the spatial and temporal dimensions. The second one was a view shifting strategy in the testing phase that tests all configurations of views to find the best result. We have tested our method on two public benchmark multi-view datasets of human actions, and the impact of each improvement has been evaluated and analyzed. The results indicate that spatial pooling usually gives a promising improvement. Temporal pooling improves the results slightly, but it could be dropped if we want to speed up the system. When the camera configuration is approximately uniformly distributed, applying view shifting at test time is very effective. In the future, we intend to evaluate the proposed method on other, larger multi-view datasets and to study the impact of the number of available cameras on the system performance.

REFERENCES

1. Iosifidis, Alexandros, Anastasios Tefas, and Ioannis Pitas. "Multi-view human action recognition: A survey." Ninth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IEEE, 2013.
2. D. Weinland, R. Ronfard, and E. Boyer. "Free viewpoint action recognition using motion history volumes." CVIU, vol. 104, no. 2, pp. 249-257, 2006.
3. M. Holte, T. Moeslund, N. Nikolaidis, and I. Pitas. "3D human action recognition for multi-view camera systems." 3D Imaging, Modeling, Processing, Visualization and Transmission, 2011.
4. S. Y. Cheng and M. M. Trivedi. "Articulated human body pose inference from voxel data using a kinematically constrained Gaussian mixture model." Computer Vision and Pattern Recognition Workshops, 2007.
5. I. Mikic, M. M. Trivedi, E. Hunter, and P. Cosman. "Human body model acquisition and tracking using voxel data." International Journal of Computer Vision, vol. 53, no. 3, pp. 199-223, 2003.
6. Junejo, I. N., Dexter, E., Laptev, I., and Perez, P. "View-independent action recognition from temporal self-similarities." IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 172-185, 2011.
7. Lewandowski, Michał, Dimitrios Makris, and Jean-Christophe Nebel. "View and style-independent action manifolds for human activity recognition." European Conference on Computer Vision 2010 (2010): 547-560.
8. Shen, Yuping, and Hassan Foroosh. "View-invariant action recognition using fundamental ratios." IEEE Conference on Computer Vision and Pattern Recognition, 2008.
9. Gondal, Iqbal, and Manzur Murshed. "On dynamic scene geometry for view-invariant action matching." IEEE Conference on Computer Vision and Pattern Recognition, 2011.
10. Sargano, Allah Bux, Plamen Angelov, and Zulfiqar Habib. "Human Action Recognition from Multiple Views Based on View-Invariant Feature Descriptor Using Support Vector Machines." Applied Sciences 6.10 (2016): 309.
11. Burghouts, Gertjan, et al. "Improved action recognition by combining multiple 2D views in the bag-of-words model." 10th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2013.
12. F. Zhu, L. Shao and M. Lin. "Multi-view action recognition using local similarity random forests and sensor fusion." Pattern Recognition Letters, vol. 24, pp. 20-24, 2013.
13. A. Iosifidis, A. Tefas and I. Pitas. "Multi-view action recognition under occlusion based on fuzzy distances and neural networks." European Signal Processing Conference, 2012.
14. Herath, Samitha, Mehrtash Harandi, and Fatih Porikli. "Going deeper into action recognition: A survey." Image and Vision Computing 60 (2017): 4-21.
15. Bobick, Aaron F., and James W. Davis. "The recognition of human movement using temporal templates." IEEE Transactions on Pattern Analysis and Machine Intelligence 23.3 (2001): 257-267.
16. Laptev, Ivan, and Tony Lindeberg. "Space-time interest points." 9th International Conference on Computer Vision, Nice, France, IEEE, 2003.
17. P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. "Behavior recognition via sparse spatio-temporal features." Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
18. G. Willems, T. Tuytelaars, and L. Van Gool. "An efficient dense and scale-invariant spatio-temporal interest point detector." European Conference on Computer Vision, 2008.
19. Wang, Heng, et al. "Evaluation of local spatio-temporal features for action recognition." BMVC 2009 - British Machine Vision Conference, BMVA Press, 2009.
20. Kavi, Rahul, and Vinod Kulathumani. "Real-time recognition of action sequences using a distributed video sensor network." Journal of Sensor and Actuator Networks 2.3 (2013): 486-508.
21. Kavi, Rahul, et al. "Multiview fusion for activity recognition using deep neural networks." Journal of Electronic Imaging 25.4 (2016): 043010.
22. Gertjan J. Burghouts, Klamer Schutte, Henri Bouma, and Richard J. M. den Hollander. "Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos." Machine Vision and Applications 25 (2014): 85-98.
23. Moosmann, Frank, Bill Triggs, and Frederic Jurie. "Fast discriminative visual codebooks using randomized clustering forests." Advances in Neural Information Processing Systems, 2007.
24. Shukla, Parul, Kanad Kishore Biswas, and Prem K. Kalra. "Action Recognition using Temporal Bag-of-Words from Depth Maps." Machine Vision and Applications, 2013.
25. Ullah, Muhammad Muneeb, Sobhan Naderi Parizi, and Ivan Laptev. "Improving bag-of-features action recognition with non-local cues." British Machine Vision Conference, vol. 10, 2010.

APPENDIX

1. INSTALLING AND USING THE STIP CODE TO EXTRACT STIP FEATURES

Step 1: Download the STIP source code version 2.0 for Linux: https://www.di.ens.fr/~laptev/download.html#stip
Step 2: Extract the zip file into a folder.
Step 3: Write a batch.sh file (this file can be generated by a small script) that extracts the STIP features of each video and saves them in txt format. Each line extracts the STIP features of one particular video:

    stipdet -f <input video file> -o <output file> -vis no
    ...

Step 4: Go to the STIP code folder, open a terminal and run the prepared batch file.
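The batch file of Step 3 can be generated with a few lines of Python, for example as below. This is only an illustrative sketch: the folder names, the .avi extension and the location of the stipdet binary are assumptions; only the stipdet command line itself comes from the appendix.

```python
import os

VIDEO_DIR = "videos"              # assumed folder with the input videos
OUTPUT_DIR = "stip_features"      # assumed folder for the extracted feature files
os.makedirs(OUTPUT_DIR, exist_ok=True)

with open("batch.sh", "w") as batch:
    batch.write("#!/bin/bash\n")
    for name in sorted(os.listdir(VIDEO_DIR)):
        if not name.endswith(".avi"):
            continue
        video_path = os.path.join(VIDEO_DIR, name)
        out_path = os.path.join(OUTPUT_DIR, os.path.splitext(name)[0] + ".txt")
        # One stipdet call per video, as in Step 3 of the appendix
        batch.write(f"stipdet -f {video_path} -o {out_path} -vis no\n")
```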
