Doctoral dissertation: Person re-identification in an automatic surveillance camera system


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

NGUYEN THUY BINH

PERSON RE-IDENTIFICATION IN A SURVEILLANCE CAMERA NETWORK

Major: Electronics Engineering
Code: 9520203

DOCTORAL DISSERTATION OF ELECTRONICS ENGINEERING

SUPERVISORS:
1. Assoc. Prof. Pham Ngoc Nam
2. Assoc. Prof. Le Thi Lan

Hanoi - 2020

DECLARATION OF AUTHORSHIP

I, Nguyen Thuy Binh, declare that the thesis titled "Person re-identification in a surveillance camera network" has been entirely composed by myself. I assure the following points:

- This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
- The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
- Appropriate acknowledgement has been given within this thesis where reference has been made to the published work of others.
- The thesis submitted is my own, except where work done in collaboration has been included. The collaborative contributions have been clearly indicated.

Hanoi, 24/11/2020
PhD Student

SUPERVISORS

ACKNOWLEDGEMENT

This dissertation was written during my doctoral course at the School of Electronics and Telecommunications (SET) and the International Research Institute of Multimedia, Information, Communication and Applications (MICA), Hanoi University of Science and Technology (HUST). I am grateful to all the people who have supported and encouraged me in completing this study.

First, I would like to express my sincere gratitude to my advisors, Assoc. Prof. Pham Ngoc Nam and Assoc. Prof. Le Thi Lan, for their effective guidance, their patience, their continuous support and encouragement, and their immense knowledge. I would also like to express my gratitude to Dr. Vo Le Cuong and Dr. Ha Thi Thu Lan for their help.

I would like to thank all members of the School of Electronics and Telecommunications and the MICA Institute at HUST, as well as my colleagues in the Faculty of Electrical-Electronic Engineering, University of Transport and Communications (UTC). They have always helped me in the research process and given helpful advice for overcoming my own difficulties. Moreover, attending scientific conferences has always been a great opportunity to receive many useful comments.

During my PhD course, I received much support from the Management Board of the School of Electronics and Telecommunications, the MICA Institute, and the Faculty of Electrical-Electronic Engineering. My sincere thanks to Assoc. Prof. Nguyen Huu Thanh, Dr. Nguyen Viet Son, and Assoc. Prof. Nguyen Thanh Hai, who gave me a lot of support and help. Without their precious support, it would have been impossible to conduct this research. Thanks to my employer, UTC, for all the necessary support and encouragement during my PhD journey. I am also grateful to Vietnam's Program 911 and to HUST and UTC projects for their generous financial support.

Special thanks to my family and relatives, particularly my beloved husband and our children, for their never-ending support and sacrifice.

Hanoi, 2020
Ph.D. Student

CONTENTS

DECLARATION OF AUTHORSHIP i
ACKNOWLEDGEMENT ii
CONTENTS vi
SYMBOLS vi
LIST OF TABLES x
LIST OF FIGURES xiv
INTRODUCTION
CHAPTER 1. LITERATURE REVIEW
1.1 Person ReID classifications
1.1.1 Single-shot versus Multi-shot
1.1.2 Closed-set versus Open-set person ReID 10
1.1.3 Supervised and unsupervised person ReID 10
1.2 Datasets and evaluation metrics 11
1.2.1 Datasets 11
1.2.2 Evaluation metrics 16
1.3 Feature extraction 16
1.3.1 Hand-designed features 17
1.3.2 Deep-learned features 20
1.4 Metric learning and person matching 25
1.4.1 Metric learning 25
1.4.2 Person matching 28
1.5 Fusion schemes for person ReID 29
1.6 Representative frame selection 31
1.7 Fully automated person ReID systems 33
1.8 Research on person ReID in Vietnam 34
CHAPTER 2. MULTI-SHOT PERSON RE-ID THROUGH REPRESENTATIVE FRAMES SELECTION AND TEMPORAL FEATURE POOLING 36
2.1 Introduction 36
2.2 Proposed method 36
2.2.1 Overall framework 36
2.2.2 Representative image selection 37
2.2.3 Image-level feature extraction 44
2.2.4 Temporal feature pooling 49
2.2.5 Person matching 50
2.3 Experimental results 55
2.3.1 Evaluation of representative frame extraction and temporal feature pooling schemes 55
2.3.2 Quantitative evaluation of the trade-off between the accuracy and computational time 61
2.3.3 Comparison with state-of-the-art methods 63
2.4 Conclusions and Future work 65
CHAPTER 3. PERSON RE-ID PERFORMANCE IMPROVEMENT BASED ON FUSION SCHEMES 67
3.1 Introduction 67
3.2 Fusion schemes for the first setting of person ReID 69
3.2.1 Image-to-images person ReID 69
3.2.2 Images-to-images person ReID 75
3.2.3 Obtained results on the first setting 76
3.3 Fusion schemes for the second setting of person ReID 82
3.3.1 The proposed method 82
3.3.2 Obtained results on the second setting 86
3.4 Conclusions 89
CHAPTER 4. QUANTITATIVE EVALUATION OF AN END-TO-END PERSON REID PIPELINE 91
4.1 Introduction 91
4.2 An end-to-end person ReID pipeline 92
4.2.1 Pedestrian detection 92
4.2.2 Pedestrian tracking 97
4.2.3 Person ReID 98
4.3 GOG descriptor re-implementation 99
4.3.1 Comparison of the performance of the two implementations 99
4.3.2 Analysis of the effect of GOG parameters 99
4.4 Evaluation of the performance of an end-to-end person ReID pipeline 101
4.4.1 The effect of human detection and segmentation on person ReID in single-shot scenario 102
4.4.2 The effect of human detection and segmentation on person ReID in multi-shot scenario 104
4.5 Conclusions and Future work 107
PUBLICATIONS 112
Bibliography 113

ABBREVIATIONS

No. Abbreviation Meaning
1 ACF Aggregate Channel Features
2 AIT Austrian Institute of Technology
3 AMOC Accumulative Motion Context
4 BOW Bag of Words
5 CAR Learning Compact Appearance Representation
6 CIE The International Commission on Illumination
7 CFFM Comprehensive Feature Fusion Mechanism
8 CMC Cumulative Matching Characteristic
9 CNN Convolutional Neural Network
10 CPM Convolutional Pose Machines
11 CVPDL Cross-view Projective Dictionary Learning
12 CVPR Conference on Computer Vision and Pattern Recognition
13 DDLM Discriminative Dictionary Learning Method
14 DDN Deep Decompositional Network
15 DeepSORT Deep learning Simple Online and Realtime Tracking
16 DFGP Deep Feature Guided Pooling
17 DGM Dynamic Graph Matching
18 DPM Deformable Part-Based Model
19 ECCV European Conference on Computer Vision
20 FAST 3D Fast Adaptive Spatio-Temporal 3D
21 FEP Flow Energy Profile
22 FNN Feature Fusion Network
23 FPNN Filter Pairing Neural Network
24 GOG Gaussian of Gaussian
25 GRU Gated Recurrent Unit
26 HOG Histogram of Oriented Gradients
27 HUST Hanoi University of Science and Technology
28 IBP Indian Buffet Process
29 ICCV International Conference on Computer Vision
30 ICIP International Conference on Image Processing
31 IDE ID-Discriminative Embedding
32 iLIDS-VID Imagery Library for Intelligent Detection Systems
33 ILSVRC ImageNet Large Scale Visual Recognition Competition
34 ISR Iterative Sparse Ranking
35 KCF Kernelized Correlation Filter
36 KDES Kernel DEScriptor
37 KISSME Keep It Simple and Straightforward MEtric
38 kNN k-Nearest Neighbour
39 KXQDA Kernel Cross-view Quadratic Discriminative Analysis
40 LADF Locally-Adaptive Decision Functions
41 LBP Local Binary Pattern
42 LDA Linear Discriminant Analysis
43 LDFV Local Descriptors encoded by Fisher Vector
44 LMNN Large Margin Nearest Neighbor
45 LMNN-R Large Margin Nearest Neighbor with Rejection
46 LOMO LOcal Maximal Occurrence
47 LSTM Long Short-Term Memory
48 LSTMC Long Short-Term Memory network with a Coupled gate
49 mAP mean Average Precision
50 MAPR Multimedia Analysis and Pattern Recognition
51 Mask R-CNN Mask Region-based CNN
52 MCT Multi-Camera Tracking
53 MCCNN Multi-Channel CNN
54 MCML Maximally Collapsing Metric Learning
55 MGCAM Mask-Guided Contrastive Attention Model
56 ML Machine Learning
57 MLAPG Metric Learning by Accelerated Proximal Gradient
58 MLR Metric Learning to Rank
59 MOT Multiple Object Tracking
60 MSCR Maximal Stable Color Region
61 MSVF Maximally Stable Video Frame
62 MTMCT Multi-Target Multi-Camera Tracking
63 Person ReID Person Re-Identification
64 Pedparsing Pedestrian Parsing
65 PPN Pose Prediction Network
66 PRW Person Re-identification in the Wild
67 QDA Quadratic Discriminative Analysis
68 RAiD Re-Identification Across indoor-outdoor Dataset
69 RAP Richly Annotated Pedestrian
70 ResNet Residual Neural Network
71 RHSP Recurrent High-Structured Patches
72 RKHS Reproducing Kernel Hilbert Space
73 RNN Recurrent Neural Network
74 ROIs Regions of Interest
75 SDALF Symmetry Driven Accumulation of Local Features
76 SCNCD Salient Color Names based Color Descriptor
77 SCNN Siamese Convolutional Neural Network
78 SIFT Scale-Invariant Feature Transform
79 SILTP Scale Invariant Local Ternary Pattern
80 SPD Symmetric Positive Definite
81 SMP Stepwise Metric Promotion
82 SORT Simple Online and Realtime Tracking
83 SPIC Signal Processing: Image Communication
84 SVM Support Vector Machine
85 TAPR Temporally Aligned Pooling Representation
86 TAUDL Tracklet Association Unsupervised Deep Learning
87 TCSVT Transactions on Circuits and Systems for Video Technology
88 TII Transactions on Industrial Informatics
89 TPAMI Transactions on Pattern Analysis and Machine Intelligence
90 TPDL Top-push Distance Learning
91 Two-stream MR Two-stream Multirate Recurrent Neural Network
92 UIT University of Information Technology
93 UTAL Unsupervised Tracklet Association Learning
94 VIPeR View-point Invariant Pedestrian Recognition
95 VNU-HCM Vietnam National University - Ho Chi Minh City
96 WH Weighted color Histogram
97 WHOS Weighted Histograms of Overlapping Stripes
98 WSC Weight-based Sparse Coding
99 XQDA Cross-view Quadratic Discriminative Analysis
100 YOLO You Only Look Once

... which contains the predicted object, although the ground truth is in another grid cell. Therefore, when a grid cell contains the centers of two or more bounding boxes, some of those objects will not be detected. This is the main drawback of YOLOv1, even when the number of grid cells is increased. In order to overcome this drawback, YOLOv2 uses anchor boxes with pre-defined shapes obtained in the training phase by running the k-means clustering algorithm on the ground-truth bounding boxes. Moreover, YOLOv2 introduces several other changes, such as batch normalization, grid_size = 13 × 13, box_number = 5, and image_dimension = 416 × 416. With these improvements, YOLOv2 can detect objects with any size of input image, and the mean Average Precision (mAP) increases from 63.4% for YOLOv1 to 78.6% for YOLOv2.
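The anchor shapes in YOLOv2 are thus not hand-picked: they are the cluster centers of k-means run over the width-height pairs of the training boxes, with 1 − IoU of co-centered boxes as the distance instead of Euclidean distance. A minimal sketch of that clustering step (naive initialization, assuming at least k training boxes; not the original YOLOv2 code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct BoxWH { double w, h; };

// IoU of two boxes assumed to share the same center, so only width and
// height matter: the convention used for anchor clustering.
double iou(const BoxWH& a, const BoxWH& b) {
    double inter = std::min(a.w, b.w) * std::min(a.h, b.h);
    return inter / (a.w * a.h + b.w * b.h - inter);
}

// k-means over ground-truth box shapes with d = 1 - IoU.
// Naive initialization: the first k boxes (requires boxes.size() >= k).
std::vector<BoxWH> clusterAnchors(const std::vector<BoxWH>& boxes,
                                  int k, int iters = 100) {
    std::vector<BoxWH> anchors(boxes.begin(), boxes.begin() + k);
    std::vector<int> assign(boxes.size(), 0);
    for (int it = 0; it < iters; ++it) {
        // Assignment step: nearest anchor under 1 - IoU.
        for (std::size_t i = 0; i < boxes.size(); ++i) {
            double best = 2.0;
            for (int c = 0; c < k; ++c) {
                double d = 1.0 - iou(boxes[i], anchors[c]);
                if (d < best) { best = d; assign[i] = c; }
            }
        }
        // Update step: each anchor becomes the mean shape of its cluster.
        std::vector<double> sw(k, 0.0), sh(k, 0.0);
        std::vector<int> cnt(k, 0);
        for (std::size_t i = 0; i < boxes.size(); ++i) {
            sw[assign[i]] += boxes[i].w;
            sh[assign[i]] += boxes[i].h;
            ++cnt[assign[i]];
        }
        for (int c = 0; c < k; ++c)
            if (cnt[c] > 0) anchors[c] = { sw[c] / cnt[c], sh[c] / cnt[c] };
    }
    return anchors;
}
```

With box_number = 5, as in the configuration above, this returns the five anchor shapes shared by every grid cell.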
In recent years, YOLO has been developed further with YOLOv3, which employs DarkNet as its backbone with a more complex structure for feature extraction. YOLOv3 adopts pyramid features to overcome a shortcoming of YOLOv1 and YOLOv2: their inability to maintain high accuracy when dealing with small objects in high-resolution images. This helps YOLOv3 reduce computation time while preserving precision in the detection task.

[Figure 4.5: a) An input image is divided into an S × S grid of cells, each cell predicting bounding boxes with confidence scores and a class probability map; b) the architecture of the YOLO detector [152], alternating convolution and max-pooling layers from a 448 × 448 input down to a 7 × 7 × 30 output tensor.]

Mask R-CNN detector

Mask R-CNN is an extension of Faster R-CNN [150]. An advantage of Mask R-CNN is that it simultaneously creates a bounding box and a corresponding mask for each detected object. In terms of structure, compared with Faster R-CNN, Mask R-CNN integrates an object proposal generator into the detection network, so convolutional features can be shared from the object proposal network to the detection network. This helps to significantly reduce computation cost while still maintaining a high mAP. Figure 4.6 shows the architectural difference between Faster R-CNN and Mask R-CNN. As shown in this figure, Faster R-CNN consists of two connected networks: (1) a region proposal network and (2) a feature extraction network which operates on the proposed regions and classifies the detected objects. On top of this architecture, Mask R-CNN adds a sub-network for object mask prediction in parallel with the existing sub-network for bounding box recognition. In [111], the authors claim that Mask R-CNN is simple to train and that this addition is easy to implement with only a small overhead over the Faster R-CNN network. For the training phase, Mask R-CNN is trained on the COCO dataset, a large-scale and sufficiently diverse one. Training the Mask R-CNN network takes one to two days on a single 8-GPU machine, and the resulting model can run at about 5 fps. High speed in both the training and test phases, together with the framework's flexibility and high accuracy, make it well suited to the segmentation task.

[Figure 4.6: The architecture of a) Faster R-CNN [150] and b) Mask R-CNN [111], the latter adding a RoIAlign-based mask branch alongside the class and box heads.]
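Whichever detector is used, the rest of the pipeline only consumes person ROIs (plus, for Mask R-CNN, a binary mask per box). A minimal sketch of that hand-off; the struct fields and the filtering function are illustrative assumptions, not the actual API of any of the detectors above:

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Simplified, assumed representation of one detector output. Real detector
// APIs differ; only these fields are needed downstream.
struct Detection {
    int      classId;  // dataset-dependent id of the "person" class
    float    score;    // detector confidence
    cv::Rect box;      // person ROI in full-frame coordinates
    cv::Mat  mask;     // binary mask (Mask R-CNN only); empty for ACF/YOLO
};

// Keep only confident person detections; these ROIs feed the tracking
// and ReID stages of the pipeline.
std::vector<Detection> keepPersons(const std::vector<Detection>& dets,
                                   int personClassId, float minScore) {
    std::vector<Detection> persons;
    for (const auto& d : dets)
        if (d.classId == personClassId && d.score >= minScore)
            persons.push_back(d);
    return persons;
}
```

The person class id depends on the label convention of the training set (e.g., COCO), so it is left as a parameter rather than hard-coded.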
Pedestrian parsing segmentation

Pedestrian parsing is performed by the Deep Decompositional Network (DDN) proposed by Luo et al. [112]. The purpose of this network is to parse a given pedestrian image into semantic regions such as hair, head, body, arms, and legs. In the literature, the majority of existing methods are based on template matching or Bayesian inference. In the pedestrian parsing framework, low-level visual features are mapped to the label maps of body parts through the DDN, in which complex pose variations are accurately estimated with good robustness to background clutter and occlusions. The benefit of the DDN is that it jointly figures out occluded regions and segments body parts by employing three kinds of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. Figure 4.7 shows the DDN architecture for the pedestrian parsing task. The target of the occlusion estimation layers is to produce a binary mask indicating which parts of a pedestrian are invisible, while the completion layers generate low-level features of the invisible parts based on the original features and the occlusion mask. These generated features are transformed directly into label maps through the decomposition layers. The hidden networks are pre-trained, and then the entire network is fine-tuned using stochastic gradient descent.

[Figure 4.7: DDN architecture for pedestrian parsing [112].]

4.2.2 Pedestrian tracking

While the SORT algorithm uses the IoU ratios between detected boxes as the elements of the cost matrix in data association, DeepSORT uses both motion and appearance information to compute the association metric. This metric is expressed as follows:

c_{i,j} = \lambda \, d^{(1)}(i, j) + (1 - \lambda) \, d^{(2)}(i, j)    (4.1)

where c_{i,j} is the similarity between the i-th track and the j-th bounding box detection, and d^{(1)}(i, j) and d^{(2)}(i, j) are the two metrics calculated from motion and appearance information, respectively. While d^{(1)}(i, j) is calculated as a Mahalanobis distance, d^{(2)}(i, j) is the smallest cosine distance between the i-th track and the j-th bounding box detection in the appearance space; the hyperparameter λ controls this association. This technique is investigated in detail in the study of Nguyen et al. [153], in which the combinations of DeepSORT with two state-of-the-art human detection methods (YOLO and Mask R-CNN) are considered and compared with each other. Extensive experiments were conducted on MOT17, a benchmark dataset provided in the MOT Challenge, and on the authors' own dataset, called COMVIS_MICA. Based on the obtained results, the authors give a deep analysis of the behavior of the human detection and tracking methods, in terms of both detection and tracking performance and the resource requirements of a realistic application. In that study, the authors claim that Mask R-CNN is better than YOLO in the human detection task; however, Mask R-CNN requires more resources for implementation. Relying on this suggestion, this chapter proposes to use the combination of Mask R-CNN and DeepSORT for the human detection and tracking tasks, respectively, in the fully-automated person ReID system.
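Equation (4.1) amounts to a weighted sum filled into a track-by-detection cost matrix before assignment. A sketch of that step, where the motion term d1 (squared Mahalanobis distance from the Kalman filter) and the appearance term d2 (smallest cosine distance to the track's stored embeddings) are passed in as callables whose internals are omitted here:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// d1(i, j): squared Mahalanobis distance from the Kalman motion model.
// d2(i, j): smallest cosine distance between the stored appearance
//           embeddings of track i and the embedding of detection j.
using Metric = std::function<double(int, int)>;

// Fill the association cost matrix of Eq. (4.1):
//   c(i, j) = lambda * d1(i, j) + (1 - lambda) * d2(i, j)
std::vector<std::vector<double>> costMatrix(int nTracks, int nDetections,
                                            const Metric& d1, const Metric& d2,
                                            double lambda) {
    std::vector<std::vector<double>> c(nTracks,
                                       std::vector<double>(nDetections, 0.0));
    for (int i = 0; i < nTracks; ++i)
        for (int j = 0; j < nDetections; ++j)
            c[i][j] = lambda * d1(i, j) + (1.0 - lambda) * d2(i, j);
    return c;  // handed to the Hungarian assignment step of DeepSORT
}

// Cosine distance between two L2-normalized embeddings (the basis of d2).
double cosineDistance(const std::vector<double>& a,
                      const std::vector<double>& b) {
    double dot = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) dot += a[k] * b[k];
    return 1.0 - dot;
}
```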
4.2.3 Person ReID

Person ReID is the last stage of a full person ReID pipeline; its task is to determine which gallery images describe a query pedestrian. Historically, a huge number of studies have paid attention to this problem and achieved numerous important milestones. In the previous chapters, person ReID was considered from different aspects, and several strategies were introduced to improve its performance. In Chapter 2, an effective framework based on key frame selection and temporal pooling was proposed and obtained impressive results. Additionally, three fusion schemes were presented in Chapter 3, with different combinations of both hand-designed and deep-learned features; based on this, the best combination was determined and some useful suggestions were provided for the research community. In this chapter, the author tries to integrate the proposed person ReID framework into a full pipeline, where processing rate and memory requirements must be considered.

In this chapter, the GOG descriptor and the XQDA technique are used for feature extraction and metric learning, respectively. This method outperforms a number of state-of-the-art methods for single-shot person ReID [154]. In order to handle the multi-shot problem, some works turn it into a single-shot one by applying different pooling techniques, such as max-, min-, or average-pooling; others prefer to compare two sets of feature vectors, namely set-to-set matching techniques [82, 83, 155]. In this chapter, the average-pooling technique is exploited to obtain the final signature of each person: the signature is the element-wise average of all extracted feature vectors corresponding to all instance images of a given person.
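In code, this average-pooling step is a single element-wise mean over the tracklet's feature vectors. A minimal sketch, assuming a non-empty list of equal-length vectors:

```cpp
#include <cstddef>
#include <vector>

// Collapse all per-image feature vectors of one person into a single
// signature by element-wise averaging (the multi-shot -> single-shot step).
// Assumes a non-empty list of equal-length vectors.
std::vector<float> averagePool(const std::vector<std::vector<float>>& feats) {
    std::vector<float> sig(feats.front().size(), 0.0f);
    for (const auto& f : feats)
        for (std::size_t d = 0; d < sig.size(); ++d) sig[d] += f[d];
    for (auto& v : sig) v /= static_cast<float>(feats.size());
    return sig;
}
```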
In order to bring person ReID to practical applications, the GOG descriptor is re-implemented in C++ and its optimal parameters are selected through intensive experiments. The experimental results show that the proposed implementation extracts GOG features substantially faster than the available source code while achieving remarkably high accuracy for person ReID. The C++ re-implementation of the GOG descriptor and the choice of its optimal parameters are described in more detail in the following section.

4.3 GOG descriptor re-implementation

The code for the GOG descriptor is publicly provided by the authors of [49]; however, it is implemented in Matlab. In order to apply the GOG feature in practice, we re-implement GOG extraction in C++, employing the dlib and opencv libraries. The main purpose of this section is to compare the computational speed of the two implementations: the original Matlab one and the proposed C++ one. The time consumed by each step of GOG extraction is reported, and based on the obtained results a set of optimal parameters is chosen with the aim of achieving higher performance. For comparison with the original work [49], extensive experiments are conducted on the VIPeR dataset [29]. For evaluation, this dataset is divided into two equal parts, one for the training phase and the other for the test phase.

4.3.1 Comparison of the performance of the two implementations

The two implementations are compared in terms of ReID accuracy and computational time. For ReID accuracy, Figure 4.8a) shows the CMC curves obtained by the two implementations in the Lab color space. The two curves are very close, approximately overlapping, which indicates that our implementation produces results similar to the source code provided by the authors of [49]. Concerning computational time, for one image with resolution 128 × 48, our implementation takes 0.109 s (~10 fps) while that of the authors of [49] needs 0.286 s, measured on an Intel(R) Core(TM) i5-6200U 2.3 GHz computer with 8 GB RAM-DDR3 1600 MHz. This means that the C++ re-implementation extracts GOG about 2.6 times faster than the implementation of [49]. The obtained frame rate of ~10 fps can satisfy the real-time requirement of several surveillance applications. Figure 4.8b) shows the computational time of each step of GOG feature extraction for one person image.

[Figure 4.8: a) ReID accuracy of the source code provided in [49] (rank-1: 42.28%, VIPeR, RGB) and of the re-implementation (rank-1: 41.90%); b) computation time (in s) for each step of GOG extraction on one image in C++: pixel feature 0.016, region flatten 0.018, patch flatten 0.074, region combine 0.001.]
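Per-step timings such as those in Figure 4.8b) can be collected with a plain std::chrono harness. A sketch, where the lambda bodies stand in for the actual GOG stages (pixel feature, patch flatten, region flatten, region combine), which are not reproduced here:

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Wall-clock time of one pipeline stage, in seconds.
double timeStage(const std::function<void()>& stage) {
    const auto t0 = std::chrono::steady_clock::now();
    stage();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    // The lambda bodies stand in for the actual GOG steps.
    double tPixel = timeStage([] { /* pixel feature extraction */ });
    double tPatch = timeStage([] { /* patch flatten */ });
    std::printf("pixel feature: %.3f s, patch flatten: %.3f s\n",
                tPixel, tPatch);
    return 0;
}
```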
4.3.2 Analysis of the effect of GOG parameters

Several parameters are used in GOG feature extraction. In order to evaluate their effect on person ReID, we chose two important ones: the number of regions or stripes (N) and the number of gradient bins. Figure 4.9a) shows the matching rate at rank-1 when the number of regions varies from 7 to 49; in [49], the authors fixed the number of regions at 7. It can be observed that using 13, 15, or 17 regions gives the best rank-1 ReID results; however, 15 regions outperforms the other settings at the other ranks (e.g., ranks 5, 10, 15, 20). Figure 4.9b) shows the variation of the matching rates at the important ranks 1, 5, 10, 15, and 20 when the number of gradient bins is changed from 4 to 10. As seen in this figure, the best performance is achieved with 5 gradient bins.

[Figure 4.9: a) The matching rates at rank-1 with different numbers of regions (N); b) the matching rates at ranks 1-20 for gradient bin counts from 4 to 10.]

Based on the above analysis, in the following experiments we choose 15 regions and 5 gradient bins; the other parameters are kept as in the experiments of [49].

[Figure 4.10: CMC curves on the VIPeR dataset when extracting GOG features with the optimal parameters. Rank-1 rates: 43.13% (RGB), 44.30% (Lab), 39.87% (HSV), 39.56% (nRnG), 51.04% (fusion).]

Figure 4.10 presents the CMC curves evaluated on different color spaces (RGB/Lab/HSV/nRnG) and on the fusion of these color spaces, with the optimal parameters chosen for GOG. The best result (51.04% at rank-1) is obtained when using 15 regions and fusing all color spaces. It is worth noting that the best result obtained with the authors' source code [49] and the default parameters is 49.70%; choosing the optimal parameters thus increases the accuracy at rank-1 by 1.34%.

4.4 Evaluation of the performance of an end-to-end person ReID pipeline

Building on the above results, this section presents a full person ReID pipeline and estimates the influence of the two preceding steps (human detection and segmentation) on the person ReID step. For this, extensive experiments are conducted on both a single-shot dataset (VIPeR [29]) and a multi-shot one (PRID-2011 [32]). Each dataset is split into two halves, one for the training phase and one for the test phase. This process is repeated randomly 10 times and the reported results are the averages over these runs. The experiments follow the settings introduced in [45] for VIPeR and in [58] for PRID-2011.

[Figure 4.11: CMC curves of three evaluated scenarios on the VIPeR dataset when applying the method proposed in Chapter 2. Rank-1: 51.04% manual detection, 57.25% manual segmentation, 47.15% automatic segmentation.]

4.4.1 The effect of human detection and segmentation on person ReID in single-shot scenario

As full frames are not available in the VIPeR dataset, this experiment can only evaluate the effect of person segmentation. Two methods of person image segmentation are considered: manual segmentation via an interactive segmentation tool and automatic segmentation based on the Pedparsing method. The results obtained when applying the method proposed in Chapter 2 are shown in Figure 4.11. It is clear from this figure that manual segmentation obtains the best results, with an improvement of 6.21% over manual detection. This means that background plays an important role in person ReID performance.

Figure 4.12 shows an example of segmentation and ReID results on the VIPeR dataset. In Figure 4.12a), two original images are segmented in two different manners: manual and automatic segmentation. Figures 4.12b) and c) illustrate the ReID results for two query persons in the three different cases; the first rows present the original images, while the second and last rows correspond to the manually and automatically segmented images, respectively. For the first query person (Figure 4.12b)), the true match is found immediately at the first ranks in all three cases. For the second query person (Figure 4.12c)), with the original images the true match is found at rank-3, and applying manual segmentation brings it to rank-1; however, due to the loss of information when using automatic segmentation, the true match cannot be found within the first ten ranks.

[Figure 4.12: Examples of a) segmentation and b), c) person ReID results for two different persons in the VIPeR dataset, in the three cases of original images, manually segmented images, and automatically segmented images.]
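In all of these experiments, the segmentation mask enters the ReID stage simply as background suppression: pixels outside the person mask are zeroed before feature extraction. A sketch with OpenCV, assuming a binary CV_8U mask aligned with the ROI:

```cpp
#include <opencv2/core.hpp>

// Zero out background pixels of a person ROI using a binary mask
// (non-zero = person, 0 = background), e.g. from Pedparsing or Mask R-CNN.
cv::Mat removeBackground(const cv::Mat& roi, const cv::Mat& mask) {
    cv::Mat out = cv::Mat::zeros(roi.size(), roi.type());
    roi.copyTo(out, mask);  // copies only the pixels where mask is non-zero
    return out;
}
```

This also makes the failure mode above concrete: any foreground pixels the mask misses are deleted outright, which is exactly the information loss visible in the automatic-segmentation results.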
considered detectors, obtained results indicate that ACF detector achieves a better performance than YOLO one in both cases (without/with segmentation) The matching rates at rank-1 when using ACF detector are higher by 2.47% and 4.27% compared to YOLO detector in case of without and with segmentation, respectively In addition, the effectiveness of ACF detector can achieve the performance of manual detection One remarkable point is that Mask-RCNN provides an impressive results that is competitive over manual detection This bring a hopefulness for an end-to-end person ReID pipeline to be practical 4.4.2 The effect of human detection and segmentation on person ReID in multi-shot scenario Figure 4.14 indicates the matching rates on PRID 2011 dataset when employing GOG descriptors with XQDA technique with manual detection provided by Hirzer et al [32], one of the considered automatic detection techniques and automatic segmentation using Pedparsing after the automatic detection stage It is worth noting that in order to make a comparison between automatic detection and manual detection, only the person ROIs from automatic detection whose IoU is greater than 0.4 are kept Figure 4.15 shows an example of the obtained results in the first two steps detection and segmentation and in the last step person ReID in the fully automatic ReID system In Figure 4.15a), first, human detection method is applied on full frames to extract human ROIs and then, segmentation technique is conducted on these bounding boxes Figure 4.15b) and c) show the ReID results when dealing with the two different query persons For the first person, the true matching person are found at the rank-1, 104 100 Matching rate (%) 95 90 85 80 90.56% Manual_Detection with the proposed method 91.01% Auto_Detection with the proposed method 88.76% Auto_Detection+Segmentation with the proposed method 75 10 15 20 Rank (a) Figure 4.14: CMC curves of three evaluated scenarios on PRID 2011 dataset when applying the proposed method in Chapter however, for the second person, the true matching person is not always determined at the first rank These results illustrate clearly the influence of human detection and segmentation on person ReID performance It is interesting to see that the person re-identification results in case of using automatic detection is slightly better than those of manual detection The reason is that automatic detection results very well-aligned bounding boxes while manual detection defines a larger bounding box for pedestrians Additionally, a pedestrian is represented by multiple images which help to overcome the errors generated in the human detection phase Previous studies have discussed about the effect of background on person ReID accuracy [113, 156, 157] However, the obtained results in Chapter show that when human detection is relative good, human segmentation is not required as the quality of current automatic object segmentation methods are far from perfect It can be shown in Fig 4.14, the matching rates at rank-1 are reduced by 2.25% It is an helpful recommendation for a full person ReID pipeline Finally, Table 4.1 shows the comparison of the proposed method with state of the art methods The obtained results on PRID-2011 dataset confirm the feasibility of building a fully automatic person ReID pipeline including human detection, tracking, and person ReID The matching rates in case of incorporating one of considered human detection method and person ReID framework proposed in Chapter are 91.0%, 98.4%, 99.3, and 99.9% at 
important ranks (1, 5, 10, 20) These values are much higher than all of aforementioned deep learning-based approaches [78, 34, 58, 138, 57] Moreover, according to the quantitative evaluation on the trade-off between person ReID accuracy and computation time shown in Chapter 2, total time for person re-identification is 11.5s, 20.5s, or 100.9s in case of using four key frames, frames within a cycle or all frames, respectively If a practical person ReID system is not required to return the 105 Captured full frames Automatic detection Automatic segmentation c (a) The query person Ranked lists of the gallery person (b) The query person Ranked lists of the gallery person (c) Figure 4.15: Examples for results of a)human detection and segmentation and b), c) person ReID in all three cases of using the original images, manually segmented images, automatically segmented images of two different persons in PRID-2011 dataset 106 correct matching in the first ranks, four key frames will be used in order to reduce computation time and memory requirement but ensure person ReID accuracy when considering 20 first ranks Besides, current human detection and tracking algorithms not only response real-time requirement but also ensure a high accuracy These bring the feasibility of building a fully automatic person ReID system in practical Table 4.1: Comparison of the proposed method with state of the art methods for PRID 2011 (the two best results are in bold) Methods HOG3D+DVR[102] TAPR[78] LBP-Color+LSTM[34] GOG+LSTM[58] DFCP[138] RNN[57] The proposed method with manual detection with automatic detection with automatic detection and segmentation R=1 40.0 68.6 53.6 70.4 51.6 70.0 90.6 91.0 R=5 71.1 94.4 82.9 93.4 83.1 90.0 98.4 98.4 R=10 84.5 97.4 92.8 97.6 91.0 95.0 99.2 99.3 R=20 92.2 98.9 97.9 99.3 95.5 97.0 100 99.9 88.8 98.36 99.0 99.6 4.5 Conclusions and Future work In this chapter, the author attempt to build a fully person ReID system which has three main steps including person detection, segmentation and person ReID Based on obtained results, the author can confirm that the two previous steps affect on person ReID accuracy However, the effect is much reduced thanks to the robustness of the descriptor and metric learning The obtained results allow to give two suggestions First, if automatic person detection step provide a relatively good performance, segmentation is not required This helps to improve the computational time as segmentation step is time consuming Second, multi-shot is preferred choice because this scenario considers all instances of one person Therefore, it allows to remove poor detection results if they occur in few instances However, due to limitation of research time, person tracking step has not been examined in this thesis In the future work, the influence of person tracking on the overall performance of person ReID system will be considered to give a complete recommendation for developing fully automatic surveillance systems The main results in this chapter are included in three publications: 2nd , 3th , and 7th ones 107 CONCLUSION AND FUTURE WORKS Conclusion Through this thesis, two main contributions are proposed The first contribution is an effective method for video-based person re-identification through representative frames selections and feature pooling As widely observed, in video-based person ReID each person has multiple images and the number of these images is even dozens or hundreds This cause a significant burdens on computation speed and memory requirements An 
4.5 Conclusions and Future work

In this chapter, the author attempts to build a fully automatic person ReID system with three main steps: person detection, segmentation, and person ReID. Based on the obtained results, the author can confirm that the two earlier steps affect person ReID accuracy; however, the effect is much reduced thanks to the robustness of the descriptor and the metric learning. The obtained results allow two suggestions to be given. First, if the automatic person detection step provides relatively good performance, segmentation is not required; this helps to reduce the computational time, as the segmentation step is time-consuming. Second, multi-shot is the preferred choice, because this scenario considers all instances of one person and therefore allows poor detection results to be discounted if they occur in only a few instances. However, due to the limitation of research time, the person tracking step has not been examined in this thesis. In future work, the influence of person tracking on the overall performance of a person ReID system will be considered, in order to give a complete recommendation for developing fully automatic surveillance systems.

The main results in this chapter are included in three publications: the 2nd, 3rd, and 7th ones.

CONCLUSION AND FUTURE WORKS

Conclusion

This thesis proposes two main contributions. The first contribution is an effective method for video-based person re-identification through representative frame selection and feature pooling. As widely observed, in video-based person ReID each person has multiple images, and the number of these images may reach dozens or hundreds, which places a significant burden on computation speed and memory requirements. A practical observation is that each pedestrian's trajectory may include several walking cycles. Consequently, the first step in the proposed method is to extract walking cycles, after which four key frames are selected from each cycle. From this, the feature extraction step as well as person matching are performed only on these representative frames. In order to provide an exhaustive evaluation, experiments are conducted in three different scenarios: all frames, one walking cycle, and four key frames. The matching rates at rank-1 on PRID 2011 are 77.19%, 79.10%, and 90.56% for the four key frames, one walking cycle, and all frames schemes, while those on the iLIDS-VID dataset are 41.09%, 44.14%, and 70.13%, respectively. The obtained results show that the proposed method outperforms different state-of-the-art methods, including deep learning ones. Additionally, the trade-off between person ReID accuracy and computation time is fully investigated; the obtained results indicate the advantages and drawbacks of each scheme, and different recommendations on the use of these schemes are given to the research community.

The second contribution of this thesis is the fusion schemes proposed for both settings of person ReID. In the first setting, we formulated person ReID as a classification-based information retrieval problem, where the model of person appearance is learned from the gallery images and the identity of the person of interest is determined by the probability of his/her probe image belonging to the model. Both hand-designed and deep-learned features are used in the feature extraction step, and an SVM classifier is proposed to learn the person appearance model. Three fusion schemes are proposed: early fusion, product-rule late fusion, and query-adaptive late fusion. Several experiments are conducted on the CAVIAR4REID and RAiD datasets; the obtained results prove the effectiveness of the fusion schemes, with matching rates at rank-1 of 94.44%, 99.72%, and 100% for case A and case B of CAVIAR4REID and for RAiD, respectively. In the second setting, the proposed method of the first contribution is extended by adding fusion schemes built on the product-rule and sum-rule operators. To leverage the role of each feature in the fusion schemes, the weights assigned to each of the considered features are either equal or adaptive to the content of the query person. The obtained results indicate that although GOG and ResNet are the most powerful features for person representation in person ReID, their effectiveness can still be improved by integrating them into the fusion schemes. The experiments are performed on two benchmark datasets (PRID-2011 and iLIDS-VID) with remarkable improvement: the matching rates at rank-1 are increased by up to 5.65% and 14.13% on PRID-2011 and iLIDS-VID, respectively, compared with those obtained using only a single feature.

Besides, the author also evaluates the performance of a fully automated person ReID system including person detection, tracking, and ReID. Concerning human detection, three state-of-the-art methods are employed: ACF, YOLO, and Mask R-CNN. In order to eliminate the effect of background, Pedparsing is proposed for the segmentation step; it is worth noting that Mask R-CNN performs human detection and segmentation simultaneously. In the person ReID step, two state-of-the-art methods, the GOG descriptor and XQDA, are used for feature extraction and metric learning, respectively.
Additionally, to meet the real-time requirement of a practical system, the GOG descriptor is re-implemented in C++ and its optimal parameters are chosen. Two suggestions are provided in this work. First, if the automatic person detection step provides relatively good performance, segmentation is not required; this helps to reduce the computational time, as the segmentation step is time-consuming. Second, multi-shot is the preferred choice, because this scenario considers all instances of one person and therefore allows poor detection results to be discounted if they occur in only a few instances. However, due to the limitation of research time, the person tracking step has not been examined in this thesis. In future work, the influence of person tracking on the overall performance of the person ReID system will be considered. Moreover, a fully automatic surveillance system will be deployed and evaluated.

Future works

In this thesis, different advances have been achieved for person re-identification. However, there is still a long way to go to reach the final goal, and we want to continue several lines of research based on the results of this dissertation. In this section, we summarize the selected directions we would like to pursue, divided into two categories: short-term and long-term future works.

Short term

- To evaluate the proposed methods, extensive experiments have been conducted on several benchmark datasets. However, due to the limitation in hardware resources, small and medium size datasets (e.g., the number of persons is 632 ...
