
(Doctoral dissertation) Re-identification in an automatic surveillance camera system


DOCUMENT INFORMATION

Number of pages: 143
File size: 21.24 MB
Content

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

NGUYEN THUY BINH

PERSON RE-IDENTIFICATION IN A SURVEILLANCE CAMERA NETWORK

Major: Electronics Engineering
Code: 9520203

DOCTORAL DISSERTATION OF ELECTRONICS ENGINEERING

SUPERVISORS:
1. Assoc. Prof. Pham Ngoc Nam
2. Assoc. Prof. Le Thi Lan

Hanoi - 2020

DECLARATION OF AUTHORSHIP

I, Nguyen Thuy Binh, declare that the thesis titled "Person re-identification in a surveillance camera network" has been entirely composed by myself. I assure the following points:

- This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
- The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
- Appropriate acknowledgement has been given within this thesis wherever reference has been made to the published work of others.
- The thesis submitted is my own, except where work done in collaboration has been included; the collaborative contributions have been clearly indicated.

Hanoi, 24/11/2020
PhD Student

ACKNOWLEDGEMENT

This dissertation was written during my doctoral course at the School of Electronics and Telecommunications (SET) and the International Research Institute of Multimedia, Information, Communication and Applications (MICA), Hanoi University of Science and Technology (HUST). I am grateful to all the people who have supported and encouraged me in completing this study.

First, I would like to express my sincere gratitude to my advisors, Assoc. Prof. Pham Ngoc Nam and Assoc. Prof. Le Thi Lan, for their effective guidance, their patience, their continuous support and encouragement, and their immense knowledge. I would also like to express my gratitude to Dr. Vo Le Cuong and Dr. Ha Thi Thu Lan for their help.

I would like to thank all members of the School of Electronics and Telecommunications and of MICA at HUST, as well as my colleagues in the Faculty of Electrical-Electronic Engineering, University of Transport and Communications (UTC). They have always helped me during the research process and given me helpful advice to overcome my difficulties. Moreover, attending scientific conferences has always been a great experience and a source of many useful comments.

During my PhD course, I have received much support from the Management Board of the School of Electronics and Telecommunications, the MICA Institute, and the Faculty of Electrical-Electronic Engineering. My sincere thanks go to Assoc. Prof. Nguyen Huu Thanh, Dr. Nguyen Viet Son, and Assoc. Prof. Nguyen Thanh Hai, who gave me a lot of support and help; without their precious support, it would have been impossible to conduct this research. Thanks to my employer, the University of Transport and Communications (UTC), for all the necessary support and encouragement during my PhD journey. I am also grateful to Vietnam's Program 911 and to HUST and UTC projects for their generous financial support.

Special thanks to my family and relatives, particularly my beloved husband and our children, for their never-ending support and sacrifice.

Hanoi, 2020
Ph.D. Student

CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
CONTENTS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1. LITERATURE REVIEW
  1.1 Person ReID classifications
    1.1.1 Single-shot versus Multi-shot
    1.1.2 Closed-set versus Open-set person ReID
    1.1.3 Supervised and unsupervised person ReID
  1.2 Datasets and evaluation metrics
    1.2.1 Datasets
    1.2.2 Evaluation metrics
  1.3 Feature extraction
    1.3.1 Hand-designed features
    1.3.2 Deep-learned features
  1.4 Metric learning and person matching
    1.4.1 Metric learning
    1.4.2 Person matching
  1.5 Fusion schemes for person ReID
  1.6 Representative frame selection
  1.7 Fully automated person ReID systems
  1.8 Research on person ReID in Vietnam
CHAPTER 2. MULTI-SHOT PERSON RE-ID THROUGH REPRESENTATIVE FRAMES SELECTION AND TEMPORAL FEATURE POOLING
  2.1 Introduction
  2.2 Proposed method
    2.2.1 Overall framework
    2.2.2 Representative image selection
    2.2.3 Image-level feature extraction
    2.2.4 Temporal feature pooling
    2.2.5 Person matching
  2.3 Experimental results
    2.3.1 Evaluation of representative frame extraction and temporal feature pooling schemes
    2.3.2 Quantitative evaluation of the trade-off between accuracy and computational time
    2.3.3 Comparison with state-of-the-art methods
  2.4 Conclusions and Future work
CHAPTER 3. PERSON RE-ID PERFORMANCE IMPROVEMENT BASED ON FUSION SCHEMES
  3.1 Introduction
  3.2 Fusion schemes for the first setting of person ReID
    3.2.1 Image-to-images person ReID
    3.2.2 Images-to-images person ReID
    3.2.3 Obtained results on the first setting
  3.3 Fusion schemes for the second setting of person ReID
    3.3.1 The proposed method
    3.3.2 Obtained results on the second setting
  3.4 Conclusions
CHAPTER 4. QUANTITATIVE EVALUATION OF AN END-TO-END PERSON REID PIPELINE
  4.1 Introduction
  4.2 An end-to-end person ReID pipeline
    4.2.1 Pedestrian detection
    4.2.2 Pedestrian tracking
    4.2.3 Person ReID
  4.3 GOG descriptor re-implementation
    4.3.1 Comparison of the performance of two implementations
    4.3.2 Analysis of the effect of GOG parameters
  4.4 Evaluation of the performance of an end-to-end person ReID pipeline
    4.4.1 The effect of human detection and segmentation on person ReID in the single-shot scenario
    4.4.2 The effect of human detection and segmentation on person ReID in the multi-shot scenario
  4.5 Conclusions and Future work
PUBLICATIONS
Bibliography

ABBREVIATIONS

1. ACF: Aggregate Channel Features
2. AIT: Austrian Institute of Technology
3. AMOC: Accumulative Motion Context
4. BOW: Bag of Words
5. CAR: Learning Compact Appearance Representation
6. CIE: The International Commission on Illumination
7. CFFM: Comprehensive Feature Fusion Mechanism
8. CMC: Cumulative Matching Characteristic
9. CNN: Convolutional Neural Network
10. CPM: Convolutional Pose Machines
11. CVPDL: Cross-view Projective Dictionary Learning
12. CVPR: Conference on Computer Vision and Pattern Recognition
13. DDLM: Discriminative Dictionary Learning Method
14. DDN: Deep Decompositional Network
15. DeepSORT: Deep learning Simple Online and Realtime Tracking
16. DFGP: Deep Feature Guided Pooling
17. DGM: Dynamic Graph Matching
18. DPM: Deformable Part-Based Model
19. ECCV: European Conference on Computer Vision
20. FAST 3D: Fast Adaptive Spatio-Temporal 3D
21. FEP: Flow Energy Profile
22. FNN: Feature Fusion Network
23. FPNN: Filter Pairing Neural Network
24. GOG: Gaussian of Gaussian
25. GRU: Gated Recurrent Unit
26. HOG: Histogram of Oriented Gradients
27. HUST: Hanoi University of Science and Technology
28. IBP: Indian Buffet Process
29. ICCV: International Conference on Computer Vision
30. ICIP: International Conference on Image Processing
31. IDE: ID-Discriminative Embedding
32. iLIDS-VID: Imagery Library for Intelligent Detection Systems
33. ILSVRC: ImageNet Large Scale Visual Recognition Competition
34. ISR: Iterative Sparse Ranking
35. KCF: Kernelized Correlation Filter
36. KDES: Kernel DEScriptor
37. KISSME: Keep It Simple and Straightforward MEtric
38. kNN: k-Nearest Neighbour
39. KXQDA: Kernel Cross-view Quadratic Discriminative Analysis
40. LADF: Locally-Adaptive Decision Functions
41. LBP: Local Binary Pattern
42. LDA: Linear Discriminant Analysis
43. LDFV: Local Descriptors encoded by Fisher Vector
44. LMNN: Large Margin Nearest Neighbor
45. LMNN-R: Large Margin Nearest Neighbor with Rejection
46. LOMO: LOcal Maximal Occurrence
47. LSTM: Long Short-Term Memory
48. LSTMC: Long Short-Term Memory network with a Coupled gate
49. mAP: mean Average Precision
50. MAPR: Multimedia Analysis and Pattern Recognition
51. Mask R-CNN: Mask Region-based CNN
52. MCT: Multi-Camera Tracking
53. MCCNN: Multi-Channel CNN
54. MCML: Maximally Collapsing Metric Learning
55. MGCAM: Mask-Guided Contrastive Attention Model
56. ML: Machine Learning
57. MLAPG: Metric Learning by Accelerated Proximal Gradient
58. MLR: Metric Learning to Rank
59. MOT: Multiple Object Tracking
60. MSCR: Maximal Stable Color Region
61. MSVF: Maximally Stable Video Frame
62. MTMCT: Multi-Target Multi-Camera Tracking
63. Person ReID: Person Re-Identification
64. Pedparsing: Pedestrian Parsing
65. PPN: Pose Prediction Network
66. PRW: Person Re-identification in the Wild
67. QDA: Quadratic Discriminative Analysis
68. RAiD: Re-Identification Across indoor-outdoor Dataset
69. RAP: Richly Annotated Pedestrian
70. ResNet: Residual Neural Network
71. RHSP: Recurrent High-Structured Patches
72. RKHS: Reproducing Kernel Hilbert Space
73. RNN: Recurrent Neural Network
74. ROIs: Regions of Interest
75. SDALF: Symmetry Driven Accumulation of Local Features
76. SCNCD: Salient Color Names based Color Descriptor
77. SCNN: Siamese Convolutional Neural Network
78. SIFT: Scale-Invariant Feature Transform
79. SILTP: Scale Invariant Local Ternary Pattern
80. SPD: Symmetric Positive Definite
81. SMP: Stepwise Metric Promotion
82. SORT: Simple Online and Realtime Tracking
83. SPIC: Signal Processing: Image Communication
84. SVM: Support Vector Machine
85. TAPR: Temporally Aligned Pooling Representation
86. TAUDL: Tracklet Association Unsupervised Deep Learning
87. TCSVT: Transactions on Circuits and Systems for Video Technology
88. TII: Transactions on Industrial Informatics
89. TPAMI: Transactions on Pattern Analysis and Machine Intelligence
90. TPDL: Top-push Distance Learning
91. Two-stream MR: Two-stream Multirate Recurrent Neural Network
92. UIT: University of Information Technology
93. UTAL: Unsupervised Tracklet Association Learning
94. VIPeR: View-point Invariant Pedestrian Recognition
95. VNU-HCM: Vietnam National University - Ho Chi Minh City
96. WH: Weighted color Histogram
97. WHOS: Weighted Histograms of Overlapping Stripes
98. WSC: Weight-based Sparse Coding
99. XQDA: Cross-view Quadratic Discriminative Analysis
100. YOLO: You Only Look Once

In YOLOv1, each prediction is attributed to the grid cell which contains the predicted object, although the ground truth may lie in another grid cell. Therefore, when a grid cell contains the centers of two or more bounding boxes, not all of the corresponding objects can be detected. This is the drawback of YOLOv1; in later versions, the number of grid cells is increased.
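The following is a small illustrative sketch (not code from the thesis) of the YOLOv1-style grid assignment and fixed output tensor size discussed above; the input resolution, grid size, and example box coordinates are assumptions for illustration.

```python
import numpy as np

S, B, C = 7, 2, 20            # grid size, boxes per cell, classes (common YOLOv1 settings)
IMG_W, IMG_H = 448, 448       # assumed input resolution

def responsible_cell(box, img_w=IMG_W, img_h=IMG_H, s=S):
    """Return the (row, col) grid cell responsible for a ground-truth box.

    box = (x_center, y_center, w, h) in pixels. Each object is assigned to the
    single cell containing its center, so two objects whose centers fall into
    the same cell cannot both be predicted by that cell.
    """
    x, y = box[0], box[1]
    col = min(int(x / img_w * s), s - 1)
    row = min(int(y / img_h * s), s - 1)
    return row, col

# The detector outputs one fixed-size tensor: for every cell, B boxes
# (x, y, w, h, confidence) plus one set of C class probabilities.
output_shape = (S, S, B * 5 + C)   # (7, 7, 30) with the settings above
print(output_shape)

# Two pedestrians whose centers land in the same cell collide:
print(responsible_cell((200, 200, 50, 120)), responsible_cell((215, 205, 40, 110)))
```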
In order to overcome this drawback of YOLOv1, YOLOv2 uses anchor boxes with pre-defined shapes, obtained in the training phase through a k-means clustering algorithm on the ground-truth bounding boxes. Moreover, several other changes are introduced in YOLOv2, such as batch normalization, grid_size = 13 x 13, box_number = 5, and image_dimension = 416 x 416. With these improvements, YOLOv2 can handle input images of any size, and the mean Average Precision (mAP) increases from 63.4% for YOLOv1 to 78.6% for YOLOv2. In recent years, YOLO has been further developed into YOLOv3, which employs DarkNet as a backbone with a more complex structure for feature extraction. This improved version adopts pyramid features to overcome a shortcoming of YOLOv1 and YOLOv2, namely their inability to maintain high accuracy when dealing with small objects in high-resolution images. This helps to reduce computation time while ensuring precision in the detection task.
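The anchor shapes mentioned above are derived by clustering ground-truth box dimensions; the sketch below is a simplified version of that idea using a 1 - IoU distance on (width, height) pairs. It is an illustration with toy box sizes, not the exact YOLOv2 training procedure.

```python
import numpy as np

def iou_wh(wh, anchors):
    """IoU between one (w, h) pair and each anchor shape, ignoring positions."""
    inter = np.minimum(wh[0], anchors[:, 0]) * np.minimum(wh[1], anchors[:, 1])
    union = wh[0] * wh[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100, seed=0):
    """k-means on ground-truth (w, h) pairs with distance d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    anchors = boxes_wh[rng.choice(len(boxes_wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign every ground-truth shape to its closest anchor (largest IoU)
        assign = np.array([np.argmax(iou_wh(wh, anchors)) for wh in boxes_wh])
        # move each anchor to the median shape of its cluster
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(boxes_wh[assign == j], axis=0)
    return anchors

# toy ground-truth pedestrian box sizes (w, h) in pixels
boxes = np.array([[30, 80], [35, 90], [60, 160], [55, 150], [20, 50], [90, 220]], float)
print(kmeans_anchors(boxes, k=3))
```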
Figure 4.5: a) An input image divided into 7 x 7 grid cells; b) the architecture of a YOLO detector [152].

Mask R-CNN detector. Mask R-CNN is an extension of Faster R-CNN [150]. An advantage of Mask R-CNN is that it simultaneously creates a bounding box and a corresponding mask for each detected object. In comparison with Faster R-CNN in terms of structure, Mask R-CNN integrates an object proposal generator into the detection network, so convolutional features can be shared from the object proposal network to the detection network. This helps to significantly reduce the computation cost while still maintaining a high mAP. Figure 4.6 shows the architectural difference between Faster R-CNN and Mask R-CNN. As shown in this figure, Faster R-CNN consists of two connected networks: (1) a region proposal network and (2) a feature extraction network that operates on the proposed regions and classifies the detected objects. Based on this architecture, Mask R-CNN adds a sub-network for object mask prediction in parallel with the existing sub-network for bounding box recognition. In [111], the authors claimed that Mask R-CNN is simple to train and that this addition is easy to implement with only a small overhead over the Faster R-CNN network. For the training phase, Mask R-CNN is trained on the COCO dataset, which is large-scale and sufficiently diverse. Training the Mask R-CNN network takes one to two days on a single 8-GPU machine, and the resulting model can run at about 5 fps. High speed in both the training and test phases, together with the framework's flexibility and high accuracy, makes it well suited to the segmentation task.

Figure 4.6: The architecture of a) Faster R-CNN [150] and b) Mask R-CNN [111].
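To make the "box plus mask in one pass" behaviour concrete, the sketch below uses the torchvision implementation of Mask R-CNN pretrained on COCO. This is an assumed, minimal setup for illustration rather than the configuration evaluated in this chapter; the file name is hypothetical, and newer torchvision releases use weights="DEFAULT" instead of pretrained=True.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained Mask R-CNN; label 1 is "person" in the COCO category map.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

def detect_persons(image_path, score_thr=0.7, mask_thr=0.5):
    """Return (boxes, binary_masks) for pedestrians detected in one frame."""
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]            # dict with 'boxes', 'labels', 'scores', 'masks'
    keep = (out["labels"] == 1) & (out["scores"] >= score_thr)
    boxes = out["boxes"][keep]                      # (N, 4) as x1, y1, x2, y2
    masks = out["masks"][keep, 0] >= mask_thr       # (N, H, W) boolean masks
    return boxes, masks

# boxes, masks = detect_persons("frame_000123.jpg")   # hypothetical frame
```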
Pedestrian parsing. A novel Deep Decompositional Network (DDN) was proposed by Luo et al. [112]; the purpose of this network is to parse a given pedestrian image into semantic regions, such as hair, head, body, arms, and legs. In the literature, the majority of existing methods are based on template matching or Bayesian inference. In the DDN pedestrian parsing framework, low-level visual features are mapped to the label maps of body parts through the DDN, in which complex pose variations are accurately estimated with good robustness to background clutter and occlusions. The benefit of DDN is that it jointly infers occluded regions and segments body parts by employing three kinds of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. Figure 4.7 shows the DDN architecture for the pedestrian parsing task. The target of the occlusion estimation layers is to produce a binary mask indicating which parts of a pedestrian are invisible, while the completion layers generate low-level features of the invisible parts based on the original features and the occlusion mask. These generated features are transformed directly into label maps through the decomposition layers. The hidden layers are pre-trained, and the entire network is then fine-tuned using stochastic gradient descent.

Figure 4.7: DDN architecture for pedestrian parsing [112].
4.2.2 Pedestrian tracking

While the SORT algorithm uses the IoU ratios between detected boxes as the elements of the cost matrix in data association, DeepSORT uses both motion and appearance information to compute the association metric for tracking. This metric is expressed as follows:

    c_{i,j} = λ · d^{(1)}(i, j) + (1 − λ) · d^{(2)}(i, j)        (4.1)

where c_{i,j} is the association cost between the i-th track and the j-th bounding box detection, and d^{(1)}(i, j) and d^{(2)}(i, j) are the two metrics calculated from motion and appearance information, respectively. While d^{(1)}(i, j) is calculated using the Mahalanobis distance, d^{(2)}(i, j) is the smallest cosine distance between the i-th track and the j-th bounding box detection in the appearance feature space; the hyperparameter λ controls this association. This technique is investigated in detail in the study of Nguyen et al. [153], in which the combination of DeepSORT with each of two state-of-the-art human detection methods (YOLO and Mask R-CNN) is considered and the two detectors are compared with each other.
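A schematic sketch of how Eq. (4.1) can be assembled into a cost matrix for data association is given below. The track fields (Kalman-predicted mean and covariance, stored appearance embeddings) and detection fields are assumed placeholders, and the gating used in the full DeepSORT implementation is omitted.

```python
import numpy as np

def motion_distance(track_mean, track_cov, det_state):
    """d1: squared Mahalanobis distance between a track's predicted state
    (e.g. center x, center y, aspect ratio, height) and a detection."""
    diff = det_state - track_mean
    return float(diff @ np.linalg.inv(track_cov) @ diff)

def appearance_distance(track_features, det_feature):
    """d2: smallest cosine distance between the detection's embedding and the
    embeddings stored for the track."""
    a = track_features / np.linalg.norm(track_features, axis=1, keepdims=True)
    b = det_feature / np.linalg.norm(det_feature)
    return float(1.0 - np.max(a @ b))

def association_cost(tracks, detections, lam=0.5):
    """c[i, j] = lam * d1(i, j) + (1 - lam) * d2(i, j), as in Eq. (4.1)."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            d1 = motion_distance(trk["mean"], trk["cov"], det["state"])
            d2 = appearance_distance(trk["features"], det["feature"])
            cost[i, j] = lam * d1 + (1.0 - lam) * d2
    return cost   # typically fed to the Hungarian algorithm
```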
Extensive experiments are conducted on MOT17, a benchmark dataset provided in the MOT Challenge, and on the authors' own dataset, called COMVIS_MICA. Based on the obtained results, the authors provide a deep analysis of the behavior of the human detection and tracking methods in terms of both detection and tracking performance as well as the resource requirements of a realistic application. In this study, the authors claimed that Mask R-CNN is better than YOLO for the human detection task; however, Mask R-CNN requires larger resources for implementation. Relying on this suggestion, this chapter proposes to use the combination of Mask R-CNN and DeepSORT for the human detection and tracking tasks, respectively, in the fully automated person ReID system.

4.2.3 Person ReID

Person ReID is the last stage of a full person ReID pipeline; its task is to determine which gallery images describe a query pedestrian. Historically, a huge number of studies have paid attention to this problem and achieved numerous important milestones. In the previous chapters, person ReID is considered from different aspects and several strategies are introduced to improve its performance. In Chapter 2, an effective framework based on key frame selection and temporal pooling is proposed and obtains impressive results. Additionally, three fusion schemes are presented in Chapter 3 with different combinations of both hand-designed and deep-learned features; based on this, the best combination is determined and some useful suggestions are provided for the research community. In this chapter, the author integrates the proposed person ReID framework into a full pipeline, where processing speed and memory requirements must be considered.

In this chapter, the GOG descriptor and the XQDA technique are used for feature extraction and metric learning, respectively. This method outperforms a number of state-of-the-art methods for single-shot person ReID [154]. In order to handle the multi-shot problem, some works turn the multi-shot problem into a single-shot one by applying different pooling techniques, such as max-, min-, or average-pooling, while others prefer to compare two sets of feature vectors, namely set-to-set matching techniques [82, 83, 155]. In this chapter, the average-pooling technique is exploited to obtain the final signature of each person: average-pooling takes the mean of all extracted feature vectors corresponding to all instance images of a given person. In order to bring person ReID to practical applications, the GOG descriptor is re-implemented in C++ and its optimal parameters are selected through intensive experiments. The experimental results show that the proposed approach extracts GOG features substantially faster than the available source code while achieving remarkably high accuracy for person ReID. The re-implementation of the GOG descriptor in C++ and the choice of its optimal parameters are described in more detail in the following section.
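As a concrete illustration of the average-pooling signature described in Section 4.2.3, the sketch below averages the per-frame feature vectors (e.g. GOG descriptors) of one person's tracklet into a single descriptor; the input names are hypothetical.

```python
import numpy as np

def tracklet_signature(image_features):
    """Average-pool the per-frame feature vectors of one pedestrian's tracklet.

    image_features: (num_images, feature_dim) array, one feature vector per
    detected frame of the same person. Returns a single (feature_dim,) signature.
    """
    feats = np.asarray(image_features, dtype=np.float64)
    return feats.mean(axis=0)

# query_sig = tracklet_signature(query_gog_vectors)       # hypothetical inputs
# gallery_sig = tracklet_signature(gallery_gog_vectors)
```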
4.3 GOG descriptor re-implementation

The code for the GOG descriptor is publicly provided by the authors in [49]; however, it is implemented in Matlab. In order to apply the GOG feature in practice, we re-implement GOG extraction in C++. The main purpose of this section is to compare the computational speed of the two implementations, the original Matlab code and the proposed C++ implementation, in which the dlib and OpenCV libraries are employed. The time consumed by each step of GOG extraction is reported, and based on the obtained results a set of optimal parameters is chosen with the aim of achieving higher performance. To compare with the original work [49], extensive experiments are conducted on the VIPeR dataset [29]. For evaluation, this dataset is divided into two equal parts, one used for the training phase and the other for the test phase.

4.3.1 Comparison of the performance of the two implementations

The two implementations are compared in terms of ReID accuracy and computational time. For ReID accuracy, Figure 4.8a) shows the CMC curves obtained by the two implementations in the Lab color space. The two CMC curves are very close, almost overlapping, which indicates that this implementation produces results similar to those of the source code provided by the authors in [49]. Concerning computational time, for one image at a resolution of 128 x 48 this implementation takes 0.109 s (about 10 fps), while that of the authors in [49] needs 0.286 s on a computer with an Intel(R) Core(TM) i5-6200U 2.3 GHz CPU and 8 GB of DDR3-1600 RAM. This means that the C++ re-implementation extracts GOG roughly 2.6 times faster than the implementation of [49]. The obtained frame rate of about 10 fps can satisfy the real-time requirement of several surveillance applications. Figure 4.8b) shows the computational time of each step of GOG feature extraction for one person image.

Figure 4.8: a) ReID accuracy of the source code provided in [49] and of the re-implementation, and b) computation time (in seconds) for each step of extracting the GOG feature from an image in C++.

4.3.2 Analysis of the effect of GOG parameters

Several parameters are used in GOG feature extraction. In order to evaluate their effect on person ReID, we chose two important parameters: the number of regions or stripes (N) and the number of gradient bins. Figure 4.9a) shows the matching rate at rank-1 when the number of regions varies up to 49; in [49] the authors used a fixed number of regions. It can be observed that using 13, 15, or 17 regions gives the best ReID results at rank-1, and the results with 15 regions outperform the others at the remaining ranks (e.g., ranks 5, 10, 15, 20). Figure 4.9b) indicates the variation of the matching rates at the important ranks 1, 5, 10, 15, and 20 when the number of gradient bins is changed; the best performance is achieved at one particular bin setting, which is then adopted.

Figure 4.9: The matching rates at rank-1 with different numbers of regions (N).

Based on the above analysis, in these experiments we choose 15 regions, set the number of gradient bins to its best-performing value, and keep the other parameters as in the experiments of [49].

Figure 4.10 presents CMC curves evaluated on different color spaces (RGB/Lab/HSV/nRnG) and on the fusion of these color spaces, with the optimal parameters chosen for GOG. The best result (51.04% at rank-1) is obtained when using 15 regions and fusing all color spaces. It is worth noting that the best result obtained with the source code provided by the authors [49] and the default parameters is 49.70%, which shows that choosing the optimal parameters increases the rank-1 accuracy by 1.34%.

Figure 4.10: CMC curves on the VIPeR dataset when extracting GOG features with the optimal parameters.
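The matching rates at rank-k used above and throughout this chapter are points on the CMC curve. The sketch below computes them from a probe-to-gallery distance matrix, assuming each probe identity appears exactly once in the gallery (as in the single-shot VIPeR protocol); it is an illustration, not the thesis's evaluation code.

```python
import numpy as np

def cmc(dist, probe_ids, gallery_ids, max_rank=20):
    """Cumulative Matching Characteristic from a (num_probe, num_gallery)
    distance matrix; the value at index k-1 is the matching rate at rank k."""
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for i, pid in enumerate(probe_ids):
        order = np.argsort(dist[i])                    # gallery sorted by distance
        first_hit = np.where(gallery_ids[order] == pid)[0][0]
        if first_hit < max_rank:
            hits[first_hit:] += 1
    return hits / len(probe_ids)

# curve = cmc(distance_matrix, probe_ids, gallery_ids)   # hypothetical inputs
# rank1, rank5 = curve[0], curve[4]
```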
4.4 Evaluation of the performance of an end-to-end person ReID pipeline

Building on the above results, this chapter presents a full person ReID pipeline and estimates the influence of the two preceding steps (human detection and segmentation) on the person ReID step. To this end, extensive experiments are conducted on both a single-shot dataset (VIPeR [29]) and a multi-shot one (PRID-2011 [32]). Each dataset is split into two halves, one for the training phase and one for the test phase; this split is repeated randomly 10 times and the reported results are the averages over these runs. The experiments follow the settings introduced in [45] for VIPeR and in [58] for PRID-2011.

Figure 4.11: CMC curves of the three evaluated scenarios (manual detection, automatic segmentation, and manual segmentation, each with the proposed method) on the VIPeR dataset when applying the method proposed in Chapter 2.

4.4.1 The effect of human detection and segmentation on person ReID in the single-shot scenario

As full frames are not available in the VIPeR dataset, in this experiment we can only evaluate the effect of person segmentation. Two methods of person image segmentation are considered: manual segmentation via the Interactive Segmentation Tool and automatic segmentation based on the Pedparsing method. The results obtained when applying the method proposed in Chapter 2 are shown in Figure 4.11. From this figure, it is clear that manual segmentation obtains the best results, with an improvement of 6.21% over manual detection. This means that the background plays an important role in person ReID performance.

Figure 4.12 shows an example of segmentation and ReID results on the VIPeR dataset. In Figure 4.12a), two original images are segmented in two different manners, manual and automatic segmentation, while Figures 4.12b) and c) illustrate the ReID results for the two query persons in three different cases: the first rows present the original images, and the second and last rows correspond to the manually and automatically segmented images, respectively. For the first query person (Figure 4.12b)), the true match is found immediately in the first ranks in all three cases. For the second query person (Figure 4.12c)), with the original images the true match is found at rank-3, whereas manual segmentation brings the true match to rank-1; due to the loss of information caused by automatic segmentation, the true match cannot be found within the first ten ranks.

Figure 4.12: Examples of a) segmentation results and b), c) person ReID results in the three cases (original images, manually segmented images, and automatically segmented images) for two different persons in the VIPeR dataset.

Moreover, we perform an additional single-shot experiment on the PRID 2011 dataset. A bounding box for each individual is chosen randomly, and the proposed framework is then applied to these bounding boxes.
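A minimal sketch of how a binary person mask (from manual annotation, Pedparsing, or Mask R-CNN) can be applied to a detected crop to suppress the background before feature extraction is given below; the file names are hypothetical and OpenCV is only used for loading.

```python
import cv2
import numpy as np

def apply_person_mask(crop, mask, background_value=0):
    """Keep person pixels and replace background pixels of a pedestrian crop.

    crop: (H, W, 3) image of the detected bounding box.
    mask: (H, W) binary mask, 1 for person pixels and 0 for background.
    """
    mask3 = np.repeat(mask[:, :, None].astype(crop.dtype), 3, axis=2)
    return crop * mask3 + background_value * (1 - mask3)

# crop = cv2.imread("person_crop.png")                               # hypothetical files
# mask = (cv2.imread("person_mask.png", 0) > 127).astype(np.uint8)
# segmented = apply_person_mask(crop, mask)
```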
Figure 4.13: CMC curves of the three evaluated scenarios on the PRID 2011 dataset in the single-shot approach: (a) without segmentation (manual detection, automatic detection by ACF, automatic detection by YOLO) and (b) with segmentation (detection followed by Pedparsing segmentation, and Mask R-CNN detection with its own segmentation).

Because only a random image is chosen for each person, the identities as well as the number of persons in this experiment are not the same as in the multi-shot case on PRID 2011, which is used in most existing works. In this experiment, we utilize ACF, YOLO, and Mask R-CNN for the human detection stage, and Mask R-CNN and Pedparsing for segmentation; it is worth noting that Mask R-CNN serves the detection and segmentation purposes simultaneously. For feature extraction, the GOG descriptor is again employed for person representation. The performance of the end-to-end person ReID pipeline is evaluated for the different cases indicated in Figure 4.13, both without and with segmentation. This figure supports several conclusions. First, comparing the corresponding curves on the left and right sides, the segmentation stage gives worse results than applying detection alone. This means that background removal with binary masks, which can cause information loss and over-smoothing of an image, is not an optimal choice for improving person ReID performance: the matching rates at rank-1 after segmentation are reduced by 10.9%, 8.53%, and 10.33% compared with the manual, ACF, and YOLO cases, respectively. Moreover, comparing the considered detectors, ACF achieves better performance than YOLO in both cases (without and with segmentation); the matching rates at rank-1 with the ACF detector are higher by 2.47% and 4.27% than with the YOLO detector without and with segmentation, respectively. In addition, the ACF detector can reach the performance of manual detection. One remarkable point is that Mask R-CNN provides impressive results that are competitive with manual detection, which is encouraging for making an end-to-end person ReID pipeline practical.

4.4.2 The effect of human detection and segmentation on person ReID in the multi-shot scenario

Figure 4.14 shows the matching rates on the PRID 2011 dataset when employing GOG descriptors with the XQDA technique under three settings: manual detection as provided by Hirzer et al. [32], one of the considered automatic detection techniques, and automatic segmentation using Pedparsing after the automatic detection stage. It is worth noting that, in order to compare automatic detection with manual detection, only the person ROIs from automatic detection whose IoU with the manual annotation is greater than 0.4 are kept.
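A small sketch of the IoU criterion used above to keep only well-localized automatic detections is given below; boxes are assumed to be given as (x1, y1, x2, y2), and the 0.4 threshold follows the text.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def keep_valid_detections(detections, manual_box, thr=0.4):
    """Keep automatically detected ROIs that overlap the manual annotation enough."""
    return [d for d in detections if iou(d, manual_box) > thr]
```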
Figure 4.14: CMC curves of the three evaluated scenarios on the PRID 2011 dataset when applying the method proposed in Chapter 2 (rank-1 rates: 90.56% with manual detection, 91.01% with automatic detection, and 88.76% with automatic detection plus segmentation).

Figure 4.15 shows an example of the results obtained in the first two steps, detection and segmentation, and in the final person ReID step of the fully automatic ReID system. In Figure 4.15a), the human detection method is first applied to full frames to extract human ROIs, and segmentation is then applied to the resulting bounding boxes. Figures 4.15b) and c) show the ReID results for two different query persons: for the first query person the true match is found at rank-1, whereas for the second person the true match is not always determined at the first rank. These results clearly illustrate the influence of human detection and segmentation on person ReID performance.

Figure 4.15: Examples of a) human detection and segmentation results and b), c) person ReID results in the three cases (original images, manually segmented images, and automatically segmented images) for two different persons in the PRID-2011 dataset.

It is interesting to see that the person re-identification results when using automatic detection are slightly better than those of manual detection. The reason is that automatic detection produces very well-aligned bounding boxes, while manual detection defines a larger bounding box around each pedestrian. Additionally, a pedestrian is represented by multiple images, which helps to overcome errors generated in the human detection phase. Previous studies have discussed the effect of the background on person ReID accuracy [113, 156, 157]; however, the results obtained in this chapter show that when human detection is relatively good, human segmentation is not required, as the quality of current automatic object segmentation methods is far from perfect. As shown in Figure 4.14, segmentation reduces the matching rate at rank-1 by 2.25%. This is a helpful recommendation for a full person ReID pipeline.

Finally, Table 4.1 compares the proposed method with state-of-the-art methods. The results obtained on the PRID-2011 dataset confirm the feasibility of building a fully automatic person ReID pipeline including human detection, tracking, and person ReID. When one of the considered human detection methods is combined with the person ReID framework proposed in Chapter 2, the matching rates are 91.0%, 98.4%, 99.3%, and 99.9% at the important ranks (1, 5, 10, 20). These values are much higher than those of all the aforementioned deep learning-based approaches [78, 34, 58, 138, 57]. Moreover, according to the quantitative evaluation of the trade-off between person ReID accuracy and computation time presented in Chapter 2, the total time for person re-identification is 11.5 s, 20.5 s, or 100.9 s when using four key frames, the frames within one walking cycle, or all frames, respectively. If a practical person ReID system is not required to return the correct match in the very first ranks, four key frames can be used to reduce computation time and memory requirements while still ensuring person ReID accuracy within the first 20 ranks. Besides, current human detection and tracking algorithms not only meet real-time requirements but also ensure high accuracy. These results support the feasibility of building a fully automatic person ReID system in practice.

Table 4.1: Comparison of the proposed method with state-of-the-art methods on PRID 2011 (matching rates in %).

Methods                                               R=1    R=5    R=10   R=20
HOG3D+DVR [102]                                       40.0   71.1   84.5   92.2
TAPR [78]                                             68.6   94.4   97.4   98.9
LBP-Color+LSTM [34]                                   53.6   82.9   92.8   97.9
GOG+LSTM [58]                                         70.4   93.4   97.6   99.3
DFCP [138]                                            51.6   83.1   91.0   95.5
RNN [57]                                              70.0   90.0   95.0   97.0
Proposed, with manual detection                       90.6   98.4   99.2   100
Proposed, with automatic detection                    91.0   98.4   99.3   99.9
Proposed, with automatic detection and segmentation   88.8   98.36  99.0   99.6
4.5 Conclusions and Future work

In this chapter, the author attempts to build a fully automatic person ReID system with three main steps: person detection, segmentation, and person ReID. Based on the obtained results, the author can confirm that the two preceding steps affect person ReID accuracy; however, the effect is much reduced thanks to the robustness of the descriptor and the metric learning. The obtained results support two suggestions. First, if the automatic person detection step provides relatively good performance, segmentation is not required; this improves the computational time, as the segmentation step is time-consuming. Second, multi-shot is the preferred choice because this scenario considers all instances of one person and therefore allows poor detection results to be discounted when they occur in only a few instances. However, due to the limited research time, the person tracking step has not been examined in this thesis. In future work, the influence of person tracking on the overall performance of the person ReID system will be considered in order to give a complete recommendation for developing fully automatic surveillance systems. The main results in this chapter are included in three publications: the 2nd, 3rd, and 7th ones.

CONCLUSION AND FUTURE WORKS

Conclusion

This thesis proposes two main contributions. The first contribution is an effective method for video-based person re-identification through representative frame selection and feature pooling. As widely observed, in video-based person ReID each person has multiple images, often dozens or hundreds, which places a significant burden on computation speed and memory requirements. A practical observation is that each pedestrian's trajectory may include several walking cycles. Consequently, the first step of the proposed method is to extract walking cycles, after which four key frames are selected from each cycle; feature extraction and person matching are then performed only on these representative frames. In order to provide an exhaustive evaluation, experiments are conducted in three different scenarios: all frames, one walking cycle, and four key frames. The matching rates at rank-1 on PRID 2011 are 77.19%, 79.10%, and 90.56% for the four-key-frame, one-walking-cycle, and all-frame schemes, respectively, while those on the iLIDS-VID dataset are 41.09%, 44.14%, and 70.13%. The obtained results show that the proposed method outperforms various state-of-the-art methods, including deep learning ones. Additionally, the trade-off between person ReID accuracy and computation time is fully investigated; the results indicate the advantages and drawbacks of each scheme, and recommendations on the use of these schemes are given to the research community.

The second contribution of this thesis is the set of fusion schemes proposed for both settings of person ReID. In the first setting, we formulate person ReID as a classification-based information retrieval problem, where a model of person appearance is learned from the gallery images and the identity of the person of interest is determined by the probability that his or her probe image belongs to the model. Both hand-designed and deep-learned features are used in the feature extraction step, and an SVM classifier is proposed to learn the person appearance model.
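A schematic sketch of the classification-based retrieval formulation summarized above: a multi-class SVM is fitted on gallery features (one class per identity) and a probe is ranked by the class probabilities. The kernel and the feature inputs are illustrative assumptions, not the exact model used in Chapter 3.

```python
import numpy as np
from sklearn.svm import SVC

def fit_appearance_model(gallery_feats, gallery_ids):
    """Learn a person-appearance model from gallery images (one class per identity)."""
    model = SVC(kernel="linear", probability=True)
    model.fit(gallery_feats, gallery_ids)
    return model

def rank_identities(model, probe_feat):
    """Rank gallery identities by the probability that the probe belongs to each one."""
    probs = model.predict_proba(probe_feat.reshape(1, -1))[0]
    order = np.argsort(-probs)
    return model.classes_[order], probs[order]
```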
Three fusion schemes are proposed: early fusion, product-rule late fusion, and query-adaptive late fusion. Several experiments are conducted on the CAVIAR4REID and RAiD datasets; the obtained results prove the effectiveness of the fusion schemes, with matching rates at rank-1 of 94.44%, 99.72%, and 100% for case A and case B of CAVIAR4REID and for the RAiD dataset, respectively. In the second setting, the proposed method of the first contribution is extended by adding fusion schemes built on product-rule and sum-rule operators. To leverage the role of each feature in the fusion schemes, the weights assigned to the considered features are either equal or adaptive to the content of the query person. The obtained results indicate that although GOG and ResNet are the most powerful features for person representation in person ReID, their effectiveness can still be improved by integrating them into the fusion schemes. The experiments are performed on two benchmark datasets (PRID-2011 and iLIDS-VID) with remarkable improvement: the matching rates at rank-1 increase by up to 5.65% and 14.13% on PRID-2011 and iLIDS-VID, respectively, compared with using a single feature.

Besides, the author also evaluates the performance of a fully automated person ReID system including person detection, tracking, and ReID. Concerning human detection, three state-of-the-art methods, ACF, YOLO, and Mask R-CNN, are employed. In order to eliminate the effect of the background, Pedparsing is used in the segmentation step; it is worth noting that Mask R-CNN performs human detection and segmentation simultaneously. In the person ReID step, two state-of-the-art methods, the GOG descriptor and XQDA, are used for feature extraction and metric learning, respectively. Additionally, to meet the real-time requirement of a practical system, the GOG descriptor is re-implemented in C++ and its optimal parameters are chosen. Two suggestions are provided by this work. First, if the automatic person detection step provides relatively good performance, segmentation is not required; this improves the computational time, as the segmentation step is time-consuming. Second, multi-shot is the preferred choice because this scenario considers all instances of one person and therefore allows poor detection results to be discounted when they occur in only a few instances. However, due to the limited research time, the person tracking step has not been examined in this thesis. In future work, the influence of person tracking on the overall performance of the person ReID system will be considered; moreover, a fully automatic surveillance system will be deployed and evaluated.

Future works

In this thesis, different advances have been made in person re-identification. However, there is still a long way to go in order to reach our final goal. In the future, we want to continue several research directions based on the results of this dissertation. In this section, we summarize the selected directions we would like to pursue after this dissertation, divided into two categories: short-term and long-term future works.

Short term

- To evaluate the proposed methods, extensive experiments have been conducted on several benchmark datasets. However, due to the limitation in hardware resources, small and medium size datasets (e.g., the number of persons is 632 ...
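As a small illustration of the sum-rule and product-rule late fusion summarized in the conclusion above, the sketch below combines per-feature score matrices; the min-max style normalization and the example weights are assumptions for illustration, not the thesis's exact query-adaptive scheme.

```python
import numpy as np

def sum_rule_fusion(dist_matrices, weights=None):
    """Weighted sum of normalized distance matrices from different features
    (e.g. GOG and ResNet); lower fused values mean better matches."""
    if weights is None:
        weights = [1.0 / len(dist_matrices)] * len(dist_matrices)
    normed = [(d - d.min()) / (d.max() - d.min() + 1e-12) for d in dist_matrices]
    return sum(w * d for w, d in zip(weights, normed))

def product_rule_fusion(similarity_matrices):
    """Product rule on similarity scores (higher is better)."""
    fused = np.ones_like(similarity_matrices[0], dtype=np.float64)
    for s in similarity_matrices:
        fused *= s
    return fused

# fused = sum_rule_fusion([dist_gog, dist_resnet], weights=[0.6, 0.4])  # hypothetical inputs
```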
In Figure 1.3a) the person appears in both cameras, while in Figure 1.3b) she appears only in camera A.

Figure 1.3: (a) Closed-set person ReID and (b) open-set person ReID, illustrated with two cameras (Camera A and Camera B).

... static non-overlapping cameras. These images suffer from large variations in illumination, view-point, pose, etc. Figure 1.5 shows the camera layout for the PRID-2011 dataset, in which two cameras are installed ... out due to strong occlusions, sudden disappearance/appearance, or fewer than five reliable images for a person in each camera view. After filtering, there are 385 persons in camera view
