MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

NGUYEN THUY BINH

PERSON RE-IDENTIFICATION IN A SURVEILLANCE CAMERA NETWORK

Major: Electronics Engineering
Code: 9520203

ABSTRACT OF DOCTORAL DISSERTATION IN ELECTRONICS ENGINEERING

Hanoi - 2021

This study was completed at Hanoi University of Science and Technology.
Supervisors: Assoc. Prof. Pham Ngoc Nam and Assoc. Prof. Le Thi Lan.
Reviewer 1: Assoc. Prof. Tran Duc Tan. Reviewer 2: Assoc. Prof. Le Nhat Thang. Reviewer 3: Assoc. Prof. Ngo Quoc Tao.
This dissertation was defended before the approval committee at Hanoi University of Science and Technology at 9:00 on January 8, 2021.
This dissertation can be found at the Ta Quang Buu Library (Hanoi University of Science and Technology) and the Vietnam National Library.

INTRODUCTION

Motivation
The development of image processing and pattern recognition allows building automatic video analysis systems. Such a system contains four crucial steps: person detection, tracking, person re-identification (ReID), and recognition. Person re-identification is defined as the problem of associating images or image sequences of a pedestrian as he or she moves through a network of non-overlapping cameras [7]. Although it has achieved some important milestones, person re-identification has not yet been deployed in practice due to insufficient performance.

In person ReID, the dataset is divided into two sets: probe and gallery. Depending on the number of images used for person representation, person re-identification is classified into single-shot and multi-shot. In the single-shot approach, each person has only one image in both the gallery and probe sets. Conversely, each person has multiple images in the multi-shot approach. Note that the probe and gallery sets are captured from at least two non-overlapping camera fields of view.

Problem formulation
Given a query person Q_i and N persons in the gallery G_j, j = 1..N:

    Q_i = \{ q_i^{(l)} \}, \; l = 1, \dots, n_i; \qquad G_j = \{ g_j^{(k)} \}, \; k = 1, \dots, m_j,    (1)

where n_i and m_j are the numbers of images of persons Q_i and G_j. The identity of the given query person Q_i is determined as follows [26]:

    j^* = \arg\min_j \, d(Q_i, G_j),    (2)

where d(Q_i, G_j) is defined as the distance between Q_i and G_j. In an alternative definition, the similarity between two persons is used instead of the distance:

    j^* = \arg\max_j \, \mathrm{Sim}(Q_i, G_j).    (3)
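To make the formulation concrete, the following is a minimal sketch of Eq. (2), assuming pooled feature vectors and a plain Euclidean distance as a stand-in (the thesis itself uses a learned XQDA metric, introduced later); all names are illustrative:

    import numpy as np

    def match_query(query_feat, gallery_feats):
        """Rank gallery persons by distance to the query, Eq. (2).

        query_feat:    (d,) pooled feature of the query person Q_i
        gallery_feats: (N, d) pooled features of the gallery persons G_j
        Returns gallery indices sorted by ascending distance d(Q_i, G_j).
        """
        dists = np.linalg.norm(gallery_feats - query_feat, axis=1)  # Euclidean stand-in
        ranked = np.argsort(dists)       # ranked[0] is j*, the arg-min of Eq. (2)
        return ranked, dists[ranked]

    # toy usage: 5 gallery persons with 16-dimensional features
    rng = np.random.default_rng(0)
    gallery = rng.normal(size=(5, 16))
    query = gallery[3] + 0.05 * rng.normal(size=16)   # query close to person 3
    ranked, d = match_query(query, gallery)
    print(ranked[0])                                   # -> 3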
Challenges
Person ReID faces several challenges: (1) strong variations in illumination, viewpoint, pose, etc.; (2) the large number of images per person in each camera view and the large number of persons; (3) the effect of imperfect human detection and tracking results.

Objective
The thesis has three main objectives:
• Robust person representation for multi-shot person ReID. The first objective of this dissertation is to find a novel method that reduces computation cost and memory requirements while retaining accuracy in the video-based person ReID approach.
• Improving accuracy through fusion schemes. Improving accuracy is the most important target of person ReID. The second objective of this thesis is to improve person ReID accuracy with feature fusion schemes that take advantage of all kinds of features.
• Building a fully automated surveillance system. A practical person ReID system has three main stages: human detection, tracking, and person ReID. However, most existing person ReID studies focus only on the person ReID stage and therefore rely on the assumption that human detection and tracking are perfect. The final goal is to build a fully automated person ReID system and evaluate the effect of human detection and segmentation on person ReID.

Context constraints
This study follows the supervised person ReID approach, and both the single-shot and multi-shot settings are considered.
• Images/videos are captured in daylight conditions.
• The focus is short-term person ReID, in which the appearance and clothes of each pedestrian do not change during a certain period of time; in addition, pedestrians do not wear uniforms.
• Person ReID is solved in the closed-set setting: each person appears on at least two cameras.

Contributions
With respect to the three objectives presented above, the main contributions introduced in this dissertation are:
• Contribution 1: An effective method for multi-shot person ReID through representative key-frame selection and temporal feature pooling. Instead of using all images, representative frames are extracted and used for person representation. Two types of representative frames are considered in this work: the frames within a walking cycle and four key frames of a walking cycle.
• Contribution 2: Previous studies have proved that each feature has its own discriminative power for person representation. To leverage these features, different late fusion schemes are proposed for both settings of person ReID. Besides equal weights, feature weights are adaptively determined for each query based on the query characteristics.

Dissertation outline
In addition to the introduction and conclusion, the dissertation consists of four chapters. Chapter 1 reviews and synthesizes the existing literature related to person ReID. Chapter 2 presents an effective framework for person ReID; this framework helps overcome the difficulties of video-based person ReID. Chapter 3 introduces several fusion schemes for person ReID, evaluated in both settings of person ReID. Chapter 4 presents a fully automated person ReID system including human detection, tracking, and person ReID; the effect of the human detection and segmentation steps on overall person ReID performance is considered. The conclusion and future work section summarizes the contributions of this thesis and introduces future directions for the person ReID problem.

CHAPTER 1. LITERATURE REVIEW

1.1 Datasets and evaluation metrics
1.1.1 Datasets

Table 1.1. Benchmark datasets used in the thesis.

    Dataset       Time  #ID  #Cam  #Images  Label  Full frames  Resolution  Single-shot  Multi-shot  Setting
    VIPeR         2007  632  2     1,264    hand                128x48      X                        2
    CAVIAR4REID   2011  72   2     1,220    hand                vary        X            X           1
    RAiD          2014  43   4     6,920    hand                128x64                   X           1
    PRID-2011     2011  934  2     24,541   hand   +            128x64                   X           2
    iLIDS-VID     2016  300  2     42,495   hand                vary                     X           2

Five benchmark datasets, VIPeR, CAVIAR4REID, RAiD, PRID-2011, and iLIDS-VID, are used for performance evaluation of the proposed methods in this thesis. Among them, CAVIAR4REID and RAiD are set up following the first setting, while the three remaining datasets follow the second setting, in which one half of the persons is used for the training phase and the other half for the test phase. Table 1.1 summarizes the datasets used in the thesis.

1.1.2 Evaluation metrics
To evaluate the proposed methods for person ReID, Cumulative Matching Characteristic (CMC) curves [22] are used. The value of the CMC curve at rank k is the proportion of queries for which the true match appears within the top k results.
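As an illustration, here is a small sketch of how a CMC curve can be computed from ranked retrieval results in the closed-set case (function and variable names are illustrative, not from the thesis):

    import numpy as np

    def cmc_curve(ranked_lists, true_ids, gallery_ids, max_rank=20):
        """CMC: fraction of queries whose true match appears within the top k ranks.

        ranked_lists: (Q, N) gallery indices sorted by ascending distance, one row per query
        true_ids:     (Q,) ground-truth identity of each query
        gallery_ids:  (N,) identity of each gallery entry
        Closed-set assumption: every query identity exists in the gallery.
        """
        hits = np.zeros(max_rank)
        for ranks, tid in zip(ranked_lists, true_ids):
            # 0-based position of the first correct gallery entry in this ranking
            first = int(np.flatnonzero(gallery_ids[ranks] == tid)[0])
            if first < max_rank:
                hits[first:] += 1.0    # counted as a hit at every rank >= first + 1
        return hits / len(true_ids)    # CMC values at ranks 1..max_rank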
1.2 Feature extraction
To describe a pedestrian image, either biometric cues (eyes, iris, gait) or visual appearance is exploited. Due to the low resolution of surveillance images/videos, information extracted from the eyes or iris is not sufficient for person representation. Consequently, the majority of existing person ReID studies focus on the visual appearance of pedestrians [12]. In general, features are classified into two main categories: hand-designed and deep-learned features.

1.3 Metric learning
The main target of metric learning is to find a suitable and effective distance for person matching. It tries to minimize the distances between cross-view images of the same person and maximize those between different persons. Metric learning can also be viewed as learning a sub-space in which the projected feature vectors satisfy the above-mentioned conditions.
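As a concrete illustration of this idea, the KISSME algorithm [11], which XQDA [14] extends to cross-view data, derives a Mahalanobis metric from the covariances of similar-pair and dissimilar-pair feature differences. The following is a minimal sketch, a simplified reading of [11] that omits the projection onto the positive semi-definite cone usually applied in practice:

    import numpy as np

    def kissme_metric(similar_diffs, dissimilar_diffs, eps=1e-6):
        """KISSME-style learned Mahalanobis metric (cf. [11]).

        similar_diffs:    (S, d) feature differences x - z of same-person pairs
        dissimilar_diffs: (D, d) feature differences of different-person pairs
        Returns M such that d(x, z) = (x - z)^T M (x - z).
        Note: M may be indefinite; practical implementations clip it to be PSD.
        """
        d = similar_diffs.shape[1]
        cov_s = similar_diffs.T @ similar_diffs / len(similar_diffs) + eps * np.eye(d)
        cov_d = dissimilar_diffs.T @ dissimilar_diffs / len(dissimilar_diffs) + eps * np.eye(d)
        return np.linalg.inv(cov_s) - np.linalg.inv(cov_d)

    def mahalanobis(x, z, M):
        """Learned distance between two feature vectors."""
        diff = x - z
        return float(diff @ M @ diff)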
1.4 Fusion schemes for person ReID
To leverage the discriminative power of different features for person representation, several efforts have tried to combine features, known as feature fusion. Feature fusion is divided into feature-level fusion (early fusion) and score-level fusion (late fusion). In the early fusion approach, features are concatenated to form a large-dimension vector, while methods of the second approach combine the weights/scores obtained from the matching step in a similarity function to get the final score.

1.5 Representative frame selection
In multi-shot person ReID, using all frames places a great burden on computation as well as storage. Some existing studies therefore select key frames rather than using all frames in a sequence for person representation [6, 16, 24].

1.6 Fully automated person ReID systems
A fully automated person ReID system has three main phases: human detection, tracking, and person ReID. In fact, only a few works focus on the complete pipeline, in contrast to the wide range of works that address the person ReID phase alone. Toward a complete person ReID system, this thesis considers the other phases, person detection and segmentation, and evaluates their effect on person ReID performance in both the single-shot and multi-shot scenarios.

CHAPTER 2. MULTI-SHOT PERSON RE-ID THROUGH REPRESENTATIVE FRAMES SELECTION AND TEMPORAL FEATURE POOLING

2.1 Introduction
This chapter introduces a novel framework that adds key-frame selection and feature pooling in order to eliminate redundant information, reduce computational complexity, and speed up matching.

2.2 Proposed method
2.2.1 Overall framework
Figure 2.1 presents the proposed framework for multi-shot person ReID, with four main steps: representative frame selection, image-level feature extraction, temporal feature pooling, and person matching. The first step aims to determine the representative frames used for person representation; three strategies are introduced in this work: four key frames, a walking cycle, and all images. Once the frames used for person representation are determined, Gaussian of Gaussian (GOG) descriptors [18] are extracted from these frames. As a set of GOG features is computed for each person, they need to be reduced to a unique feature by the temporal feature pooling step before person matching. Finally, person matching based on the Cross-view Quadratic Discriminant Analysis (XQDA) [14] technique is performed to find the matched individuals for each given probe person.

Figure 2.1. The proposed framework consists of four main steps: representative frame selection (walking-cycle and key-frame extraction), image-level feature extraction, temporal feature pooling (min-, average-, and max-pooling), and person matching, applied to both the gallery sequences and the probe sequence.

The proposed framework is demonstrated in detail through the two following algorithms. Algorithm 2.1 is performed in the training phase, which can be an offline process, while Algorithm 2.2 is conducted online in the test phase.

2.2.2 Representative image detection
Firstly, a representative walking cycle is chosen from the set of walking cycles of a person along the moving path based on the Flow Energy Profile (FEP) [21]. Secondly, four key frames are taken from this walking cycle: two frames corresponding to the local minimum and maximum points of the FEP, and two middle frames between the max- and min-frames.
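A minimal sketch of this key-frame rule follows, assuming the FEP values of a single walking cycle are already computed; the exact convention for the two middle frames is an assumption about the rule, not taken from the thesis:

    import numpy as np

    def four_key_frames(fep):
        """Pick four key frames from one walking cycle of the Flow Energy Profile.

        fep: 1-D array with one motion-energy value per frame of the cycle.
        Returns (up to) four frame indices: FEP minimum, FEP maximum, and the
        two middle frames between them (one wrapping around the cycle).
        """
        fep = np.asarray(fep, dtype=float)
        i_min = int(np.argmin(fep))      # minimum-energy posture (legs together)
        i_max = int(np.argmax(fep))      # maximum-energy posture (legs apart)
        lo, hi = sorted((i_min, i_max))
        mid1 = (lo + hi) // 2                              # middle frame between extrema
        mid2 = ((hi + (len(fep) + lo)) // 2) % len(fep)    # wrap-around middle frame
        return sorted({i_min, i_max, mid1, mid2})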
Algorithm 2.1: Training phase (offline)
Input: image sequences from two camera views, X = {X_i} and Z = {Z_j}, i, j = 1..N_tr, where N_tr is the number of persons used for training.
Output: model parameters W, M.
Step 1: Select representative frames for each person.
  Step 1.1: Extract a walking cycle for each pedestrian: X_i^(c) = Cycle-extraction(X_i) and Z_j^(c) = Cycle-extraction(Z_j) for all i, j.
  Step 1.2: Extract four key frames from a random walking cycle: X_i^(k) = Keyframe-extraction(X_i^(c)) and Z_j^(k) = Keyframe-extraction(Z_j^(c)).
Step 2: Compute image-level feature vectors for the retained frames: f_i^(l) = Feature-extraction(x_i^(l)) and f_j^(l) = Feature-extraction(z_j^(l)), where l runs over the frames of each sequence.
Step 3: Compute the final feature for person representation by temporal pooling: F_i^final = Temporal-pooling({f_i^(l)}, pool_choice), and likewise F_j^final.
Step 4: Compute the sub-space projection matrix and learned metric with the XQDA algorithm: [W, M] = XQDA(F_X, F_Z), where F_X = {F_i^final} and F_Z = {F_j^final}.

2.2.3 Image-level feature extraction
Among the numerous features proposed for the person ReID problem, GOG [18] is evaluated as one of the most effective descriptors. In the proposed framework, GOG descriptors are extracted on four color spaces (RGB, Lab, HSV, and nRnG), and these features are combined.

Algorithm 2.2: Test phase (online)
Input: a query person Q_i; a gallery of persons G = {G_j}, j = 1..N_ts, where N_ts is the number of persons in the gallery set; parameters of the trained model, W, M.
Output: a ranked list of gallery persons for the given query person.
Steps 1-3: as in Algorithm 2.1, applied to the query and gallery sequences with one of the three frame-selection choices: all frames, a walking cycle, or four key frames.
Step 4: Calculate the distance between the query person and each gallery person: d(Q_i, G_j) = distance(F_i^final, F_j^final, W, M) for j = 1..N_ts.
Step 5: Rank the gallery persons in ascending order of distance to the query person: [R_i^(1), R_i^(2), ..., R_i^(N_ts)] = ranked-list(d(Q_i, G_j)).

2.2.4 Temporal feature pooling
This work proposes temporal feature pooling to (1) make the comparison/matching between two objects simpler and (2) reduce computation time as well as occupied memory. Three pooling strategies, min-, average-, and max-pooling across all selected frames, are applied in this work.
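A minimal sketch of the pooling step follows; the feature dimension in the usage example is a toy stand-in for a per-frame GOG descriptor:

    import numpy as np

    POOLING = {"min": np.min, "avg": np.mean, "max": np.max}

    def temporal_pooling(frame_feats, pool_choice="avg"):
        """Collapse per-frame features into one sequence-level descriptor.

        frame_feats: (T, d) image-level features (e.g., GOG) for T selected frames
        pool_choice: 'min', 'avg', or 'max', applied independently per dimension
        """
        return POOLING[pool_choice](np.asarray(frame_feats), axis=0)

    # usage: pool 13 frames of a walking cycle into a single vector
    feats = np.random.default_rng(1).normal(size=(13, 512))  # toy per-frame features
    seq_feat = temporal_pooling(feats, "avg")  # AVG pooling performed best in the thesis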
2.2.5 Person matching
The XQDA technique is an extended version of the Bayesian face and Keep It Simple and Straightforward MEtric (KISSME) [11] algorithms, in which the multi-class classification problem is converted into a binary one of intra-personal and extra-personal variations. The remarkable point of XQDA is that it learns a discriminative subspace from cross-view data and a distance function simultaneously.

2.3 Experimental results
Extensive experiments are performed on two public benchmark datasets, PRID-2011 and iLIDS-VID, to show the effectiveness of the proposed framework.

2.3.1 Evaluation of representative frame extraction and temporal feature pooling schemes
The experimental results are shown for the three cases of four key frames, one walking cycle, and all frames. Each case is further evaluated on (1) the four color spaces RGB, Lab, HSV, and nRnG, and their fusion; and (2) three kinds of pooling: minimum (MIN), maximum (MAX), and average (AVG). This results in nine evaluated scenarios of feature pooling on the training and testing sets: AVG-AVG, AVG-MAX, AVG-MIN, MAX-AVG, MAX-MAX, MAX-MIN, MIN-AVG, MIN-MAX, and MIN-MIN. For example, MAX-MIN means that max-pooling is applied on the training data while min-pooling is applied on the testing data. The obtained results show that the best performance is achieved by AVG-AVG in the majority of scenarios for all three frame-selection schemes. For PRID-2011, with the fusion of color spaces, the rank-1 matching rates for the three schemes are the highest: 77.19%, 79.10%, and 90.56%, respectively. The same conclusion holds for the iLIDS-VID dataset. The CMC curves for these best results in the three scenarios on (a) PRID-2011 and (b) iLIDS-VID show that, for PRID-2011, the rank-1 matching rate increases by only 1.91% with a walking cycle and by 12.47% with all frames, in comparison with the four-key-frames case. For iLIDS-VID, these improvements are 3.05% and 20.58%, respectively.

2.3.2 Quantitative evaluation of the trade-off between accuracy and computational time
Table 2.1 compares the three schemes in terms of person ReID accuracy, computational time, and memory requirement on the PRID-2011 dataset; the reported values are computed for a random split. The average number of images per person on cameras A and B is about 100, and each walking cycle has approximately 13 images on average. The experiments are conducted on a computer with an Intel(R) Core(TM) i5-4440 CPU @ 3.10 GHz and 16 GB RAM. Concerning the memory requirement, as an image of 128 x 64 pixels with 24-bit color depth takes 24 KB, the required memory for the four-key-frames, one-walking-cycle, and all-frames schemes is 96 KB, 312 KB, and 2,400 KB per person, respectively. Using all frames in a walking cycle increases the rank-1 matching rate by approximately 2%, while the computational time nearly doubles in comparison with four key frames.

This work is based on the assumption that a person stays in the camera field of view for a certain duration. In reality, this assumption does not always hold; in the future, the proposed framework will be extended to take these issues into account. The main results in this chapter are published in the 7th publication.

CHAPTER 3. PERSON RE-ID PERFORMANCE IMPROVEMENT BASED ON FUSION SCHEMES

3.1 Introduction
This chapter shows that person ReID accuracy can still be improved through fusion schemes. Both kinds of features, hand-designed and deep-learned, are used for image representation. For hand-designed features, GOG [18] and the Kernel Descriptor (KDES) [1] are considered, while for deep-learned features, two of the strongest convolutional neural networks, GoogLeNet and the Residual Neural Network (ResNet), are employed. To evaluate the role of each feature, the weights assigned to the considered features may be equal or adaptive to the query person. Extensive experiments are performed on both settings of person ReID.

3.2 Fusion schemes for the first setting of person ReID
Multi-shot person ReID can be divided further into two sub-categories: the image-to-images problem, also called single-versus-multi (SvsM), and the images-to-images problem, or multi-versus-multi (MvsM). The image-to-images approach is considered a special case of multi-shot person ReID in which there is only one image of a pedestrian in the probe set while the gallery has multiple images of the same pedestrian.

3.2.1 Image-to-images person ReID
3.2.1.1 The proposed framework
The proposed method for image-to-images person re-identification is shown in Fig. 3.1. In this method, image-to-images person ReID is formulated as a classification-based information retrieval problem, where a model of person appearance is learned from the gallery images and the identity of the person of interest is determined by the probability that his/her probe image belongs to the model.

Figure 3.1. Image-to-images person ReID scheme: in the training phase, GOG, KDES, and CNN features of the gallery images are fused and used to train an SVM model; in the testing phase, the query image is scored by the SVM and matched through early fusion, product-rule-based late fusion, or query-adaptive late fusion.
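As an illustration of this classification-based formulation, the sketch below trains a probability-calibrated SVM over gallery identities and ranks identities for a probe image by class probability. Using scikit-learn is an assumption for illustration; the thesis does not prescribe a specific SVM implementation:

    import numpy as np
    from sklearn.svm import SVC

    def train_gallery_model(gallery_feats, gallery_ids):
        """Learn one multi-class, probability-calibrated SVM over gallery identities."""
        clf = SVC(kernel="linear", probability=True)
        clf.fit(gallery_feats, gallery_ids)
        return clf

    def rank_identities(clf, probe_feat):
        """Rank gallery identities by P(identity | probe image)."""
        probs = clf.predict_proba(probe_feat.reshape(1, -1))[0]
        order = np.argsort(-probs)                 # descending probability
        return clf.classes_[order], probs[order]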
3.2.1.2 Feature fusion strategies
Early fusion scheme: the concatenated feature vectors of the gallery images are used in the training phase to generate a model, which is then employed in the testing phase for matching a query image to a trained class.

Late fusion scheme: late fusion is feature fusion at the score level, in which the ranked lists of retrieved persons for a given query image, one per kind of feature, are employed. These scores are combined in different manners based on the two common rules, the sum-rule and the product-rule. Besides equal weights, and inspired by the impressive results of Zheng et al. [25], query-adaptive weights are also considered in this study. Let Sim(q, G_j)_{prod-equal-weight}, Sim(q, G_j)_{prod-adaptive-weight}, and Sim^{(m)}(q, G_j) be the similarity scores between an image q and person G_j computed by applying the product rule with equal weights, with query-adaptive weights, and when using the m-th feature, respectively.

- Product rule with equal weights:

    \mathrm{Sim}(q, G_j)_{prod\text{-}equal\text{-}weight} = \prod_{m=1}^{M} \mathrm{Sim}^{(m)}(q, G_j)    (3.1)

- Product rule with query-adaptive weights:

    \mathrm{Sim}(q, G_j)_{prod\text{-}adaptive\text{-}weight} = \prod_{m=1}^{M} \mathrm{Sim}^{(m)}(q, G_j)^{\omega_q^{(m)}},    (3.2)

where \omega_q^{(m)} is the weight of the m-th feature for the given query image q, and Sim^{(m)}(q, G_j) is the score or probability that the query image q belongs to the appearance model of person G_j.

3.2.2 Images-to-images person ReID
Figure 3.2 shows the proposed framework for images-to-images person ReID. In this framework, temporal linking between images of the same person is not required, and the images are treated independently. The images-to-images problem is formulated as a fusion of multiple image-to-images person ReIDs. The similarity of two persons represented by two sets of images is determined as follows:

    \mathrm{Sim}(Q_i, G_j) = \prod_{l=1}^{m_i} \mathrm{Sim}(q_i^{(l)}, G_j),    (3.3)

where Sim(q_i^{(l)}, G_j) is determined as in the previous section for the image-to-images approach.

Figure 3.2. Proposed framework for images-to-images person ReID without temporal linking requirement: each probe image passes through an image-to-images ReID module, and the resulting ranked lists are combined by product-rule-based late fusion before matching and ranking.
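To make the fusion rules concrete, the following is a minimal sketch of Eqs. (3.1)-(3.3); array shapes and names are illustrative assumptions, not the thesis implementation:

    import numpy as np

    def product_fusion(sims, weights=None):
        """Late fusion of per-feature similarity scores, Eqs. (3.1)-(3.2).

        sims:    (M, N) similarity of the query to N gallery persons under M features
        weights: (M,) query-adaptive exponents w_q^(m); None -> equal weights, Eq. (3.1)
        """
        sims = np.asarray(sims, dtype=float)
        if weights is None:
            weights = np.ones(len(sims))
        # product rule with per-feature exponents, Eq. (3.2)
        return np.prod(sims ** np.asarray(weights)[:, None], axis=0)

    def images_to_images_sim(per_image_sims):
        """Eq. (3.3): combine the scores of the m_i probe images of person Q_i.

        per_image_sims: (m_i, N) similarity of each probe image to each gallery person
        """
        return np.prod(np.asarray(per_image_sims), axis=0)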
3.2.3 Obtained results on the first setting
For the first setting, two benchmark datasets, CAVIAR4REID and RAiD, are used for evaluation in the two scenarios, image-to-images and images-to-images. Images for the training and test phases are chosen randomly, and this process is repeated 10 times to ensure the objectivity of the obtained results. Two scenarios are considered for the CAVIAR4REID dataset: balanced (case A) and unbalanced (case B). In case A, each pedestrian has the same number of images in both the probe and the gallery sets. In case B, a number of images of each person is randomly chosen for the probe set and the remaining images are used for the gallery set.

3.2.3.1 Image-to-images person ReID
The first experiment evaluates the effectiveness of each considered feature. As seen in Fig. 3.3, despite being a hand-designed feature, GOG obtains results competitive with the CNN feature, a learned feature. The second experiment evaluates the efficiency of the three fusion schemes when utilizing two or three kinds of features, as shown in Fig. 3.4. By combining the three chosen features, the rank-1 matching rates of the three fusion schemes improve by 2% to 5% compared with using only the KDES and CNN features.

Figure 3.3. Performance of the three chosen features (GOG, KDES, CNN) over 10 trials on (a) CAVIAR4REID case A, (b) CAVIAR4REID case B, and (c) RAiD, in the image-to-images case.

Figure 3.4. Comparison of the three fusion schemes when using two or three features over 10 trials on (a) CAVIAR4REID case A, (b) CAVIAR4REID case B, and (c) RAiD, in the image-to-images case.

3.2.3.2 Images-to-images person ReID
By applying the product-rule strategy, image-to-images person ReID is mapped into the images-to-images one. Figure 3.5 shows the performance of images-to-images person ReID in case A of CAVIAR4REID. The rank-1 matching rates are 91.53%, 91.39%, and 88.06% for the GOG, KDES, and CNN features, respectively, and the three fusion schemes remain effective in the images-to-images case, with rank-1 matching rates of approximately 94.00%. The performance of the three fusion schemes is outstanding compared with the SDALF method in all experimental scenarios of this study. Table 3.1 shows the matching rates in case B of CAVIAR4REID and on RAiD; both show impressive results, with matching rates of up to 100% at rank-1. Table 3.2 summarizes the rank-1 matching rates of the proposed framework and state-of-the-art methods for both the image-to-images and images-to-images scenarios in case A of CAVIAR4REID.

Figure 3.5. CMC curves in case A of images-to-images person ReID on the CAVIAR4REID dataset.

Table 3.1. Matching rates (%) for images-to-images on (a) CAVIAR4REID (case B) and (b) RAiD.

    (a) CAVIAR4REID (case B)
    Methods               Rank=1  Rank=5   Rank=10  Rank=20
    SDALF [4]             81.67   96.11    98.06    98.89
    MvsM GOG+SVM          98.89   100.00   100.00   100.00
    MvsM KDES+SVM         98.75   99.86    100.00   100.00
    MvsM CNN+SVM          98.47   99.72    99.86    99.86
    MvsM Early-fusion     99.72   100.00   100.00   100.00
    MvsM Product-rule     99.58   99.86    99.86    99.86
    MvsM Query-adaptive   99.72   100.00   100.00   100.00

    (b) RAiD
    Methods               Rank=1  Rank=5   Rank=10  Rank=20
    SDALF [4]             86.05   93.02    95.35    100.00
    MvsM GOG+SVM          100.00  100.00   100.00   100.00
    MvsM KDES+SVM         99.07   99.07    99.07    99.30
    MvsM CNN+SVM          99.30   99.30    99.30    99.30
    MvsM Early-fusion     99.77   99.77    99.77    99.77
    MvsM Product-rule     98.37   98.37    98.60    98.60
    MvsM Query-adaptive   99.77   99.77    99.77    100.00
Table 3.2. Comparison of the image-to-images and images-to-images schemes at rank-1 in case A of CAVIAR4REID. (*) marks results obtained by applying the proposed strategies over 10 random trials.

    Methods              Image-to-images (N=5)  Images-to-images (N=5)
    SDALF                37.69                  67.50
    KDES                 65.50                  91.39 (*)
    LSTM                 -                      86.39 (*)
    WSC                  45.60                  61.10
    ISR                  -                      90.10
    DDLM                 80.10                  92.30
    The proposed method  73.61                  94.44

3.3 Fusion schemes for the second setting of person ReID
3.3.1 The proposed method
Fig. 3.6 shows the proposed method for multi-shot person ReID, obtained by adding a fusion module to the framework proposed in Chapter 2. As seen in Figure 3.6, the proposed method has five main steps: key-frame selection, image-level and sequence-level feature extraction, metric learning, query-adaptive late fusion, and matching and ranking. Two fusion schemes based on the multiplication and addition operators are examined, and the role of each feature is evaluated through its weight.

Figure 3.6. The proposed method for video-based person ReID, combining the fusion scheme with the metric learning technique: GOG and ResNet features are extracted from walking cycles or key frames of the gallery and probe sequences, pooled into sequence-level features, and combined by query-adaptive late fusion.

3.3.2 Obtained results on the second setting
The experimental results for the proposed fusion schemes on the PRID-2011 and iLIDS-VID datasets are shown in Figures 3.7 and 3.8. These figures show that not only are the independent GOG and ResNet features effective for the person ReID problem, but their combination achieves higher performance. For PRID-2011, the independent GOG feature appears stronger than ResNet; when applying the fusion schemes, the rank-1 matching rates increase by 5.65%, 5.47%, and 0.9% for the cases of four key frames, a walking cycle, and all frames, respectively. On the contrary, for iLIDS-VID, the ResNet feature provides higher performance than the GOG descriptor, and the rank-1 matching rates improve by 13.1%, 13.68%, and 14.13%. This can be explained by the fact that a deeper structure such as ResNet can learn the complex background and extract useful information for person representation.

Figure 3.7. Matching rates with different fusion schemes on the PRID-2011 dataset with (a) four key frames, (b) frames within a walking cycle, and (c) all frames.

Figure 3.8. Matching rates with different fusion schemes on the iLIDS-VID dataset with (a) four key frames, (b) frames within a walking cycle, and (c) all frames.

The above experimental results are compared with several existing works on person ReID in Table 3.3, with the two best results in bold. When using four key frames or a walking cycle, the rank-1 matching rates increase by up to 5.7% on PRID-2011 and 21.1% on iLIDS-VID, respectively.
Moreover, a more detailed analysis compares the effectiveness of the proposed framework with other studies that also follow the feature fusion approach. In [13], both hand-designed and deep-learned features, namely the Local Maximal Occurrence (LOMO) descriptor and a PCA-based Convolutional Network (PCN), are exploited for person representation. One remarkable point of that study is that the LOMO features are weighted according to their similarity to the LOMO feature of the Maximally Stable Video Frame (MSVF). However, by simply concatenating the LOMO and PCN features, the importance of each feature for each image/sequence is not considered. In [2], Chen et al. suggest jointly learning and incorporating both spatial and temporal information for person representation based on CNN and RNN. Although multiple neural networks with multi-loss layers are used, the rank-1 matching rates are 77.0% and 61.0% on PRID-2011 and iLIDS-VID, respectively, still much lower than the proposed method even when using only four key frames (82.0% and 62.6%). Inspired by the results obtained in [2], its authors extend the framework with multiple attention mechanisms to learn attention-based spatial-temporal feature fusion for better sequence representation, called the Comprehensive Feature Fusion Mechanism (CFFM). The best rank-1 matching rates of CFFM are 93.3% and 82.0%, higher by 1.8% and 0.2% than those of the proposed framework on PRID-2011 and iLIDS-VID, respectively; however, CFFM has to incorporate both CNN and RNN combined with multiple attention networks.

Table 3.3. Comparison between the proposed method and existing works on the PRID-2011 and iLIDS-VID datasets (matching rates, %; in the original, the two best results per column are in bold).

                                                 PRID-2011             iLIDS-VID
    Methods                                   R=1   R=5   R=20      R=1   R=5   R=20
    TAPR, ICIP 2016                           68.6  94.4  98.9      55.0  87.5  97.2
    AMOC+EpicFlow, TCSVT 2018                 83.7  98.3  100       68.7  94.3  99.3
    Two-stream MR, TII 2018                   78.7  95.2  99.2      59.4  89.8  99.1
    RNN, CVPR 2016                            70.0  90.0  97.0      58.0  84.0  96.0
    HOG3D+DVR, TPAMI 2016                     40.0  71.7  92.2      39.5  61.1  81.0
    STFV3D+KISSME, ICCV 2015                  64.1  87.3  92.0      44.3  71.7  91.7
    CAR, TCSVT 2017                           83.3  93.3  96.7      60.2  85.1  94.2
    DFCP, CVPR 2017                           51.6  83.1  95.5      34.5  63.3  84.4
    CRF, CVPR 2017                            77.0  93.0  98.0      61.0  85.0  97.0
    CFFM, SPIC 2020                           93.3  95.5  100.0     82.0  95.3  100.0
    GOG+XQDA (Chapter 2), four key frames     77.2  94.7  99.4      41.1  69.5  90.4
    GOG+XQDA (Chapter 2), a walking cycle     79.1  95.0  99.4      44.1  71.7  90.8
    GOG+XQDA (Chapter 2), all frames          90.6  98.4  100.0     70.1  92.7  99.1
    Proposed (product-rule), four key frames  82.8  96.2  99.7      57.5  83.1  95.6
    Proposed (product-rule), a walking cycle  84.6  96.8  99.7      60.6  84.8  96.2
    Proposed (product-rule), all frames       91.5  99.0  100.0     80.7  96.7  99.6
    Proposed (sum-rule), four key frames      82.0  96.0  99.7      62.2  85.4  96.3
    Proposed (sum-rule), a walking cycle      82.7  96.2  99.7      64.4  86.5  96.5
    Proposed (sum-rule), all frames           89.9  98.8  100.0     81.8  96.1  99.6

3.4 Conclusions
This chapter proposes several feature fusion schemes for person ReID in both settings.
For the first setting, person ReID is formulated as a classification-based information retrieval problem in which the time constraint between frames of the same person is not required. For the second setting, two late fusion schemes, query-adaptive product-rule-based and sum-rule-based, are introduced. The main results in this chapter are published in the 5th and 6th publications.

CHAPTER 4. QUANTITATIVE EVALUATION OF AN END-TO-END PERSON REID PIPELINE

4.1 Introduction
Most reported person ReID methods deal with human regions of interest (ROIs) that are extracted manually with well-aligned bounding boxes. Meanwhile, a practical person ReID system faces several challenges when bounding boxes are detected and tracked automatically: the bounding boxes may contain the entire human body or only parts of it; occlusions can occur with high frequency; and tracklet fragmentation and identity switches happen due to the sudden appearance/disappearance of pedestrians in the camera field of view during the tracking step. These factors certainly reduce the accuracy of person ReID. Consequently, the main purpose of this chapter is to perform a quantitative evaluation of an end-to-end person ReID pipeline. Within the limits of this thesis, a full person ReID pipeline is examined with three steps: human detection, segmentation, and person ReID.

4.2 A fully automated person ReID system
A fully automatic person ReID system is introduced in Figure 4.1. It contains four main steps: person detection, segmentation, tracking, and person ReID. While the person detection step aims at determining the person region (bounding box) in images captured from surveillance cameras, person segmentation removes the background from the person bounding box. Then, person bounding boxes within a camera field of view (FoV) are connected through the person tracking step. Finally, person ReID associates images of the same person when he or she moves from one camera FoV to another. It is worth noting that in some surveillance systems, person segmentation and person detection are coupled.

Figure 4.1. The proposed framework for a fully automatic person ReID system: human detection, segmentation (automatic/manual), tracking, and person re-identification between the probe and the gallery.

4.2.1 Pedestrian detection
Concerning person detection, three state-of-the-art techniques are considered: Aggregate Channel Features (ACF) [3], You Only Look Once (YOLO) [20], and Mask R-CNN [10]. For person segmentation, the Pedparsing method [17] is used thanks to its effectiveness on images cropped from the results of the above detectors. Another way to perform person detection and segmentation simultaneously is to use Mask R-CNN [10].

4.2.2 Pedestrian tracking
DeepSORT uses both motion and appearance information to compute the measurement metric in tracking. This technique is investigated in detail in the study of Nguyen et al. [19], in which the authors show that Mask R-CNN outperforms YOLO in the human detection task but requires larger resources for implementation.

4.2.3 Person ReID
The GOG descriptor and the XQDA technique are used for feature extraction and metric learning, respectively. To bring person ReID to practical applications, the GOG descriptor is re-implemented in C++ and its optimal parameters are selected through intensive experiments. The experimental results show that the proposed implementation extracts GOG features several times faster than the publicly available source code while achieving remarkably high accuracy for person ReID.
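For illustration, here is a hedged sketch of the detection-plus-segmentation front end, using torchvision's pretrained Mask R-CNN as a stand-in for the detectors evaluated in the thesis; the model choice, score threshold, and helper names are assumptions:

    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor

    # Pretrained COCO Mask R-CNN (torchvision >= 0.13 API) as an illustrative detector.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    def detect_pedestrians(image, score_thr=0.8):
        """Return person bounding boxes and binary masks from one RGB image (H, W, 3)."""
        with torch.no_grad():
            out = model([to_tensor(image)])[0]
        keep = (out["labels"] == 1) & (out["scores"] > score_thr)  # COCO class 1 = person
        boxes = out["boxes"][keep].round().int().tolist()          # [x1, y1, x2, y2]
        masks = out["masks"][keep, 0] > 0.5                        # threshold soft masks
        return boxes, masks

The boxes would feed the ReID stage directly in the detection-only configuration, while the masks would additionally remove the background in the detection-plus-segmentation configuration evaluated below.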
4.3 Performance evaluation of a fully automated person ReID system
In this section, the influence of the two preceding steps (human detection and segmentation) on the person ReID step is estimated. For this purpose, extensive experiments are conducted for both the single-shot and multi-shot cases on PRID-2011.

4.3.1 The effect of human detection and segmentation on person ReID in the single-shot scenario
In this experiment, one bounding box per individual is chosen randomly. For the human detection stage, ACF, YOLO, and Mask R-CNN are utilized; for segmentation, Mask R-CNN and Pedparsing are applied; and for feature extraction, the GOG descriptor is employed for person representation. The performance of the fully automatic system is evaluated in the cases indicated in Fig. 4.2, without and with segmentation.

Figure 4.2. CMC curves of the evaluated scenarios on the PRID-2011 dataset in the single-shot approach: (a) without segmentation and (b) with segmentation.

Several conclusions can be drawn from this figure. First, comparing the corresponding curves in the left and right panels, the segmentation stage gives worse results than applying detection only. This means that background removal with binary masks, which may cause information loss and destroy the smoothness of an image, is not an optimal choice for improving person ReID performance. Moreover, comparing the considered detectors, the ACF detector achieves better performance than YOLO in both cases (without/with segmentation), and its effectiveness can reach that of manual detection. One remarkable point is that Mask R-CNN provides impressive results that are competitive with manual detection. This brings hope that a fully automatic system can be practical.

4.3.2 The effect of human detection and segmentation on person ReID in the multi-shot scenario

Figure 4.3. CMC curves of three evaluated scenarios on the PRID-2011 dataset when applying the method proposed in Chapter 2: manual detection (rank-1 90.56%), automatic detection (91.01%), and automatic detection plus segmentation (88.76%).

Figure 4.3 shows the matching rates on the PRID-2011 dataset when employing GOG descriptors with the XQDA technique in three cases: (1) manual detection, (2) one of the considered automatic detection techniques, and (3) automatic segmentation using Pedparsing after the automatic detection stage. Interestingly, the person re-identification results with automatic detection are slightly better than those with manual detection. The reason is that automatic detection produces well-aligned bounding boxes, whereas manual annotation defines larger bounding boxes around pedestrians. Another remarkable point is that when the quality of person detection is relatively good, the segmentation step is not necessary.
Table 4.1 compares the proposed method with state-of-the-art methods. The matching rates when incorporating one of the considered human detection methods with the person ReID framework proposed in Chapter 2 are 91.0%, 98.4%, 99.3%, and 99.9% at the important ranks (1, 5, 10, 20). These values are much higher than those of all the aforementioned deep-learning-based approaches. Moreover, according to the quantitative evaluation of the trade-off between person ReID accuracy and computation time shown in Chapter 2, the total time for person re-identification is 11.5 s, 20.5 s, or 100.9 s when using four key frames, the frames within a cycle, or all frames, respectively. If a practical person ReID system is not required to return the correct match at the very first ranks, four key frames can be used to reduce computation time and memory requirements while still ensuring person ReID accuracy within the first 20 ranks. Besides, current human detection and tracking algorithms not only meet real-time requirements but also ensure high accuracy. These points support the feasibility of building a fully automatic person ReID system in practice.

Table 4.1. Comparison of the proposed method with state-of-the-art methods on PRID-2011 (in the original, the two best results per column are in bold).

    Methods                                       R=1   R=5    R=10  R=20
    HOG3D+DVR                                     40.0  71.1   84.5  92.2
    TAPR                                          68.6  94.4   97.4  98.9
    LBP-Color+LSTM                                53.6  82.9   92.8  97.9
    DFCP                                          51.6  83.1   91.0  95.5
    RNN                                           70.0  90.0   95.0  97.0
    Proposed, manual detection                    90.6  98.4   99.2  100
    Proposed, automatic detection                 91.0  98.4   99.3  99.9
    Proposed, automatic detection + segmentation  88.8  98.36  99.0  99.6

4.4 Conclusions and future work
In this chapter, a fully automated person ReID system with three main steps, person detection, segmentation, and person ReID, is considered. The obtained results show the influence of human detection and segmentation on ReID performance; however, this effect is much reduced thanks to the robustness of the descriptor and the metric learning. Additionally, if the automatic person detection step provides relatively good performance, segmentation is not required. The main results in this chapter are included in three publications: the 2nd, 3rd, and 4th ones.

CONCLUSION AND FUTURE WORKS

Conclusion
The first contribution is an effective method for video-based person re-identification through representative frame selection and temporal feature pooling. First, walking cycles are extracted, and then four key frames are selected from each cycle. The rank-1 matching rates on PRID-2011 are 77.19%, 79.10%, and 90.56% for the four-key-frames, one-walking-cycle, and all-frames schemes, while those on the iLIDS-VID dataset are 41.09%, 44.14%, and 70.13%, respectively. The obtained results show that the proposed method outperforms various state-of-the-art methods, including deep-learning ones. Additionally, the trade-off between person ReID accuracy and computation time is fully investigated.

The second contribution of this thesis is the set of fusion schemes proposed for both settings of person ReID. In the first setting, person ReID is formulated as a classification-based information retrieval problem in which a model of person appearance is learned from the gallery images and the identity of the person of interest is determined by the probability that his/her probe image belongs to the model. Three fusion schemes are proposed: early fusion, product-rule late fusion, and query-adaptive late fusion. Several experiments conducted on the CAVIAR4REID and RAiD datasets prove the effectiveness of the fusion schemes, with rank-1 matching rates of 94.44%, 99.72%, and 100% in case A and case B of CAVIAR4REID and on RAiD, respectively. In the second setting, the proposed method of the first contribution is extended by adding fusion schemes.
In this case, the fusion schemes are built on the product-rule and sum-rule operators. The experiments performed on two benchmark datasets (PRID-2011 and iLIDS-VID) show remarkable improvement: the rank-1 matching rates increase by up to 5.65% and 14.13% on PRID-2011 and iLIDS-VID, respectively, compared with using only a single feature.

Future works
Short term:
• Experiments will be conducted on some large-scale datasets (MARS, Market-1501).
• A tracking algorithm will be studied and developed in order to improve person ReID performance.
• A full evaluation of person detection and tracking will be assessed.
Long term:
• Combining motion and appearance-based features for person re-identification.
• Spatial and temporal attention and saliency for person re-identification.
• Unsupervised person re-identification.
• Open-world person re-identification.

Bibliography
[1] Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Kernel descriptors for visual recognition. In Advances in Neural Information Processing Systems, pages 244-252, 2010.
[2] Lin Chen, Hua Yang, Ji Zhu, Qin Zhou, Shuang Wu, and Zhiyong Gao. Deep spatial-temporal fusion network for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 63-70, 2017.
[3] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532-1545, 2014.
[4] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. Person re-identification by symmetry-driven accumulation of local features. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2360-2367. IEEE, 2010.
[5] Mayssa Frikha, Omayma Chebbi, Emna Fendri, and Mohamed Hammami. Key frame selection for multi-shot person re-identification. In International Workshop on Representations, Analysis and Recognition of Shape and Motion FroM Imaging Data, pages 97-110. Springer, 2016.
[6] Changxin Gao, Jin Wang, Leyuan Liu, Jin-Gang Yu, and Nong Sang. Temporally aligned pooling representation for video-based person re-identification. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 4284-4288. IEEE, 2016.
[7] Shaogang Gong, Marco Cristani, Chen Change Loy, and Timothy M. Hospedales. The re-identification challenge. In Person Re-identification, pages 1-20. Springer, 2014.
[8] Yousra Hadj Hassen, Walid Ayedi, Tarek Ouni, and Mohamed Jallouli. Multi-shot person re-identification approach based key frame selection. In Eighth International Conference on Machine Vision (ICMV 2015), volume 9875, page 98751H. International Society for Optics and Photonics, 2015.
[9] Yousra Hadj Hassen, Kais Loukil, Tarek Ouni, and Mohamed Jallouli. Images selection and best descriptor combination for multi-shot person re-identification. In International Conference on Intelligent Interactive Multimedia Systems and Services, pages 11-20. Springer, 2017.
[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961-2969, 2017.
[11] Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2288-2295. IEEE, 2012.
[12] Qingming Leng, Mang Ye, and Qi Tian. A survey of open-world person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[13] Youjiao Li, Li Zhuo, Jiafeng Li, Jing Zhang, Xi Liang, and Qi Tian. Video-based person re-identification by deep feature guided pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 39-46, 2017.
[14] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2197-2206, 2015.
[15] Hao Liu, Zequn Jie, Karlekar Jayashree, Meibin Qi, Jianguo Jiang, Shuicheng Yan, and Jiashi Feng. Video-based person re-identification with accumulative motion context. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2788-2802, 2017.
[16] Kan Liu, Bingpeng Ma, Wei Zhang, and Rui Huang. A spatio-temporal appearance representation for video-based pedestrian re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 3810-3818, 2015.
[17] Ping Luo, Xiaogang Wang, and Xiaoou Tang. Pedestrian parsing via deep decompositional network. In Proceedings of the IEEE International Conference on Computer Vision, pages 2648-2655, 2013.
[18] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. Hierarchical Gaussian descriptor for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1363-1372, 2016.
[19] Hong-Quan Nguyen, Thuy-Binh Nguyen, Tuan-Anh Le, Thi-Lan Le, Thanh-Hai Vu, and Alexis Noe. Comparative evaluation of human detection and tracking approaches for online tracking applications. In 2019 International Conference on Advanced Technologies for Communications (ATC), pages 348-353. IEEE, 2019.
[20] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[21] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by discriminative selection in video ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(12):2501-2514, 2016.
[22] Xiaogang Wang, Gianfranco Doretto, Thomas Sebastian, Jens Rittscher, and Peter Tu. Shape and appearance context modeling. In 2007 IEEE 11th International Conference on Computer Vision, pages 1-8. IEEE, 2007.
[23] Zhiqiang Zeng, Zhihui Li, De Cheng, Huaxiang Zhang, Kun Zhan, and Yi Yang. Two-stream multirate recurrent neural network for video-based pedestrian reidentification. IEEE Transactions on Industrial Informatics, 14(7):3179-3186, 2017.
[24] Wei Zhang, Shengnan Hu, and Kan Liu. Learning compact appearance representation for video-based person re-identification. arXiv preprint arXiv:1702.06294, 2017.
[25] Liang Zheng, Shengjin Wang, Lu Tian, Fei He, Ziqiong Liu, and Qi Tian. Query-adaptive late fusion for image search and person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1741-1750, 2015.
[26] Liang Zheng, Yi Yang, and Alexander G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.

PUBLICATIONS
[1] Thuy-Binh Nguyen, Thi-Lan Le, Dinh-Duc Nguyen, and Dinh-Tan Pham (2018). A Reliable Image-to-Video Person Re-identification Based on Feature Fusion. In 10th Asian Conference on Intelligent Information and Database Systems (ACIIDS), Springer, Vietnam, ISBN 978-3-319-75416-1, pp. 433-442.
[2] Thuy-Binh Nguyen, Duc-Long Tran, Thi-Lan Le, Thi Thanh Thuy Pham, and Huong-Giang Doan (2018). An Effective Implementation of Gaussian of Gaussian Descriptor for Person Re-identification. In 5th NAFOSTED Conference on Information and Computer Science (NICS), IEEE, Vietnam, ISBN 978-1-4673-8013-3, pp. 388-393.
[3] Thuy-Binh Nguyen, Hong-Quan Nguyen, Thi-Lan Le, Thi Thanh Thuy Pham, and Ngoc-Nam Pham (2019). A Quantitative Analysis of the Effect of Human Detection and Segmentation Quality in Person Re-identification Performance. In 2nd International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pp. 1-6, IEEE.
[4] Thuy-Binh Nguyen, Trong-Nghia Nguyen, Hong-Quan Nguyen, and Thi-Lan Le (2020). How Feature Fusion Can Help to Improve Multi-shot Person Re-identification Performance? In 3rd International Conference on Multimedia Analysis and Pattern Recognition (MAPR).
[5] Thuy-Binh Nguyen, Thi-Lan Le, and Ngoc-Nam Pham (2018). Fusion Schemes for Image-to-Video Person Re-identification. Journal of Information and Telecommunication, ISSN 2475-1839 (print), 2475-1847 (online), DOI: 10.1080/24751839.2018.1531233, pp. 74-94.
[6] Thuy-Binh Nguyen, Thi-Lan Le, and Ngoc-Nam Pham (2018). Images-to-Images Person ReID Without Temporal Linking. International Journal of Computational Vision and Robotics, ISSN 1752-9131 (print), 1752-914X (online), pp. 152-171 (Scopus).
[7] Thuy-Binh Nguyen, Thi-Lan Le, Louis Devillaine, Thi Thanh Thuy Pham, and Ngoc-Nam Pham (2019). Effective Multi-shot Person Re-identification through Representative Frames Selection and Temporal Feature Pooling. Multimedia Tools and Applications, ISSN 1380-7501 (print), 1573-7721 (online), DOI: 10.1007/s11042-019-08183-y (ISI).