HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER THESIS
Moving object tracking using Fully Convolutional neural network

Quang Minh Bui
Minh.BQCB190097@sis.hust.edu.vn
Control Engineering and Automation

Supervisor: PhD Tran Thi Thao
School: School of Electrical Engineering

HANOI, 4/2021

Acknowledgement

This thesis would not have been possible without the inspiration and support of many people, and I would like to extend my appreciation to everyone who has been a part of the journey. First and foremost, I would like to express my sincere gratitude to my research supervisors, Dr Pham Van Truong and Dr Tran Thi Thao at Hanoi University of Science and Technology, for their consistent guidance, encouragement and supportive advice during the time I did my research. In particular, I would like to thank Dr Tran Thi Thao for the initial idea that I later developed and presented in this thesis. I am grateful to Hanoi University of Science and Technology for giving me the scholarship that encouraged me on such a long journey. I would also like to thank my colleagues at Viettel Corporation for their support with HPC-related issues, which boosted the progress of my work.

Abstract

Visual Object Tracking is one of the most fundamental and critical tasks in computer vision due to its wide range of uses in both civilian and military applications such as video surveillance, traffic monitoring and autonomous vehicles. The central problem of Visual Object Tracking is to precisely determine the position of an arbitrary object in a video when its initial state is given in the first frame of the sequence. Although many researchers have been putting a lot of effort into tackling the problem, object tracking remains challenging due to factors such as occlusion and illumination change. This thesis proposes a novel method to address those problems by adopting an appearance-based object tracking approach empowered by deep features from Siamese networks. Siamese networks put forward a simple framework for tracking yet achieve remarkable performance in terms of the balance between accuracy and speed. However, their performance degrades under fast target appearance changes because of poor discrimination from similar objects and cluttered background, which act as distractors. Therefore, in this work, we propose a mechanism to better represent the object of interest and enhance the discriminative capability of the tracker. We present an improvement to Siamese networks by integrating the Convolutional Block Attention Module (CBAM) into the baseline, in which attention not only tells the model to concentrate on important features but also suppresses distractors to enhance the representation of the object of interest. As a result, the discriminative capability, adaptability and robustness of the tracker are increased. Experimental results on the two popular benchmarks OTB2015 and VOT2018 show that our approach achieves remarkable accuracy and robustness while maintaining a tracking speed that is practical for real-world applications.

Master student

Table of Contents

Chapter 1: Introduction
  1.1 Background
    1.1.1 Challenges in visual object tracking
    1.1.2 The general framework of visual object tracking
  1.2 Motivation of this study
  1.3 Contribution of Thesis
  1.4 Outline
  1.5 Publications related to this research
Chapter 2: Literature Review
  2.1 Classification of tracking methods
  2.2 Generative tracking methods
    2.2.1 Traditional object trackers
    2.2.2 CNN-based object trackers
  2.3 Discriminative tracking methods
    2.3.1 Traditional object trackers
    2.3.2 CNN-based object trackers
  2.4 Chapter summary
Chapter 3: Proposed Method
  3.1 Baseline study
    3.1.1 SiamFC Architecture
    3.1.2 Analysis on SiamFC tracker
  3.2 Proposed Network Architecture
    3.2.1 Channel Attention
    3.2.2 Spatial Attention
  3.3 Offline Training
  3.4 Chapter summary
Chapter 4: Experimental Result and Conclusion
  4.1 Benchmark datasets
    4.1.1 Tracking metrics
    4.1.2 Benchmark datasets
  4.2 Experimental result
    4.2.1 Implementation detail
    4.2.2 Ablation study
    4.2.3 Comparison to other object trackers
  4.3 Chapter summary
  4.4 Conclusion
  4.5 Future work

List of Figures

1.1.1 Challenging cases in visual object tracking
1.1.2 General framework of visual object tracking
2.1.1 Two main categories of object tracking methods
2.1.2 Classification of appearance-based tracking methods
2.2.1 General framework of traditional generative tracking methods. The method referred to in the image is presented in [1]. Image courtesy of [2]
2.2.2 Representative illustration of Siamese-based tracking methods
2.3.1 Illustration of a typical discriminative tracking framework. Image courtesy of [2]
2.3.2 General framework of a correlation filter based tracker. Image courtesy of [3]
2.3.3 The general framework of CNN-based classification trackers. Image courtesy of [4]
3.1.1 SiamFC network architecture. Image courtesy of [5]
3.1.2 The similarity function in SiamFC produces high confidence scores for distractive objects of the same class as the target
3.1.3 The padded values creep further into the center as we go down the network
3.2.1 Proposed network architecture
3.2.2 Overview of the CBAM block. Image courtesy of [6]
3.2.3 Channel attention architecture. Image taken from [6]
3.2.4 Illustration of the average and max pooling operations
3.2.5 A representative example of an MLP structure
3.2.6 Spatial attention architecture. Image courtesy of [6]
3.3.1 Pair of exemplar and search images for training. Images are cropped from the same video. If the boundary of the window is beyond the actual content of the image, the boundary is padded with the average color
4.1.1 An illustration of the overlap region. Image courtesy of [7]
4.1.2 EAO measurement illustration. Image taken from [8]
4.2.1 Similarity score map comparison between the SiamFC tracker and the proposed tracker
4.2.2 Effectiveness of our proposed CBAM modules
4.2.3 Success plot comparison on the OTB100 benchmark
4.2.4 Precision plot comparison on the OTB100 benchmark
4.2.5 Qualitative comparison of different trackers on the OTB100 benchmark. Example sequences are Ironman, MotorRolling, Skiing. Best viewed in color
4.2.6 Qualitative comparison of different trackers on the VOT2018 benchmark. Example sequences are ants, gymnastic1, iceskater1. Best viewed in color

List of Tables

4.1 Backbone network parameters for object representation
4.2 SiamCBAM tracker performance on OTB100 with different training procedures. + denotes that an attention module is used in the architecture; ∗ denotes that the module is trained with layers 4 and 5 of the backbone network unfrozen
4.3 SiamCBAM tracker performance on OTB100 with different values of the reduction ratio
4.4 SiamCBAM tracker performance on VOT2018 with different values of the reduction ratio
4.5 Performance comparison between our tracker and other methods on VOT2018

Chapter 1: Introduction

1.1 Background
The number of cameras being set up all over the world has increased drastically in the past decade due to the need for public surveillance to monitor activity within critical areas and to detect potential threats in order to protect people and infrastructure from harm. Consequently, system operators have been bombarded with an enormous amount of data captured from surveillance cameras. It is therefore of great importance to facilitate their work by developing automated systems that act as assistants and automatically provide visual analysis. There are many sub-tasks in such systems, including anomalous activity detection, video reasoning, human-computer interaction and object navigation. Most of these tasks are related to and dependent on the results of a visual tracking algorithm. For example, successfully following the state of a tracked object provides the essential information for monitoring and thoroughly understanding its activity over time, in order to decide whether an abnormal event is likely to happen. Therefore, visual tracking subsystems are crucial components of any modern visual intelligent system.

Given the initial state of an arbitrary object in the first frame of a video, the objective of a visual tracking module is to accurately identify the states of that object in the rest of the sequence. To be more specific, the initial state of the object is simply a bounding box that marks the region where the object is located in the first frame. No further information about the object of interest is provided other than the raw pixels inside the initialized box in the first frame of the sequence. From those data, the patch of the tracked object is encoded by one of numerous methods to form a model that can be used to represent and update the state of the target in subsequent frames. Depending on the requirements of a specific application, other properties of the tracking system, such as the number of objects to be tracked, may vary, which gives rise to different approaches to solve the problem. There are basically two categories of visual tracking problems intensively investigated by the community:

• Single Object Tracking: focuses on developing robust algorithms to track only one object in a video.
• Multiple Object Tracking: handles the trajectories of multiple objects at the same time.

In this thesis, we will focus on the first category and comprehensively discuss the main problems and solutions of visual single object tracking. In the next section, we discuss several challenging scenarios and present the general framework for constructing an efficient tracker.

1.1.1 Challenges in visual object tracking

Many challenging factors should be taken into consideration when designing an object tracking algorithm.

• Appearance change: the change in size or shape of an object that makes a large difference between its appearance in the first frame and in subsequent frames. For example, an object moving away from or toward the camera becomes smaller or bigger, which challenges the quality of the object representation; this case is referred to as scale variation. The problem becomes even more severe if the appearance of the object transforms rapidly within a few frames, which hinders the performance of trackers since the object representation is hard to update accordingly; such a scenario is referred to as deformation. Another typical type of appearance change is occlusion, which happens when an object is partially or even fully hidden. Under full occlusion, the object is completely absent from the images for a certain amount of time. Therefore, the capability to memorize and generalize the
object representation is crucial for a tracker to successfully follow the trajectory of the object when it re-appears in subsequent frames. Furthermore, this scenario requires a search mechanism to determine where in the frame the object would re-appear. This is another problem in visual object tracking, known as Long-term Visual Object Tracking. However, in this thesis, we will not cover the full occlusion problem but only concentrate on developing a tracker for the Short-term Visual Object Tracking problem.

• Motion blur: caused by the relative movement of the recording device, which changes the light exposure during the time images are recorded. This is a common problem when working with a non-stationary camera or when tracking a fast moving object. In such cases, the appearance of the object can vary sharply, which deteriorates the localization process.

• Illumination change: illumination variation on an object makes it brighter or darker, which may cause the loss of features on the object surface and hence degrades the object appearance modelling process.

• Background clutter: an area surrounding the target that contains several similar objects, or that has no significant boundary with the target, can easily fool object trackers, since they become confused about exactly where the object of interest is. In this case, object trackers are prone to drift away from the actual target.

• Real-time processing requirements: a tracking algorithm is only practical if its inference time is fast enough to satisfy the operating time constraints of a particular application. Otherwise, it is only meaningful for research purposes.

There may be other challenging scenarios to be addressed in specific tracking sequences, but the aforementioned issues are not only the most common ones in the real world but also the main problems evaluated in many popular tracking benchmarks. Therefore, the main objective of visual tracking researchers is to develop accurate and robust trackers that handle such problems. Some examples of the aforementioned issues are shown in Figure 1.1.1.

Figure 1.1.1: Challenging cases in visual object tracking

1.1.2 The general framework of visual object tracking

The general framework of visual object tracking is illustrated in Figure 1.1.2. As mentioned in section 1.1, in the first frame of a video, a bounding box surrounding the target object is initialized and used to extract the important details that clearly define the object of interest in the image. This information is referred to as features. In Figure 1.1.2, the object representation phase is called feature extraction. In machine learning in general, feature extraction starts from an initial set of observed data (in this case, the initial bounding box) and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretation [9]. In computer vision, there are two main ways to extract features: traditional hand-crafted feature extractors (such as HOG [10], SIFT [11], SURF [12] and ORB [13]) and deep features (using a deep neural network to learn features from images). After feature extraction, the raw data (images) are transformed into feature vectors feeding the learning algorithm in order to build an object model or a classifier that performs the tasks required by a specific application.

In Figure 1.1.2, after constructing the object model from the initial state in the first frame, we enter the tracking loop, which operates repeatedly on the subsequent frames. Whenever a new frame is fed into the tracking system, a set of region candidates is proposed, based on the object model that was built, to mark potential locations where the object could be in the current frame. From those candidates, we carry out the localization process to find the actual location of the target. Once the object is located, the feature extractor is used to represent the instantaneous object and update the object model.
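To make the loop above concrete, the following sketch implements the candidate-proposal, localization and model-update steps with deliberately simple stand-ins: a color histogram in place of HOG or deep features, and a grid search around the previous box in place of a learned localization step. It is an illustrative toy written for this section, not the method proposed in this thesis, and all function names are hypothetical.

```python
# A minimal sketch of the generic tracking loop described above, not the thesis
# implementation: features, candidate search and model update are deliberately
# simplistic (a color-histogram template matched over a local search grid).
import numpy as np

def extract_features(frame, box):
    """Represent a region by a normalized color histogram (stand-in for HOG/deep features)."""
    x, y, w, h = box
    patch = frame[y:y + h, x:x + w]
    hist, _ = np.histogramdd(patch.reshape(-1, 3), bins=(8, 8, 8), range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / (hist.sum() + 1e-8)

def propose_candidates(box, frame_shape, radius=16, step=8):
    """Propose candidate boxes on a grid around the previous location."""
    x, y, w, h = box
    H, W = frame_shape[:2]
    for dx in range(-radius, radius + 1, step):
        for dy in range(-radius, radius + 1, step):
            nx, ny = x + dx, y + dy
            if 0 <= nx <= W - w and 0 <= ny <= H - h:
                yield (nx, ny, w, h)

def track(frames, init_box, update_rate=0.1):
    """Run the loop: localize the target in each frame, then update the object model."""
    model = extract_features(frames[0], init_box)
    box, trajectory = init_box, [init_box]
    for frame in frames[1:]:
        # Localization: pick the candidate whose features best match the model.
        candidates = list(propose_candidates(box, frame.shape))
        scores = [-np.linalg.norm(extract_features(frame, c) - model) for c in candidates]
        box = candidates[int(np.argmax(scores))]
        # Model update: blend in the appearance of the newly located object.
        model = (1 - update_rate) * model + update_rate * extract_features(frame, box)
        trajectory.append(box)
    return trajectory

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.integers(0, 256, size=(5, 240, 320, 3), dtype=np.uint8)
    print(track(list(frames), (100, 80, 40, 40))[:3])
```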
Figure 4.1.2: EAO measurement illustration. Image taken from [8]

Based on the precomputed values of accuracy and robustness in each frame, these raw values are fused to form a plot called the expected average overlap curve, shown in the upper-left plot of Figure 4.1.2. Given an Ns-frame video from the short-term tracking dataset, an object tracker is initialized at the first frame of the video and its performance is calculated as the sequence goes on. If the tracker loses the target, it is not re-initialized, and the overlap of the tracker prediction remains zero for the rest of the video. The performance of the tracker under this evaluation protocol is then calculated by averaging the overlaps in each frame, \Phi_i, including those after its failure:

\Phi_{N_s} = \frac{1}{N_s} \sum_{i=1}^{N_s} \Phi_i \qquad (4.4)

The above equation gives the average overlap of a tracker on an Ns-frame video. Taking into consideration all the videos in a large dataset, we extract exactly Ns consecutive frames from every video and then average the overlap of the tracker over that set of Ns-frame sequences to obtain the expected average overlap with respect to Ns:

\hat{\Phi}_{N_s} = \left\langle \Phi_{N_s} \right\rangle \qquad (4.5)

Each point on the expected average overlap curve is obtained by evaluating this measure for Ns-frame videos, where Ns takes values up to Nmax. Once the expected average overlap curve is plotted, the EAO measure is calculated by averaging the EAO curve over an interval [Nlo, Nhi] of sequence lengths:

\hat{\Phi} = \frac{1}{N_{hi} - N_{lo}} \sum_{N_s = N_{lo}}^{N_{hi}} \hat{\Phi}_{N_s} \qquad (4.6)

The upper and lower bounds of the interval are determined by a special estimation process. Further details about the VOT2018 benchmark and its measurement protocol can be found in [18].
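As a reference for how the numbers reported later are produced, here is a compact sketch of the EAO computation in Eqs. (4.4)–(4.6). It assumes the per-frame overlaps are already available with values after a failure set to zero, and it takes the interval [Nlo, Nhi] as an explicit argument instead of estimating it from the dataset as the VOT toolkit does; the function names are ours.

```python
# Sketch of the EAO computation of Eqs. (4.4)-(4.6). Inputs are per-frame overlap
# lists for each sequence, with overlaps after a tracking failure already set to 0.
# The interval [n_lo, n_hi] is supplied by the caller rather than estimated.
from typing import List
import numpy as np

def average_overlap(overlaps: List[float], n_s: int) -> float:
    """Eq. (4.4): mean overlap over the first n_s frames of one sequence."""
    return float(np.mean(overlaps[:n_s]))

def expected_average_overlap(all_overlaps: List[List[float]], n_s: int) -> float:
    """Eq. (4.5): average of Phi_{N_s} over all sequences at least n_s frames long."""
    values = [average_overlap(seq, n_s) for seq in all_overlaps if len(seq) >= n_s]
    return float(np.mean(values)) if values else 0.0

def eao(all_overlaps: List[List[float]], n_lo: int, n_hi: int) -> float:
    """Eq. (4.6): average of the EAO curve over the interval [n_lo, n_hi] (here a simple mean)."""
    curve = [expected_average_overlap(all_overlaps, n_s) for n_s in range(n_lo, n_hi + 1)]
    return float(np.mean(curve))

if __name__ == "__main__":
    # Two toy sequences: the second one fails halfway, so its later overlaps are zero.
    seq_a = [0.8, 0.7, 0.75, 0.72, 0.7, 0.68]
    seq_b = [0.6, 0.55, 0.0, 0.0, 0.0, 0.0]
    print(round(eao([seq_a, seq_b], n_lo=2, n_hi=5), 3))
```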
4.2 Experimental result

4.2.1 Implementation detail

Training. We implemented our tracker in the PyTorch framework. Our model is trained on the data provided in the GOT-10k dataset [82], which has nearly 10000 videos covering nearly 1000 classes of objects. From each sequence of the dataset, we extract 30 image pairs such that the two images in a pair are less than 50 frames apart. The model is trained for 50 epochs using a learning rate exponentially decayed from 10^-2 to 10^-5 with a batch size of 64. Our backbone network for object representation is inherited from the convolutional stage of AlexNet [83]. The backbone network parameters are detailed in Table 4.1.

Table 4.1: Backbone network parameters for object representation

Layer   Exemplar size   Search size   Channels   Kernel size   Stride
input   127 × 127       255 × 255     3          –             –
conv1   59 × 59         123 × 123     96         11            2
pool1   29 × 29         61 × 61       96         3             2
conv2   25 × 25         57 × 57       256        5             1
pool2   12 × 12         28 × 28       256        3             2
conv3   10 × 10         26 × 26       384        3             1
conv4   8 × 8           24 × 24       384        3             1
conv5   6 × 6           22 × 22       256        3             1

The tracker built on this backbone network is referred to as the SiamCBAM tracker. Apart from the baseline backbone shown in Table 4.1, we also experiment with a bigger variant obtained by doubling the number of channels in each layer; we refer to this variant as the SiamCBAMBig tracker. As mentioned in section 3.2, the channel reduction ratio r in the channel attention module is used to reduce the dimension of the channel descriptor, and we train our model with different values of r to find the best tracker.

For the SiamCBAM tracker, we used the pretrained model from the SiamFC tracker and performed different training strategies. The first is to freeze the backbone network and train the CBAM module with only the channel attention. The second is to freeze the backbone and train the CBAM module including both channel and spatial attention. Finally, we unfreeze the 4th and 5th convolutional layers of the backbone and train with the complete CBAM module.

Tracking. The tracker is initialized by computing the feature map of the target image along the exemplar branch. Once the template feature map is calculated, the exemplar branch can be omitted; online tracking only requires the search branch to output the feature map of the search image, which is compared to the fixed template feature map. To handle scale variation, we also search for the object over the scales 1.0375^{-1,0,1}. To provide a smooth transition between different scales, we update the scale by linear interpolation with a damping factor of 0.59. The proposed tracker is implemented using the PyTorch framework, and all experiments are conducted on an NVIDIA Quadro P4000 GPU.

4.2.2 Ablation study

The effectiveness of our proposed tracker in comparison with the SiamFC tracker can be seen in Figure 4.2.1. It clearly shows that the similarity score map of the SiamFC tracker produces high response values for objects similar to the target, whereas the score map generated by our proposed tracker indicates that the model is now able to focus on the important parts of the target and ignore other distractors.

Figure 4.2.1: Similarity score map comparison between the SiamFC tracker and the proposed tracker

A qualitative comparison between the SiamFC tracker and our tracker is shown in Figure 4.2.2.

Figure 4.2.2: Effectiveness of our proposed CBAM modules

Figure 4.2.2 compares the performance of our tracker and the baseline SiamFC tracker on the two sequences Bolt and MotorRolling. It can be observed that our tracker is able to distinguish the object of interest from similar objects in the sequence Bolt and from the cluttered background in the sequence MotorRolling, while the SiamFC tracker fails to track the target and switches to a distractor object or to the background. This shows that integrating the attention modules into the baseline tracker improves its discriminative capability and hence boosts the accuracy and robustness of the tracker.

Effectiveness of different training strategies. As mentioned in section 4.2.1, we trained our model with the different training procedures and tuned for the best performance on OTB100. In this experiment, the reduction ratio of the channel attention module is fixed to a single value of r.

Table 4.2: SiamCBAM tracker performance on OTB100 with different training procedures. + denotes that the attention module is used in the architecture; ∗ denotes that the module is trained while layers 4 and 5 of the backbone network are unfrozen.

Channel Att   Spatial Att   AUC     Precision
+             –             0.585   0.786
+             +             0.588   0.788
+∗            –             0.581   0.786
+∗            +∗            0.600   0.808

Note that all results in this table are obtained with the same fixed reduction ratio r. It can be seen that the proposed architecture performs best when both spatial and channel attention are used and layers 4 and 5 of the backbone network are unfrozen. This configuration will be used in the forthcoming experiments.

Hyperparameter tuning. As mentioned in section 3.2, the channel attention module contains a reduction ratio r to compress the dimension of the feature descriptor.
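For illustration, the sketch below shows a generic CBAM-style channel attention block in PyTorch, following the published design of [6]: global average and max pooling produce two channel descriptors, a shared MLP compresses them by the reduction ratio r, and the summed, sigmoid-activated output reweights the feature channels. It is a re-implementation written for this section, not necessarily the exact module used in our tracker.

```python
# A generic CBAM-style channel attention block (after [6]) showing how the
# reduction ratio r compresses the channel descriptor inside the shared MLP.
# Illustrative re-implementation, not the exact module used in the thesis.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP: C -> C/r -> C, applied to both pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg_desc = x.mean(dim=(2, 3))        # global average pooling -> (B, C)
        max_desc = x.amax(dim=(2, 3))        # global max pooling     -> (B, C)
        attn = torch.sigmoid(self.mlp(avg_desc) + self.mlp(max_desc))
        return x * attn.view(b, c, 1, 1)     # reweight feature channels

if __name__ == "__main__":
    feats = torch.randn(2, 256, 6, 6)        # e.g. conv5-sized exemplar features
    out = ChannelAttention(channels=256, reduction=16)(feats)
    print(out.shape)                         # torch.Size([2, 256, 6, 6])
```

A smaller r keeps more capacity in the shared MLP at the cost of extra parameters, which is why the experiments below sweep several values of r.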
We conducted experiments on both the VOT2018 and OTB100 benchmarks to verify the best choice of the hyperparameter r. The results are shown in Tables 4.3 and 4.4.

Table 4.3: SiamCBAM tracker performance on OTB100 with different values of the reduction ratio

Reduction ratio (r)   AUC     Precision score
32                    0.595   0.796
16                    0.598   0.799
–                     0.593   0.801
–                     0.600   0.808

Table 4.4: SiamCBAM tracker performance on VOT2018 with different values of the reduction ratio

Reduction ratio (r)   Accuracy (A)   Robustness (R)   EAO
32                    0.500          0.576            0.210
16                    0.483          0.562            0.206
–                     0.496          0.562            0.216
–                     0.488          0.571            0.207

As we can see from the two tables, the reduction ratio that gives the best result on the OTB100 benchmark is not the same as the one that gives the highest EAO on VOT2018.

4.2.3 Comparison to other object trackers

Evaluation on OTB100. The success plot and precision plot of several trackers on OTB100 are shown in Figures 4.2.3 and 4.2.4, respectively. It can be seen from those figures that our proposed trackers not only outperform the baseline SiamFC tracker in both precision score and AUC but also perform better than other popular trackers, especially our big version SiamCBAMBig. To be more specific, our SiamCBAM tracker outperforms SiamFC in precision score and AUC by margins of 1.7% and 2.6% respectively, whereas the margins for our SiamCBAMBig tracker are 6.4% and 6.3%. However, SiamCBAMBig achieves this remarkable result by sacrificing tracking speed: while the SiamCBAM tracker operates at 97 FPS, the large number of extra parameters in the backbone network of SiamCBAMBig means that it can only run at 26 FPS.

Figure 4.2.3: Success plot comparison on the OTB100 benchmark

Figure 4.2.4: Precision plot comparison on the OTB100 benchmark
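The success and precision plots above follow the standard OTB evaluation protocol: the success plot thresholds the per-frame intersection over union and reports its area under the curve (AUC), while the precision score is the fraction of frames whose center location error is within 20 pixels. The sketch below summarizes that common protocol under those assumptions; it is our own reference code, not code from the thesis.

```python
# Minimal sketch of the standard OTB success/precision protocol used for the
# plots above: success rate vs IoU threshold (AUC) and precision at 20 px.
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Average success rate over IoU thresholds (AUC of the success plot)."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def precision_at(pred_boxes, gt_boxes, pixel_threshold=20.0):
    """Fraction of frames whose center location error is within the pixel threshold."""
    errors = []
    for p, g in zip(pred_boxes, gt_boxes):
        pc = (p[0] + p[2] / 2.0, p[1] + p[3] / 2.0)
        gc = (g[0] + g[2] / 2.0, g[1] + g[3] / 2.0)
        errors.append(np.hypot(pc[0] - gc[0], pc[1] - gc[1]))
    return float((np.array(errors) <= pixel_threshold).mean())

if __name__ == "__main__":
    gt = [(10, 10, 40, 40), (12, 11, 40, 40), (15, 13, 40, 40)]
    pred = [(11, 10, 38, 42), (20, 18, 40, 40), (60, 60, 40, 40)]
    print(success_auc(pred, gt), precision_at(pred, gt))
```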
A qualitative comparison of different trackers on the OTB100 benchmark is shown in Figure 4.2.5.

Figure 4.2.5: Qualitative comparison of different trackers on the OTB100 benchmark. Example sequences are Ironman, MotorRolling, Skiing. Best viewed in color

Evaluation on VOT2018. The performance of our trackers is evaluated based on the accuracy, robustness and EAO metrics; the details of these metrics were presented in section 4.1.2. In this section we compare the performance of our tracker to other popular trackers, including C3DT [84][75], SiamFC [5], DSiam [32], DCFNet [85], Staple [63], SRDCF [52], MIL [86] and DSST [58]. Their results are listed in Table 4.5.

Table 4.5: Performance comparison between our tracker and other methods on VOT2018

Tracker          Accuracy (A)   Robustness (R)   EAO
SiamCBAM         0.496          0.562            0.216
SiamCBAMBig      0.532          0.501            0.212
C3DT [84][75]    0.522          0.496            0.209
SiamFC [5]       0.503          0.585            0.188
DSiam [32]       0.512          0.646            0.196
DCFNet [85]      0.470          0.543            0.182
Staple [63]      0.530          0.688            0.169
SRDCF [52]       0.490          0.974            0.119
MIL [86]         0.394          1.011            0.118
DSST [58]        0.395          1.452            0.079

It can be seen that both variants of our proposed tracker perform better than the baseline SiamFC tracker. SiamCBAMBig outperforms SiamFC by a large margin in both accuracy and robustness, while SiamCBAM sacrifices 0.7% accuracy in exchange for a 2.3% improvement in robustness. However, SiamCBAM outperforms both SiamFC and SiamCBAMBig in terms of the EAO metric, which shows that SiamCBAM is the best of the three when dealing with long sequences. A qualitative comparison of those trackers on several challenging videos of the VOT2018 benchmark is shown in Figure 4.2.6.

Figure 4.2.6: Qualitative comparison of different trackers on the VOT2018 benchmark. Example sequences are ants, gymnastic1, iceskater1. Best viewed in color

4.3 Chapter summary

In this chapter, we have experimented with several variants of our tracker on popular benchmark datasets. The results show that our tracker outperforms the baseline SiamFC and many other trackers in both accuracy and robustness. Furthermore, our tracker is able to operate in real time. Therefore, it can be stated that our approach is helpful for solving the object tracking problem.

4.4 Conclusion

Visual object tracking has long been considered a core research problem in the computer vision community due to its wide range of applications in the real world. Although many researchers have put a lot of effort into constructing trackers that can operate with reliability close to the human level, it remains a challenging problem that is open to further progress. In this thesis, the short-term visual object tracking problem is addressed. A wide range of classes of tracking methods has been discussed in depth to analyze the pattern in each framework and their links to each other, and the development of tracking methods has been presented in chronological order to show how ideas have been passed on up to now. From this comprehensive understanding of the progress made in visual object tracking, a CNN-based tracker was chosen as the way forward in this work. Inspired by the work of Bertinetto et al. [5], the SiamFC tracker has been studied intensively to highlight the strengths behind its success and to discover its weaknesses for further improvement. By thoroughly understanding the baseline SiamFC framework, in combination with a broad investigation of other related areas in computer vision, a novel CNN architecture specifically designed for visual object tracking was constructed: the incorporation of the Convolutional Block Attention Module into the SiamFC framework. The architecture of the proposed framework is detailed for further investigation. Extensive experimental results on various variants of the proposed tracker are presented, illustrating its achievements on the two popular benchmark datasets VOT2018 and OTB100. Its remarkable performance in terms of accuracy and robustness in comparison with the baseline SiamFC tracker and other trackers shows the effectiveness of our approach in tackling the visual object tracking problem. Last but not least, it is noticeable that the proposed tracker runs at a beyond-real-time tracking speed of 97 FPS, which suggests that our tracker can be practical for real-world applications.

4.5 Future work

The idea adopted in this thesis has shown its initial success; however, there is plenty of room for further improvement. The attention modules used in this work are plug-in blocks, which means they can be integrated into any kind of network architecture, and they have been shown to improve the performance of the baseline SiamFC tracker. Therefore, a similar idea, plugging this block into other popular tracking frameworks, looks very promising. On the other hand, our framework is trained on only a single dataset, GOT-10k. Since deep networks benefit from learning from massive amounts of data, training the proposed tracker on bigger datasets, including ImageNet and LaSOT, might result in better performance. As mentioned in section 3.2, ResNet and deeper networks are not usable in the SiamFC framework due to its strict requirement of translational invariance. However, recent research has shown that altering the training procedure can work for Siamese-based trackers, as presented in [87], [88], [89]. Therefore, modifying the training procedure is another promising development of the current work.
References

[1] C Bao, H Ling, and H Ji, "Real time robust l1 tracker using accelerated proximal gradient approach," in Proc IEEE Conf on Computer Vision and Pattern Recognition, pp 1830–1837, 2012.
[2] E Gundogdu, Good features to correlate for visual tracking. PhD thesis, The Graduate School of Natural and Applied Sciences of Middle East Technical University, 2017.
[3] Z Chen, Z Hong, and D Tao, "An experimental survey on correlation filter-based tracking," CoRR, vol abs/1509.05520, 2015.
[4] K Thanikasalam, Appearance Based Online Visual Object Tracking. PhD thesis, Queensland University of Technology, 2019.
[5] L Bertinetto, J Valmadre, J F Henriques, A Vedaldi, and P H S Torr, "Fully-convolutional siamese networks for object tracking," in ECCV 2016 Workshops, pp 850–865, 2016.
[6] S Woo, J Park, J.-Y Lee, and I S Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[7] L Cehovin, A Leonardis, and M Kristan, "Visual object tracking performance measures revisited," CoRR, vol abs/1502.05803, 2015.
[8] M Kristan et al., "The visual object tracking VOT2015 challenge results," in ICCV Workshop on Visual Object Tracking Challenge, pp 564–586, December 2015.
[9] "Feature extraction," Wikipedia, the free encyclopedia.
[10] N Dalal and B Triggs, "Histograms of oriented gradients for human detection," in CVPR, pp 886–893, 2005.
[11] D G Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol 60, pp 91–110, 2004.
[12] H Bay, A Ess, T Tuytelaars, L V Gool, and B K U Leuven, "Speeded-up robust features (SURF)," 2008.
[13] E Rublee, V Rabaud, K Konolige, and G R Bradski, "ORB: An efficient alternative to SIFT or SURF," in ICCV (D N Metaxas, L Quan, A Sanfeliu, and L V Gool, eds.), pp 2564–2571, IEEE Computer Society, 2011.
[14] A Krizhevsky, I Sutskever, and G E Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (F Pereira, C J C Burges, L Bottou, and K Q Weinberger, eds.), vol 25, Curran Associates, Inc., 2012.
[15] W Liu, D Anguelov, D Erhan, C Szegedy, S E Reed, C.-Y Fu, and A C Berg, "SSD: Single shot multibox detector," in ECCV (1) (B Leibe, J Matas, N Sebe, and M Welling, eds.), vol 9905 of Lecture Notes in Computer Science, pp 21–37, Springer, 2016.
[16] J Long, E Shelhamer, and T Darrell, "Fully convolutional networks for semantic segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[17] Y Wu, J Lim, and M Yang, "Online object tracking: A benchmark," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 2411–2418, 2013.
[18] M Kristan et al., "The sixth visual object tracking VOT2018 challenge results," in Computer Vision – ECCV 2018 Workshops, Proceedings (L Leal-Taixé and S Roth, eds.), Lecture Notes in Computer Science, (Germany), pp 3–53, Springer Verlag, 2019.
[19] D Do, G Vu, M Bui, H Ninh, and H Tien, "Real-time long-term tracking with adaptive online searching model," pp 62–68, 04 2020.
[20] Z Chen, Z Hong, and D Tao, "An experimental survey on correlation filter-based tracking," CoRR, vol abs/1509.05520, 2015.
[21] A W M Smeulders, D M Chu, R Cucchiara, S Calderara, A Dehghan, and M Shah, "Visual tracking: An experimental survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[22] A B, Detection and Tracking in Thermal Infrared Imagery. Linköping University Electronic Press, Apr 2016.
[23] D A Ross, J Lim, R.-S Lin, and M.-H Yang, "Incremental learning for robust visual tracking," 2008.
[24] D Comaniciu, V Ramesh, and P Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 25, pp 564–577, May 2003.
[25] J Kwon and K M Lee, "Visual tracking decomposition," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, June 2010.
[26] J Wright, A Yang, A Ganesh, S Sastry, and Y Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 31, pp 210–227, Feb 2009.
[27] X Mei and H Ling, "Robust visual tracking using l1 minimization," in 2009 IEEE 12th International Conference on Computer Vision, IEEE, Sept 2009.
[28] T Zhang, B Ghanem, S Liu, and N Ahuja, "Robust visual tracking via structured multi-task sparse learning."
[29] T Zhang, B Ghanem, S Liu, and N Ahuja, "Low-rank sparse learning for robust visual tracking," in ECCV, 2012.
[30] X Jia, H Lu, and M.-H Yang, "Visual tracking via adaptive structural local sparse appearance model," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, June 2012.
[31] R Tao, E Gavves, and A W M Smeulders, "Siamese instance search for tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[32] M H Abdelpakey and M S Shehata, "DomainSiam: Domain-aware siamese network for visual object tracking," CoRR, vol abs/1908.07905, 2019.
[33] C Huang, S Lucey, and D Ramanan, "Learning policies for adaptive tracking with deep feature cascades," in 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, Oct 2017.
[34] J Valmadre, L Bertinetto, J F Henriques, A Vedaldi, and P H S Torr, "End-to-end representation learning for correlation filter based tracking," CoRR, vol abs/1704.06036, 2017.
[35] S Hare, A Saffari, and P H S Torr, "Struck: Structured output tracking with kernels," in ICCV (D N Metaxas, L Quan, A Sanfeliu, and L V Gool, eds.), pp 263–270, IEEE Computer Society, 2011.
[36] P Viola and M Jones, "Rapid object detection using a boosted cascade of simple features," 2001.
[37] A Saffari, C Leistner, J Santner, M Godec, and H Bischof, "On-line random forests," in 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, IEEE, Sept 2009.
[38] J F Henriques, R Caseiro, P Martins, and J Batista, "High-speed tracking with kernelized correlation filters," in TPAMI, 2014.
[39] A Lukežič, T Vojíř, L Čehovin Zajc, J Matas, and M Kristan, "Discriminative correlation filter tracker with channel and spatial reliability," International Journal of Computer Vision, 2018.
[40] J Gao, H Ling, W Hu, and J Xing, "Transfer learning based visual tracking with gaussian processes regression," in Computer Vision – ECCV 2014, pp 188–203, Springer International Publishing, 2014.
[41] S Wu, Y Zhu, and Q Zhang, "A new robust visual tracking algorithm based on transfer adaptive boosting," Mathematical Methods in the Applied Sciences, vol 35, pp 2133–2140, Aug 2012.
[42] C Leistner, A Saffari, P M Roth, and H Bischof, "On robustness of on-line boosting - a competitive study," in ICCV Workshops, pp 1362–1369, IEEE Computer Society, 2009.
[43] C Shen, G Lin, and A van den Hengel, "Structboost: Boosting methods for predicting structured output variables," CoRR, vol abs/1302.3283, 2013.
[44] R Yao, Q Shi, C Shen, Y Zhang, and A van den Hengel, "Part-based visual tracking with online latent structural learning," in 2013 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, June 2013.
[45] J Ning, J Yang, S Jiang, L Zhang, and M.-H Yang, "Object tracking via dual linear structured SVM and explicit feature map," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, June 2016.
[46] B A Draper, D S Bolme, J Beveridge, and Y Lui, "Visual object tracking using adaptive correlation filters," in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (Los Alamitos, CA, USA), pp 2544–2550, IEEE Computer Society, Jun 2010.
[47] J F Henriques, R Caseiro, P Martins, and J P Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in ECCV (4) (A W Fitzgibbon, S Lazebnik, P Perona, Y Sato, and C Schmid, eds.), vol 7575 of Lecture Notes in Computer Science, pp 702–715, Springer, 2012.
[48] B Schölkopf and A Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning, Cambridge, MA, USA: MIT Press, Dec 2002. Parts of this book, including an introduction to kernel methods, can be downloaded at http://www.learning-with-kernels.org/sections/.
[49] R M Gray, "Toeplitz and circulant matrices: A review," Foundations and Trends in Communications and Information Theory, vol 2, no 3, pp 155–239, 2005.
[50] T Liu, G Wang, and Q Yang, "Real-time part-based visual tracking via adaptive correlation filters," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, June 2015.
[51] Y Li, J Zhu, and S C Hoi, "Reliable patch trackers: Robust visual tracking by exploiting reliable patches," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[52] M Danelljan, G Häger, F S Khan, and M Felsberg, "Learning spatially regularized correlation filters for visual tracking," CoRR, vol abs/1608.05571, 2016.
[53] F Li, C Tian, W Zuo, L Zhang, and M Yang, "Learning spatial-temporal regularized correlation filters for visual tracking," CoRR, vol abs/1803.08679, 2018.
[54] H Kiani Galoogahi, A Fagg, and S Lucey, "Learning background-aware correlation filters for visual tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1135–1143, 2017.
[55] S Boyd, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol 3, no 1, pp 1–122, 2010.
[56] C Sun, D Wang, H Lu, and M Yang, "Correlation tracking via joint discrimination and reliability learning," CoRR, vol abs/1804.08965, 2018.
[57] Y Li and J Zhu, "A scale adaptive kernel correlation filter tracker with feature integration," in European Conference on Computer Vision, pp 254–265, Springer, 2014.
[58] M Danelljan, G Häger, F Khan, and M Felsberg, "Accurate scale estimation for robust visual tracking," in Proceedings of the British Machine Vision Conference 2014, BMVA Press, 2014.
[59] M Tang and J Feng, "Multi-kernel correlation filter for visual tracking," in 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, Dec 2015.
[60] A Bibi and B Ghanem, "Multi-template scale-adaptive kernelized correlation filters," in 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), IEEE, Dec 2015.
[61] K Zhang, L Zhang, M Yang, and D Zhang, "Fast tracking via spatio-temporal context learning," CoRR, vol abs/1311.1939, 2013.
[62] F Li, Y Yao, P Li, D Zhang, W Zuo, and M Yang, "Integrating boundary and center correlation filters for visual tracking with aspect ratio variation," CoRR, vol abs/1710.02039, 2017.
[63] L Bertinetto, J Valmadre, S Golodetz, O Miksik, and P H S Torr, "Staple: Complementary learners for real-time tracking," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[64] M Danelljan, F S Khan, M Felsberg, and J V D Weijer, "Adaptive color attributes for real-time visual tracking," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, June 2014.
[65] C Ma, X Yang, C Zhang, and M.-H Yang, "Long-term correlation tracking," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, June 2015.
[66] J Choi, H J Chang, S Yun, T Fischer, Y Demiris, and J Y Choi, "Attentional correlation filter network for adaptive visual tracking," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, July 2017.
[67] M Danelljan, G Häger, F S Khan, and M Felsberg, "Convolutional features for correlation filter based visual tracking," in 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), IEEE, Dec 2015.
[68] A Milan, S H Rezatofighi, A R Dick, K Schindler, and I D Reid, "Online multi-target tracking using recurrent neural networks," CoRR, vol abs/1604.03635, 2016.
[69] Y Qi, S Zhang, L Qin, H Yao, Q Huang, J Lim, and M.-H Yang, "Hedged deep tracking," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, June 2016.
[70] L Wang, W Ouyang, X Wang, and H Lu, "Visual tracking with fully convolutional networks," in 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, Dec 2015.
[71] Y Song, C Ma, L Gong, J Zhang, R W H Lau, and M Yang, "CREST: Convolutional residual learning for visual tracking," CoRR, vol abs/1708.00225, 2017.
[72] M Danelljan, A Robinson, F Shahbaz Khan, and M Felsberg, "Beyond correlation filters: Learning continuous convolution operators for visual tracking," in ECCV, 2016.
[73] M Danelljan, G Bhat, F Shahbaz Khan, and M Felsberg, "ECO: Efficient convolution operators for tracking," in CVPR, 2017.
[74] K Li, Y Kong, and Y Fu, "Multi-stream deep similarity learning networks for visual tracking," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Aug 2017.
[75] H Nam and B Han, "Learning multi-domain convolutional neural networks for visual tracking," CoRR, vol abs/1510.07945, 2015.
[76] B Han, J Sim, and H Adam, "BranchOut: Regularization for online ensemble tracking with convolutional neural networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, July 2017.
[77] H Fan and H Ling, "SANet: Structure-aware network for visual tracking," CoRR, vol abs/1611.06878, 2016.
[78] S Yun, J Choi, Y Yoo, K Yun, and J Y Choi, "Action-decision networks for visual tracking with deep reinforcement learning," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, July 2017.
[79] K He, X Zhang, S Ren, and J Sun, "Deep residual learning for image recognition," CoRR, vol abs/1512.03385, 2015.
[80] J Hu, L Shen, and G Sun, "Squeeze-and-excitation networks," CoRR, vol abs/1709.01507, 2017.
[81] S Zagoruyko and N Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," CoRR, vol abs/1612.03928, 2016.
[82] L Huang, X Zhao, and K Huang, "GOT-10k: A large high-diversity benchmark for generic object tracking in the wild," CoRR, vol abs/1810.11981, 2018.
[83] A Krizhevsky, I Sutskever, and G E Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (F Pereira, C J C Burges, L Bottou, and K Q Weinberger, eds.), vol 25, Curran Associates, Inc., 2012.
[84] D Tran, L Bourdev, R Fergus, L Torresani, and M Paluri, "Learning spatiotemporal features with 3d convolutional networks," in 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, Dec 2015.
[85] Q Qiu, X Cheng, R Calderbank, and G Sapiro, "DCFNet: Deep neural network with decomposed convolutional filters," in Proceedings of the 35th International Conference on Machine Learning (J Dy and A Krause, eds.), vol 80 of Proceedings of Machine Learning Research, (Stockholmsmässan, Stockholm, Sweden), pp 4198–4207, PMLR, 10–15 Jul 2018.
[86] B Babenko, M.-H Yang, and S Belongie, "Visual tracking with online multiple instance learning," in CVPR, 2009.
[87] Z Zhu, Q Wang, L Bo, W Wu, J Yan, and W Hu, "Distractor-aware siamese networks for visual object tracking," in European Conference on Computer Vision, 2018.
[88] B Li, J Yan, W Wu, Z Zhu, and X Hu, "High performance visual tracking with siamese region proposal network," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[89] B Li, W Wu, Q Wang, F Zhang, J Xing, and J Yan, "SiamRPN++: Evolution of siamese visual tracking with very deep networks," CoRR, vol abs/1812.11703, 2018.

... world applications. Deep Tracking Networks for object tracking: this type of tracker tackles the visual object tracking problem by formulating it as a convolutional neural network that can be trained...

... breakthrough in object recognition [14], object detection [15] and object segmentation [16]. Visual Object Tracking is strongly related to those areas, so the idea of using deep learning for object tracking...

2.2.2 CNN-based object trackers

An important aspect of using deep neural networks is that they can solve the issue of data scarcity in the visual object tracking problem. In traditional object trackers,