Real-Time Siamese Visual Tracking with Lightweight Transformer

Dinh Thang Hoang∗, Trung Kien Thai∗, Thanh Nguyen Chi∗ and Long Quoc Tran†
∗ Institute of Information Technology, Academy of Military Science and Technology
† AI Lab, VNU University of Engineering and Technology, Hanoi, Vietnam
Email: hoangdinhthang@gmail.com, kienthaitrung@gmail.com, thanhnc80@gmail.com, tqlong@vnu.edu.vn
† Corresponding author

Abstract—Siamese-based trackers have demonstrated remarkable performance in visual tracking. Most existing trackers compute the target template and search image features independently, then use cross-correlation to predict the likelihood of the object appearing at each spatial position of the search image for target localization. This paper proposes a Siamese network that enhances and aggregates features of the target template and the search image with a lightweight transformer composed of several linear self- and cross-attention layers. With an anchor-free prediction head, the proposed framework is simple and effective. Extensive experiments on visual tracking benchmarks, including VOT2018, UAV123, and OTB100, demonstrate that our tracker achieves state-of-the-art performance while operating at a real-time frame rate of 39 fps.

Index Terms—Visual Tracking, Siamese Visual Tracking, Real-time Visual Tracking, Lightweight Transformer

I. INTRODUCTION

Visual tracking is a critical task in computer vision, since it enables the estimation of the state of a target object within a video sequence. Despite major advancements in recent years, visual tracking still faces difficulties caused by occlusion, scale variation, deformation, illumination variation, and camera motion. Numerous approaches have been published in recent years; nonetheless, developing a real-time, high-accuracy tracker remains a difficult task.

Siamese network-based trackers [1]–[5] formulate visual object tracking as learning a general similarity map via cross-correlation between the feature representations of the target template and the search region. However, cross-correlation is a linear matching process, which limits the tracker's ability to capture the complex non-linear interaction between the template and the search patch. Anchor boxes are frequently employed in visual tracking [1]–[3] and can provide an acceptable trade-off between speed and accuracy; however, because anchors are used for region proposals, careful hyper-parameter tuning of the anchors is critical for successful tracking. SiamBAN [4] and SiamCAR [5] applied the FCOS [6] concept to tracking and developed anchor-free trackers that avoid the complicated hyper-parameters of anchor placement. However, these Siamese-based trackers often locate the target with regression and classification branches that are optimized independently, which can result in a mismatch between the two during tracking.

Recently, attention and transformer mechanisms were introduced to visual tracking in [7], [8]. SiamAttn [7] is an anchor-based tracker that exploits both self- and cross-attention to improve the discriminative capacity of the template and search features before performing depth-wise cross-correlation (DW-Xcorr) on the fused features. TransformerTrack [8] employs a complete transformer, which incurs a very large computational and memory cost and is slow to train.

In this work, we address the issues described above by designing and
proposing a new Lightweight Transformer Feature Fusion (LTFF) module for the enhancement and fusion of the two Siamese branch features. The proposed LTFF is composed of two modules: a feature enhancement module that interleaves self- and cross-attention layers several times, and a feature fusion module that contains a single cross-attention layer. Because center points are used for the initial object representation, our proposed Siamese network with its prediction head is anchor-free, which enables more precise localization and classification.

Our primary contributions are as follows:
• We propose a new architecture for Siamese tracking that combines feature extraction, lightweight transformer feature fusion, and prediction head modules. The prediction head is anchor-free, enabling more precise localization and classification.
• The proposed lightweight transformer feature fusion enhances and aggregates rich contextual information between the target template and the search image. Additionally, the lightweight transformer reduces our framework's computational complexity.
• We conduct extensive experiments on three benchmark datasets, including VOT2018, UAV123, and OTB100, and demonstrate that our tracker matches or outperforms state-of-the-art results while operating at real-time speeds.

II. RELATED WORK

Visual tracking has become a prominent research topic in computer vision in recent decades, driven by the development of new benchmark datasets. This section briefly reviews the three directions that are most relevant to our work.

Fig. 1. An overview of the proposed network. It consists of an input, a feature extraction module, a transformer feature fusion module, and a prediction head for classification and regression.

Siamese-based Trackers: Recently, Siamese-based trackers [1]–[5] have drawn significant attention for their superior performance. Building on the success of the region proposal network (RPN), Li et al. [1] propose the Siamese region proposal network SiamRPN, which addresses the scale variation of the target. Based on SiamRPN, DaSiamRPN [2] designs a distractor-aware tracker. To take advantage of the powerful features extracted by deep networks, SiamRPN++ [3] overcame the hurdle of incorporating very deep backbones into the Siamese framework and significantly increased tracking performance. While these RPN-based trackers have demonstrated excellent performance, they have the drawback of being highly sensitive to anchor hyper-parameters.

Attention and Transformer Mechanism: The Transformer [9] was developed as a new attention-based architecture for machine translation. Attention mechanisms are neural network layers that aggregate information from the whole input sequence. SiamAttn [7] investigates both self- and cross-branch attention to enhance the discriminative capacity of target features, and then fuses the features extracted from the template and search images by depth-wise cross-correlation; this tracker is anchor-based. While transformers work admirably in a variety of tasks, they are unacceptably slow for very long sequences due to their quadratic
complexity in the length of the input. TransformerTrack [8] employs a complete transformer consisting of an encoder and a decoder, which is computationally and memory intensive and slow to train. RFA [10] proposes a linear time and space attention algorithm that approximates the softmax function using random feature methods and investigates its application in transformers.

Anchor-free Mechanism: Due to their novelty and simplicity, anchor-free detectors have recently garnered significant attention. CornerNet [11] pioneered the concept of a corner-based detector by converting the target's bounding box into a pair of corner predictions. FCOS [6] proposed directly predicting the existence of an object and its bounding box coordinates without reference to an anchor. RepPoints [12] presented representative points as a new object representation that can describe fine-grained localization of objects and identify significant local areas for object classification. SiamBAN [4] and SiamCAR [5] adapted the FCOS concept for tracking, whereas SiamCorners [13] applied CornerNet's concept to tracking and created a simple yet effective anchor-free tracker. These works have improved tracking simplicity and accuracy with the anchor-free design, but they continue to rely heavily on a correlation operation to fuse template and search region features. In this study, we leverage the fundamental concepts of linear attention and point set representation for object detection (RepPoints) to develop a new anchor-free Siamese network for visual tracking that directly fuses template and search region features without requiring any correlation operation.

III. PROPOSED METHOD

We describe the details of our proposed network in this section. As shown in Fig. 1, it consists of three main components: a feature extraction network, a lightweight transformer feature fusion module, and a prediction head network.

A. Feature Extraction

In this work, we use a fully convolutional network to construct the Siamese subnetwork for visual feature extraction. The Siamese network consists of two identical branches: one is referred to as the template branch, the other as the search branch. For feature extraction, we employ a modified version of ResNet-50 [14], pre-trained on ImageNet [15], as the backbone network. In our tracker, we only use the outputs of the fourth stage (layer3) as the final outputs. The backbone processes the template patch (denoted as z ∈ R^{3×H_{z0}×W_{z0}}) and the search patch (denoted as x ∈ R^{3×H_{x0}×W_{x0}}) to obtain their feature maps F_z ∈ R^{C_z×H_z×W_z} and F_x ∈ R^{C_x×H_x×W_x}, with H_z = H_{z0}/8, W_z = W_{z0}/8, H_x = H_{x0}/8, W_x = W_{x0}/8, and C_z = C_x = 1024. After that, we apply a neck with a 1 × 1 convolution to decrease the number of output feature channels to 256 and, following [1], we use only the center 7 × 7 area of the template branch features, which still captures the whole target region. The output features of our network are denoted as Z ∈ R^{C×h×w} and X ∈ R^{C×H×W}, with C = 256.
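To make this stage concrete, the following is a minimal PyTorch sketch of the Siamese feature extraction: a shared backbone truncated after layer3, a 1 × 1 neck reducing the channels to 256, and the center 7 × 7 crop of the template features. The paper's exact backbone modifications (a total stride of 8, as in [3]) are not reproduced here; a vanilla torchvision ResNet-50 is used as a stand-in, so the resulting spatial sizes differ from the 15 × 15 and 31 × 31 maps of the paper.

```python
# Minimal sketch of the feature-extraction stage (shared backbone + 1x1 neck +
# center crop of the template features). Vanilla torchvision ResNet-50 is a
# stand-in: its layer3 output has stride 16, whereas the paper's modified
# backbone has stride 8, so spatial sizes here are smaller than in the paper.
import torch
import torch.nn as nn
import torchvision

class SiameseFeatureExtractor(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)  # ImageNet weights in practice
        # Keep stages up to layer3 (the fourth stage), whose output has 1024 channels.
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )
        # Neck: 1x1 convolution reducing 1024 channels to 256.
        self.neck = nn.Sequential(
            nn.Conv2d(1024, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, z, x):
        # z: template patch (B, 3, 127, 127); x: search patch (B, 3, 255, 255).
        fz = self.neck(self.backbone(z))
        fx = self.neck(self.backbone(x))
        # Keep only the center 7x7 region of the template features.
        c = fz.shape[-1] // 2
        fz = fz[:, :, c - 3:c + 4, c - 3:c + 4]
        return fz, fx
```

With the stride-8 backbone described in the paper, the same code would yield the 7 × 7 × 256 template features Z and 31 × 31 × 256 search features X consumed by the LTFF module.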
Fig. 2. (a) Linear attention layer with O(N) complexity. (b) The proposed Lightweight Transformer Feature Fusion module, which consists of two sub-modules: an enhancement sub-module and a fusion sub-module.

B. Lightweight Transformer Feature Fusion (LTFF)

As illustrated in Fig. 2(b), the proposed LTFF module takes Z and X as inputs and outputs the fused features by applying the lightweight transformer mechanism. The LTFF module consists of two sub-modules: a feature enhancement sub-module and a feature fusion sub-module.

Lightweight Transformer. Let x ∈ R^{N×F} denote a sequence of N feature vectors of dimension F. The input x is projected by three matrices W_Q ∈ R^{F×D}, W_K ∈ R^{F×D}, and W_V ∈ R^{F×M} to the corresponding representations Q, K, and V. The self-attention output, SA(x) = V', is computed as follows:

$Q = xW_Q,\quad K = xW_K,\quad V = xW_V,\qquad \mathrm{SA}(x) = V' = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{D}}\right)V$   (1)

For softmax attention, the complexity of computing softmax(QK^⊤)V is quadratic, O(N²). By writing the softmax explicitly, (1) can be rewritten as:

$V'_i = \frac{\sum_{j=1}^{N} \kappa(Q_i, K_j)\,V_j}{\sum_{j=1}^{N} \kappa(Q_i, K_j)}$   (2)

where $\kappa(Q, K) = \exp\!\big(QK^{\top}/\sqrt{D}\big)$ is the softmax kernel. Following [10], [16], by substituting an alternative kernel function $\kappa'(Q, K) = \phi(Q)\,\phi(K)^{\top}$ for the softmax kernel, the computational complexity of attention can be reduced to O(N), as illustrated in Fig. 2(a). In this work, we design our lightweight transformer using the random feature map φ_arccos(x) proposed in RFA [10].

Feature Enhancement. In this work, we use the 2D extension of the standard positional encoding of Transformers, as in DETR [17]. By adding the positional encoding to the features Z ∈ R^{C×h×w} and X ∈ R^{C×H×W}, we obtain the positional features Ẑ ∈ R^{C×h×w} and X̂ ∈ R^{C×H×W}. The feature enhancement module interleaves a Self-Attention Layer (SAL) and a Cross-Attention Layer (CAL) Lc times.

For the self-attention layers, suppose the input features are Ẑ and X̂; self-attention (SA) is formulated as:

$Z_{SA} = \phi(\hat{Z}W_Q)\,\big(\phi(\hat{Z}W_K)^{\top}(\hat{Z}W_V)\big) \in \mathbb{R}^{C\times h\times w}$   (3)

$X_{SA} = \phi(\hat{X}W_Q)\,\big(\phi(\hat{X}W_K)^{\top}(\hat{X}W_V)\big) \in \mathbb{R}^{C\times H\times W}$   (4)

Then, we generate the self-attention layer feature maps:

$\tilde{Z} = \hat{Z} + \mathrm{LN}\big(\mathrm{MLP}(\mathrm{CAT}(\hat{Z}, \mathrm{LN}(Z_{SA})))\big) \in \mathbb{R}^{C\times h\times w}$   (5)

$\tilde{X} = \hat{X} + \mathrm{LN}\big(\mathrm{MLP}(\mathrm{CAT}(\hat{X}, \mathrm{LN}(X_{SA})))\big) \in \mathbb{R}^{C\times H\times W}$   (6)

For the cross-attention layers, suppose the input features are $\tilde{Z}$ and $\tilde{X}$; cross-attention (CA) is formulated as:

$Z_{CA} = \phi(\tilde{Z}W_Q)\,\big(\phi(\tilde{X}W_K)^{\top}(\tilde{X}W_V)\big) \in \mathbb{R}^{C\times h\times w}$   (7)

$X_{CA} = \phi(\tilde{X}W_Q)\,\big(\phi(\tilde{Z}W_K)^{\top}(\tilde{Z}W_V)\big) \in \mathbb{R}^{C\times H\times W}$   (8)

Then, we generate the cross-attention layer feature maps:

$Z = \tilde{Z} + \mathrm{LN}\big(\mathrm{MLP}(\mathrm{CAT}(\tilde{Z}, \mathrm{LN}(Z_{CA})))\big) \in \mathbb{R}^{C\times h\times w}$   (9)

$X = \tilde{X} + \mathrm{LN}\big(\mathrm{MLP}(\mathrm{CAT}(\tilde{X}, \mathrm{LN}(X_{CA})))\big) \in \mathbb{R}^{C\times H\times W}$   (10)

where LN, MLP, and CAT denote Layer Normalization, Multilayer Perceptron, and concatenation, respectively. After that, by replacing Ẑ and X̂ with Z and X, we repeat the generation of self-attention layer features and cross-attention layer features Lc − 1 times.

Feature Fusion. With the input features Z and X above, the cross-attention feature of X is computed as:

$X_{CA} = \phi(XW_Q)\,\big(\phi(ZW_K)^{\top}(ZW_V)\big) \in \mathbb{R}^{C\times H\times W}$   (11)

Then the encoder feature of the search patch is formulated as:

$E = X + \mathrm{LN}\big(\mathrm{MLP}(\mathrm{CAT}(X, \mathrm{LN}(X_{CA})))\big) \in \mathbb{R}^{C\times H\times W}$   (12)

After that, the fused feature map is generated by:

$F = \varphi(E) \in \mathbb{R}^{C\times H\times W}$   (13)

where φ is a sequential block consisting of a 1 × 1 convolution, batch normalization, ReLU, and a further convolution. The final feature response map R ∈ R^{C×25×25} is generated by applying bilinear interpolation to resize F to C × 25 × 25. R thus contains rich information for both classification and regression.
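As a concrete illustration of the linear attention of Eq. (2) and Fig. 2(a), the sketch below implements a single kernelized attention layer in PyTorch, including the normalization of Eq. (2). The paper's φ_arccos random feature map from RFA [10] is not reproduced; elu(x) + 1 is used here as a simple stand-in positive feature map, so this is illustrative only.

```python
# Minimal sketch of a linear (kernelized) attention layer: the softmax kernel of
# Eq. (2) is replaced by phi(Q) phi(K)^T, giving O(N) cost in the sequence length.
# phi below is a stand-in feature map, not the phi_arccos map used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

def phi(x):
    # Positive feature map standing in for the random feature map of RFA [10].
    return F.elu(x) + 1.0

class LinearAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)

    def forward(self, q_feat, kv_feat):
        # q_feat: (B, Nq, C) query tokens; kv_feat: (B, Nk, C) key/value tokens.
        q = phi(self.wq(q_feat))                  # (B, Nq, C)
        k = phi(self.wk(kv_feat))                 # (B, Nk, C)
        v = self.wv(kv_feat)                      # (B, Nk, C)
        kv = torch.einsum("bnc,bnd->bcd", k, v)   # sum_j phi(K_j)^T V_j, (B, C, C)
        z = 1.0 / (torch.einsum("bnc,bc->bn", q, k.sum(dim=1)) + 1e-6)  # Eq. (2) normalizer
        return torch.einsum("bnc,bcd,bn->bnd", q, kv, z)                # (B, Nq, C)
```

Self-attention calls this layer with the same branch for queries and keys/values, while cross-attention feeds the other branch as keys/values, which is how Eqs. (7), (8), and (11) mix template and search information. The feature maps Z (C × h × w) and X (C × H × W) would be flattened into sequences of h·w and H·W tokens before these calls, and the SAL/CAL blocks of Eqs. (3)-(10) wrap such layers with layer normalization, an MLP over concatenated features, and a residual connection.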
C. Prediction Head Network

The prediction head is composed of subnetworks for localization and classification, as illustrated in Fig. 1 (right). The localization subnet takes R as input and produces a new feature map with three 3 × 3 convolution layers with ReLU activations. One branch of the localization subnet then convolves this feature map once more and outputs the initial bounding box p_ini = (l0, t0, r0, b0). The other branch performs a star-shaped deformable convolution on nine feature sampling points (using the initial box and the feature map as inputs) and produces the distance scaling factors (Δl, Δt, Δr, Δb). The distance scaling factors are then multiplied by p_ini to obtain the refined bounding box p_ref = (l, t, r, b). The classification subnet is structured similarly to the refinement branch of the localization subnet, except that it outputs a vector of two elements per spatial location in order to predict the classification score.

Loss Function. Let (g_xc, g_yc) and (g_w, g_h) denote the target box's center position and size, respectively. In this work, we assign positive and negative labels using the ellipse region, as described in [4]. There are two ellipses, denoted E1 and E2. E1's center and axes are set to (g_xc, g_yc) and (g_w/2, g_h/2), respectively, while E2's are set to (g_xc, g_yc) and (g_w/4, g_h/4). We map each location (i, j) in the feature map R onto the input frame to obtain the corresponding image location (p_i, p_j). If (p_i, p_j) falls within the ellipse E2, it is considered positive; if (p_i, p_j) falls outside the ellipse E1, it is considered negative; if (p_i, p_j) falls between the ellipses E1 and E2, it is ignored. The initial-box and refined-box localization losses are calculated using the IoU loss and are defined as follows:

$L_{ref\text{-}ini} = \frac{1}{N_{pos}} \sum_{i,j} \mathbb{I}_{\{(p_i,p_j)\in E_2\}}\, L_{IoU}\big(p^{ini}_{i,j},\, g_{i,j}\big)$   (14)

$L_{ref\text{-}ref} = \frac{1}{N_{pos}} \sum_{i,j} \mathbb{I}_{\{(p_i,p_j)\in E_2\}}\, L_{IoU}\big(p^{ref}_{i,j},\, g_{i,j}\big)$   (15)

where N_pos denotes the number of positive samples, $\mathbb{I}_{\{(p_i,p_j)\in E_2\}}$ denotes an indicator function that equals 1 if (p_i, p_j) ∈ E2 and 0 otherwise, L_IoU denotes the IoU loss as in UnitBox [18], g_{i,j} denotes the ground-truth box, p^{ini}_{i,j} denotes the initial bounding box, and p^{ref}_{i,j} denotes the refined bounding box. Cross-entropy loss is used to calculate the classification score loss for each box and is defined as:

$L_{cls} = -\big(y_o \log(p_o) + (1 - y_o)\log(1 - p_o)\big) - \big(y_b \log(p_b) + (1 - y_b)\log(1 - p_b)\big)$   (16)

where y_o and y_b denote the target-object and background labels, respectively, and p_o and p_b denote the probabilities with which the tracker predicts the target object and the background. Using the above losses, we define our multi-task loss function as:

$L = \lambda_1 L_{cls} + \lambda_2 L_{ref\text{-}ini} + \lambda_3 L_{ref\text{-}ref}$   (17)

We do not search for the hyper-parameters of (17) and simply set λ1 = λ2 and λ3 = 1.5.

D. Tracking Phase

We crop the template patch from the initial frame and feed it to the network at the start of tracking. For the following frames, we crop the search patch and extract features centered on the previous frame's target position, before performing prediction in the search region to obtain the classification map $P^{cls}_{h\times w\times 2}$ and the refined regression map $P^{reg\text{-}ref}_{h\times w\times 4}$. Following that, we obtain the prediction boxes with the following steps:

$p_{x1} = p_i - d^{reg\text{-}ref}_{l},\quad p_{y1} = p_j - d^{reg\text{-}ref}_{t},\quad p_{x2} = p_i + d^{reg\text{-}ref}_{r},\quad p_{y2} = p_j + d^{reg\text{-}ref}_{b}$   (18)

where $d^{reg\text{-}ref}_{l}$, $d^{reg\text{-}ref}_{t}$, $d^{reg\text{-}ref}_{r}$, and $d^{reg\text{-}ref}_{b}$ denote the prediction values of the refined regression map, and (p_{x1}, p_{y1}) and (p_{x2}, p_{y2}) are the top-left and bottom-right corners of the prediction box. After generating the prediction boxes, the box with the highest score is chosen, and its size is updated linearly with the previous frame's state using the cosine window and a scale-change penalty to smooth target movements and scale changes.
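To illustrate how Eq. (18) and the post-processing above turn the network outputs into a box, the following is a minimal NumPy sketch. The feature stride, the cosine-window weight, and the omission of the scale-change penalty are illustrative assumptions; the paper does not list these constants.

```python
# Minimal sketch of the box decoding of Eq. (18) plus cosine-window weighting.
# The stride and window_influence values are assumptions for illustration only,
# and the scale-change penalty used in the paper is omitted here.
import numpy as np

def select_box(cls_map, reg_map, stride=8, window_influence=0.4):
    """cls_map: (h, w) foreground scores; reg_map: (4, h, w) refined distances (l, t, r, b)."""
    h, w = cls_map.shape
    # Map each feature-map location (i, j) back to search-patch coordinates (p_i, p_j);
    # any constant offset introduced by the crop is omitted for brevity.
    pi, pj = np.meshgrid(np.arange(w) * stride, np.arange(h) * stride)  # x and y grids, (h, w)
    x1 = pi - reg_map[0]   # p_x1 = p_i - d_l
    y1 = pj - reg_map[1]   # p_y1 = p_j - d_t
    x2 = pi + reg_map[2]   # p_x2 = p_i + d_r
    y2 = pj + reg_map[3]   # p_y2 = p_j + d_b
    # Down-weight locations far from the center with a cosine (Hanning) window.
    window = np.outer(np.hanning(h), np.hanning(w))
    score = cls_map * (1 - window_influence) + window * window_influence
    best = np.unravel_index(np.argmax(score), score.shape)
    return np.array([x1[best], y1[best], x2[best], y2[best]]), cls_map[best]
```

In the full tracker, cls_map would be the foreground channel of the classification map after a softmax, and the selected box is then smoothed with the previous frame's state as described above.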
IV. EXPERIMENTS

A. Implementation Details

The network is trained on the COCO [19], ImageNet DET [20], ImageNet VID [20], and GOT-10k [21] training sets. To facilitate comparison, we set the size of the template patch to 127 × 127 pixels and the size of the search patch to 255 × 255 pixels, the same as [3]. Stochastic gradient descent (SGD) is used to train our network with mini-batches of 16 pairs. We train for a total of 20 epochs, with a warm-up learning rate from 0.001 to 0.005 in the first five epochs and an exponentially decaying learning rate from 0.005 to 0.00005 in the final fifteen epochs. We set the number of SAL and CAL layers Lc to the value giving the best trade-off between performance and real-time operation. Our approach is implemented in Python with PyTorch on a computer with two Intel(R) Xeon(R) Bronze 3104 CPUs @ 1.70 GHz, 96 GB RAM, and two Nvidia GTX 1080Ti GPUs.

B. Comparison with State-of-the-art Trackers

We compare our proposed tracker with recent state-of-the-art trackers on three tracking benchmarks: VOT2018, UAV123, and OTB100. Our tracker achieves state-of-the-art results and operates at a frame rate of 39 fps.

Fig. 3. Comparison of EAO on VOT2018 for the visual attributes camera motion, illumination change, occlusion, size change, and motion change among the top-performing trackers (Ours, SiamBAN, SiamRPN++, SiamRPN, DeepCSRDCF, CCOT, DiMP-50, TrDiMP, UPDT, ATOM).

Fig. 4. Comparison with state-of-the-art trackers on UAV123 in terms of success and precision plots of OPE.

On VOT2018. The VOT2018 [22] benchmark includes sixty challenging videos. The expected average overlap (EAO) is used to evaluate performance on this dataset; EAO considers both accuracy (average overlap over successful frames) and robustness (failure rate). On the VOT2018 dataset, we compare our tracker to nine state-of-the-art approaches, including SiamRPN [1], SiamRPN++ [3], SiamBAN [4], TrDiMP [8], DiMP-50 [23], ATOM [24], UPDT [25], DeepCSRDCF [26], and CCOT [27]. As shown in Tab. I, our tracker achieves an EAO score of 0.457, significantly outperforming state-of-the-art methods on this metric. Only TrDiMP [8] has an EAO equal to our tracker's, despite the fact that TrDiMP is trained with additional data, including TrackingNet [28] and LaSOT [29], and uses a complete transformer with an encoder and decoder. Additionally, our tracker operates at a real-time speed of 39 fps, while TrDiMP runs at 26 fps on a GTX 1080Ti. All sequences of VOT2018 are annotated per frame with the following visual attributes: illumination change, occlusion, camera motion, size change, and motion change. Frames that do not correspond to any of the five attributes
are represented as unassigned. We compare the EAO of the top-performing trackers across these visual attributes. As shown in Fig. 3, our tracker ranks first on the camera motion and occlusion attributes and ranks second and third on the unassigned and size change attributes, which shows that our tracker is robust to camera motion and occlusion.

TABLE I
RESULTS ON VOT2018, WITH ACCURACY (A), ROBUSTNESS (R), LOST NUMBER (LN), AND EXPECTED AVERAGE OVERLAP (EAO)

Tracker           Source      A(↑)    R(↓)    LN(↓)   EAO(↑)
Ours              -           0.596   0.178   38.0    0.457
TrDiMP [8]        CVPR 2021   0.595   0.141   30.0    0.457
SiamBAN [4]       CVPR 2020   0.590   0.178   38.0    0.447
DiMP-50 [23]      ICCV 2019   0.597   0.152   32.5    0.439
SiamRPN++ [3]     CVPR 2019   0.600   0.234   50.0    0.415
ATOM [24]         CVPR 2019   0.590   0.203   43.4    0.400
SiamRPN [1]       CVPR 2018   0.586   0.276   59.0    0.383
UPDT [25]         ECCV 2018   0.536   0.184   39.2    0.378
DeepCSRDCF [26]   ICCV 2015   0.489   0.276   59.0    0.293
CCOT [27]         ECCV 2016   0.494   0.318   68.0    0.267

On UAV123. The UAV123 [30] dataset is comprised of 123 video sequences and over 100,000 frames. Each sequence is annotated in detail with upright bounding boxes. Fast motion, large scale variation, illumination variation, and occlusion all contribute to the difficulty of tracking objects in this dataset. Our tracker is compared with six state-of-the-art approaches on this dataset: SiamRPN [1], DaSiamRPN [2], SiamBAN [4], SiamCAR [5], ECO [31], and SiamLT. The success and precision plots of OPE are used to assess overall performance (the success plot indicates the percentage of frames in which the estimated and ground-truth bounding boxes overlap by more than a predefined threshold, whereas the precision plot indicates the percentage of frames in which the distance between the estimated and ground-truth box centers is less than predefined thresholds; a minimal sketch of this protocol is given at the end of this subsection). As illustrated in Fig. 4, our tracker outperforms all other trackers in terms of success and precision, achieving a success score of 0.638 and a precision score of 0.840. Compared with the state-of-the-art SiamBAN and SiamCAR trackers, our tracker achieves the best results with a simpler and faster network; those trackers use three backbone layers as input features and train on larger datasets (with the addition of LaSOT [29] and YouTube-BB [32]).

On OTB100. OTB100 [33] is a robustness benchmark that serves as a fair test bed. Each sequence in the dataset is labeled with 11 attributes of interference, including illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, low resolution, and background clutter. We compare our network with eight state-of-the-art approaches: SiamRPN [1], DaSiamRPN [2], SiamRPN++ [3], SiamBAN [4], SiamCAR [5], Ocean [34], KYS [35], and GradNet [36]. In OPE, we evaluate each tracker's success and precision plots. As illustrated in Fig. 5, the proposed tracker ranks first in terms of success and fourth in terms of precision. Our tracker, in particular, significantly improves tracking accuracy in the presence of deformation, scale variation, occlusion, out-of-plane rotation, and low resolution.
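For completeness, here is a minimal sketch of the OPE success/precision protocol used by UAV123 and OTB100; it is a generic illustration rather than the benchmarks' official toolkits. The single numbers shown in the plot legends are conventionally the area under the success curve and the precision at the 20-pixel threshold.

```python
# Minimal sketch of the OPE metrics: success is the fraction of frames whose IoU
# with the ground truth exceeds a threshold, precision is the fraction of frames
# whose center-location error is below a threshold.
import numpy as np

def iou(a, b):
    """a, b: (N, 4) boxes as (x1, y1, x2, y2)."""
    x1 = np.maximum(a[:, 0], b[:, 0]); y1 = np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 2], b[:, 2]); y2 = np.minimum(a[:, 3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def ope_curves(pred, gt):
    """pred, gt: (N, 4) per-frame boxes; returns (success curve, precision curve)."""
    overlaps = iou(pred, gt)
    centers_p = (pred[:, :2] + pred[:, 2:]) / 2
    centers_g = (gt[:, :2] + gt[:, 2:]) / 2
    dist = np.linalg.norm(centers_p - centers_g, axis=1)
    success = np.array([(overlaps > t).mean() for t in np.linspace(0, 1, 21)])
    precision = np.array([(dist <= t).mean() for t in np.arange(0, 51)])
    return success, precision
```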
Fig. 5. Comparison of success and precision plots on OTB100 with state-of-the-art methods (success: Ours 0.698, SiamCAR 0.697, SiamRPN++ 0.696, SiamBAN 0.696, KYS 0.695, Ocean 0.684, DaSiamRPN 0.658, GradNet 0.639, SiamRPN 0.629; precision: Ocean 0.920, SiamRPN++ 0.915, SiamBAN 0.910, SiamCAR 0.910, Ours 0.906, KYS 0.904, DaSiamRPN 0.880, GradNet 0.861, SiamRPN 0.847).

TABLE II
QUANTITATIVE COMPARISON OF OUR TRACKER AND ITS VARIANTS WITH DIFFERENT FEATURE FUSION ON UAV123 AND VOT2018

Dataset    Fusion     Success(↑)  Precision(↑)  A(↑)    EAO(↑)
UAV123     DW-Xcorr   0.623       0.817         -       -
UAV123     LTFF       0.638       0.840         -       -
VOT2018    DW-Xcorr   -           -             0.591   0.374
VOT2018    LTFF       -           -             0.596   0.457

C. Ablation Study

To compare with cross-correlation-based methods, we replace the LTFF with the DW-Xcorr layer [3], which performs best among cross-correlation-based methods. As shown in Tab. II, compared with DW-Xcorr, the LTFF improves success and precision by 1.5% and 2.3%, respectively, on the UAV123 dataset. On VOT2018, the LTFF improves the EAO score by 8.3% compared with DW-Xcorr.

V. CONCLUSIONS

We have presented a novel Siamese network for visual tracking. We introduce a lightweight transformer feature fusion method for enhancing and fusing features from the two Siamese network branches. Additionally, an anchor-free prediction head is utilized to improve tracking accuracy. The new tracker significantly improves robustness against deformation, camera motion, scale variation, and occlusion. Extensive experiments on the VOT2018, UAV123, and OTB100 benchmarks show that our method achieves new state-of-the-art results at a real-time running speed of 39 fps.

REFERENCES

[1] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, "High performance visual tracking with siamese region proposal network," in CVPR, 2018.
[2] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, "Distractor-aware siamese networks for visual object tracking," in ECCV, 2018.
[3] B. Li, W. Wu, Q. Wang, F. Y. Zhang, J. L. Xing, and J. J. Yan, "SiamRPN++: Evolution of siamese visual tracking with very deep networks," in CVPR, 2019.
[4] Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji, "Siamese box adaptive network for visual tracking," in CVPR, 2020.
[5] D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen, "SiamCAR: Siamese fully convolutional classification and regression for visual tracking," in CVPR, 2020.
[6] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in ICCV, pp. 9626-9635, 2019.
[7] Y. Yu, Y. Xiong, W. Huang, and M. R. Scott, "Deformable siamese attention networks for visual object tracking," in CVPR, 2020.
[8] N. Wang, W. Zhou, J. Wang, and H. Li, "Transformer meets tracker: Exploiting temporal context for robust visual tracking," in CVPR, 2021.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017.
[10] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong, "Random feature attention," in ICLR, 2021.
[11] H. Law and J. Deng, "CornerNet: Detecting objects as paired keypoints," in CVPR, 2018.
[12] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, "RepPoints: Point set representation for object detection," in CVPR, pp. 9657-9666, 2019.
[13] H. Law, Y. Teng, O. Russakovsky, and J. Deng, "SiamCorners: Siamese corner networks for visual tracking," arXiv:1904.08900, 2021.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[16] I. Schlag, K. Irie, and J. Schmidhuber,
"Linear transformers are secretly fast weight programmers," in ICML, 2021.
[17] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in ECCV, 2020.
[18] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, "UnitBox: An advanced object detection network," in ACM MM, pp. 516-520, 2016.
[19] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, et al., "Microsoft COCO: Common objects in context," in ECCV, 2014.
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, et al., "ImageNet large scale visual recognition challenge," in IJCV, 2015.
[21] L. Huang, X. Zhao, and K. Huang, "GOT-10k: A large high-diversity benchmark for generic object tracking in the wild," in TPAMI, 2019.
[22] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojir, G. Bhat, et al., "The sixth visual object tracking VOT2018 challenge results," in ECCV Workshops, 2018.
[23] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte, "Learning discriminative model prediction for tracking," in ICCV, 2019.
[24] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ATOM: Accurate tracking by overlap maximization," in CVPR, 2019.
[25] G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, and M. Felsberg, "Unveiling the power of deep tracking," in ECCV, 2018.
[26] A. Lukezic, T. Vojir, L. Cehovin Zajc, J. Matas, and M. Kristan, "Discriminative correlation filter tracker with channel and spatial reliability," in IJCV, 2018.
[27] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, "Beyond correlation filters: Learning continuous convolution operators for visual tracking," in ECCV, 2016.
[28] M. Müller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem, "TrackingNet: A large-scale dataset and benchmark for object tracking in the wild," in ECCV, 2018.
[29] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling, "LaSOT: A high-quality benchmark for large-scale single object tracking," 2018.
[30] M. Muller, N. Smith, and B. Ghanem, "A benchmark and simulator for UAV tracking," in ECCV, 2016.
[31] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ECO: Efficient convolution operators for tracking," in CVPR, 2017.
[32] E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke, "YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video," in CVPR, 2017.
[33] Y. Wu, J. Lim, and M. H. Yang, "Object tracking benchmark," in TPAMI, 2015.
[34] Z. Zhang, H. Peng, J. Fu, B. Li, and W. Hu, "Ocean: Object-aware anchor-free tracking," in ECCV, 2020.
[35] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte, "Know your surroundings: Exploiting scene information for object tracking," in ECCV, pp. 205-221, 2020.
[36] P. Li, B. Chen, W. Ouyang, D. Wang, X. Yang, and H. Lu, "GradNet: Gradient-guided network for visual object tracking," in ICCV, 2019.