
2 Evaluation of advantages and disadvantages

2.1 Advantages

• The model improves the quality of pedestrian detection and tracking, and handles small-sized pedestrians well.

• The model inherits the strength of the original FairMOT model in online pedestrian tracking.

• The model is end-to-end and trains both tasks jointly in a single model, which reduces training time.

2.2 Disadvantages

• The model still fails to detect and track occluded objects;

• Tracking results become confused when the number of pedestrians is too large.

3 Future development directions

In the future, the model can be developed to address the remaining issues identified in this thesis:

• Increasing the training data: in this thesis, the proposed model was trained only on the MOT datasets and did not use external datasets such as Caltech Pedestrian, CityPersons [19], CUHK-SYSU [20], PRW [21], and ETHZ [22]. Diversifying the training data could improve the model's results.

• Exploring other attention modules for this problem, in order to strengthen the learning capability and the pedestrian tracking results, building on the positive impact of attention modules demonstrated in this thesis. Improving the attention modules in the proposed model is therefore a promising future direction.

• Improving detection and tracking of occluded objects. This is a limitation of the current model: tracking fails when an object cannot be detected because it is blocked by an obstacle.

Closing remarks

During this thesis, I learned and explored a great deal of specialized knowledge related to the problem of pedestrian tracking using deep learning. In addition, the process of selecting datasets, building the model, and conducting evaluation experiments gave me many valuable lessons. The problems, knowledge, and experience gained along the way will motivate me to keep learning and to continue growing on the career path ahead.

I sincerely thank the lecturers of Ho Chi Minh City University of Technology, and especially Dr. Nguyen Duc Dung, a dedicated teacher who inspired me and wholeheartedly guided and supported me throughout this thesis. My sincere thanks to him.

List of scientific publications

1. D.M. Chuong and N.D. Dung, "Attention mechanics for improving online Multi-Object Tracking," in Proceedings of the 2022 Asia Conference on Algorithms, Computing and Machine Learning (CACML 2022), Shanghai, China, 2022, pp. 201-206.

Minh Chuong Dang

Department of Computer Science, Ho Chi Minh City University of Technology
Ho Chi Minh City, Vietnam
1870356@hcmut.edu.vn

Duc Dung Nguyen

Department of Computer Science, Ho Chi Minh City University of Technology
Ho Chi Minh City, Vietnam
nddung@hcmut.edu.vn

Abstract—In recent years, FairMOT has become a well-known online one-shot model for tracking pedestrians, focusing on fairness between the detection and re-identification (re-ID) tasks with remarkable performance. In this paper, we integrate attention modules that carry more object-related information to improve the performance of FairMOT. First, we propose a spatial attention module, a combination of deformable convolution and key-content factors, to improve detection accuracy. Furthermore, we introduce a channel attention module in the re-ID branch that enhances tracking capability. Our experimental evaluation shows that our extensions increase IDF1 and MOTA on the MOT17 and MOT20 tracking challenges with the provided training data only.

Keywords—multi-object tracking, channel attention, spatial attention

I. INTRODUCTION

Pedestrians are vulnerable participants in traffic because they have little protection. Nowadays, pedestrian tracking is a topic of particular interest in the field of autonomous vehicles, and multi-object tracking of pedestrians has received a lot of attention. Multi-object tracking (MOT) is a joint task of detection and association, which estimates the location and scale of objects to predict object trajectories in a video sequence. Recently, two main cutting-edge approaches have emerged: tracking-by-detection and joint-detection-and-tracking.

Many early works, such as [1], [2], [3], address the MOT problem by tracking-by-detection (TBD). The TBD method separates MOT into two steps: a detection step, in which objects are detected in each frame, and an association step, which tracks those objects. Trackers in this approach use deep learning models as detectors to extract bounding boxes for the target objects. The association step then applies the Kalman filter and the Hungarian algorithm as simple and fast methods. Most state-of-the-art trackers follow the tracking-by-detection method. However, with this "detection first, association second" scheme, the learning process cannot be shared between the two steps. Joint-detection-and-tracking (JDT) emerged to combine the two learning processes. The recent success of the JDT approach in multi-object tracking [4], [5], [6] resolves this disadvantage of the TBD approach. In JDT, detection and association share a backbone network, and the network is optimized end-to-end.

Although the tracking-by-detection approach achieves high accuracy, joint-detection-and-tracking is the proper approach for real-time tracking. Besides, the attention mechanism is also gaining interest because it improves the performance of computer vision tasks, including those in autonomous driving. We use attention modules together with a JDT tracker to increase accuracy while keeping the tracker real-time.

In this paper, we present a network, ACSMOT (Attentional Channel Spatial Multi-Object Tracking), that follows the joint-detection-and-tracking approach and uses attention modules. FairMOT [4] is chosen as the baseline model for real-time tracking. This baseline takes frames as input and extracts features for detection and re-ID using a multi-scale encoder-decoder backbone. We show that our model can effectively apply attention modules to spatial and channel features. Thanks to these attention modules, our approach achieves better performance than the baseline model with the provided training data only.

To demonstrate the effectiveness of our network, we conducted experiments on MOT17 and MOT20 comparing the original FairMOT with our proposal. The results show the improvement of our design in both the overall model design and the attention module design.

II. RELATED WORK

Deep Learning in MOT. Deep MOT methods fall into two approaches: tracking-by-detection (TBD) and joint-detection-and-tracking (JDT). TBD treats the MOT task as two independent tasks, detection and re-ID. First, a CNN-based backbone, typically Faster R-CNN [1] or YOLOv3 [7], detects bounding boxes in each frame. Then, the output of the previous step is fed to an association network to extract re-ID features. Alongside deep learning, Intersection over Union (IoU), the Kalman filter, and the Hungarian algorithm are used to predict trajectories by estimating the bounding-box location in the next frame, as sketched below. JDT is an end-to-end approach that joins the detection and association tasks in a single model. Also known as one-shot learning, this approach attracts researchers. For example, Track-RCNN [8] builds a re-ID head on Mask R-CNN and learns bounding boxes and re-ID embedding features in an end-to-end model. Other examples of this approach include DEFT [6], FairMOT [4], and JDT [5].
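To make the association step concrete, the following is a minimal sketch of Hungarian matching over an IoU-based cost matrix, in the spirit of the TBD trackers cited above; the function name and the 0.7 cost gate are our illustrative choices, not taken from any cited tracker.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost: np.ndarray, max_cost: float = 0.7):
    """Match tracks (rows) to detections (columns) given a cost matrix,
    e.g. cost[i, j] = 1 - IoU(Kalman-predicted box i, detected box j)."""
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    unmatched_tracks = sorted(set(range(cost.shape[0])) - {r for r, _ in matches})
    unmatched_dets = sorted(set(range(cost.shape[1])) - {c for _, c in matches})
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched tracks are typically kept alive for a few frames by the Kalman filter, while unmatched detections start new tracks.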

Attention mechanism. Attention allows a neural network to focus heavily on the components related to the input of the problem. This mechanism has achieved success in computer vision problems such as object detection and semantic segmentation [10], [11], [12], [13]. Recently, the effects of the spatial and channel attention mechanisms in deep learning were analyzed in [14], [15]. This analysis shows that it is possible to combine the deformable convolution network [16] with statistical information related to the local neighborhood of the target pixel and achieve the best accuracy-speed trade-off among attention designs. Moreover, attention mechanisms have also been applied to the MOT task, for example in CSTrack [17], which proposed a novel cross-correlation network to improve cooperation in the learning process of the two tasks, detection and re-ID. Spatial attention and channel attention have also been suggested to improve tracking at multiple scales.

III. ACSMOT

Figure 1: Architecture of ACSMOT. ACSMOT includes three components: Feature Extractor (A), Detection Head (B), and Re-ID Head (C). Images are input to the Feature Extractor to extract a feature map. That feature map is then fed to the Detection Head and the Re-ID Head to obtain bounding boxes and their identity embeddings.

We inherit the primary design of FairMOT [4], which aims to be a framework that provides fairness between the detection and association tasks while maintaining accuracy and speed. Our model is built according to the joint-detection-and-tracking approach, combining the detection and re-ID tasks in a one-shot model, as illustrated in Fig. 1, to ensure real-time inference. In this section, we present the technical design of our proposals.

A. Feature Extractor

In this model, we propose to extend the FairMOT [4] model with attention modules. To improve the model's results, we enhance the Feature Extractor, which strengthens the spatial representation by using a spatial attention module; this idea was inspired by the analysis of the effect of spatial attention on convolutional network models in [17], [13]. To extend the appearance feature learning of the original image, we add a spatial attention module (SAM) combined with deformable convolution [16] in the DLA-34 [18] backbone. The SAM improves the extraction of appearance features using the local neighborhoods of the target pixel, to achieve the best accuracy-efficiency trade-off for computer vision tasks. Our combination retains the advantage that DLA-34 offers with its ability to represent objects at different scales. Therefore, the feature extraction module with spatial attention improves performance and meets the original expectations.

Different from the backbone of the original FairMOT [4], we propose a spatial attention module on top of DCN [16], which is intended to further extend spatial representation learning with the local neighborhood of target pixels. This addresses the problem of background noise confusing the detection. This improvement is implemented in the up-sampling steps of the Feature Extractor, shown in Fig. 2b. We denote the feature from the DCNs as $F \in \mathbb{R}^{C \times H \times W}$, where $C$ takes the values 64, 128, and 256. The outputs of the DCN are passed through the spatial attention module (SAM), illustrated in Fig. 2c, to extract a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$, as illustrated in Fig. 2b. The final feature outputs $F_s$ are the result of the up-sampling steps in the DLA-34 backbone, and $F_s$ is computed as follows:

$$F_s = F \otimes M_s \oplus F \qquad (1)$$

where $\otimes$ denotes element-wise multiplication and $\oplus$ denotes element-wise addition. The output of the Feature Extractor is passed to both the detection and re-ID branches. In the following, we describe the spatial attention module (SAM).

Spatial Attention Module (SAM). We first pass $F$ through both average pooling and max pooling to obtain statistical information related to the local neighborhood pixels; the outputs are $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$, respectively. Then we concatenate $F^s_{avg}$ and $F^s_{max}$ into $F^s_{pool} \in \mathbb{R}^{2 \times H \times W}$. Finally, a $5 \times 5$ convolution operation, denoted $f^{5 \times 5}$, is used to generate the spatial attention feature map $M_s$. In short, the spatial attention is computed as:

$$M_s = \sigma(f^{5 \times 5}([\mathrm{MaxPool}(F); \mathrm{AvgPool}(F)])) = \sigma(f^{5 \times 5}([F^s_{max}; F^s_{avg}])) = \sigma(f^{5 \times 5}(F^s_{pool})) \qquad (2)$$
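For concreteness, the following is a minimal PyTorch sketch of the SAM in Eq. (2) and of one up-sampling stage combining it with deformable convolution, as in Eq. (1) and Fig. 2b. It assumes channel-wise pooling in the style of CBAM; the class names, the offset-prediction convolution, and the channel sizes are our own illustrative choices, not the paper's released code.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SpatialAttentionModule(nn.Module):
    """SAM, Eq. (2): pool across channels, concatenate, 5x5 conv, sigmoid."""
    def __init__(self, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f_max = f.amax(dim=1, keepdim=True)        # MaxPool(F):  (N, 1, H, W)
        f_avg = f.mean(dim=1, keepdim=True)        # AvgPool(F):  (N, 1, H, W)
        f_pool = torch.cat([f_max, f_avg], dim=1)  # F^s_pool:    (N, 2, H, W)
        m_s = torch.sigmoid(self.conv(f_pool))     # M_s:         (N, 1, H, W)
        return f * m_s + f                         # Eq. (1): F_s = F (x) M_s (+) F

class DeformableUpStage(nn.Module):
    """One up-sampling step of the Feature Extractor: DCN followed by SAM (Fig. 2b)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # A plain conv predicts 2 offsets per position for each of the 3x3 kernel taps.
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, 3, padding=1)
        self.dcn = DeformConv2d(in_ch, out_ch, 3, padding=1)
        self.sam = SpatialAttentionModule()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.dcn(x, self.offset(x))  # F, the DCN output
        return self.sam(f)               # F_s

# e.g. stage = DeformableUpStage(128, 64); out = stage(torch.randn(1, 128, 76, 136))
```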

Figure 2: Feature Extractor. (a) Structure of the Feature Extractor. (b) Up-sampling with the spatial attention module (SAM). (c) Spatial attention module (SAM).

Figure 3: Re-ID Head. (a) Structure of the Re-ID Head with CAM. (b) Channel attention module (CAM).

B. Detection branch

We build a detection branch as in the previous work FairMOT, combined with an anchor-free method; this branch takes the feature map extracted from the current single frame. The features are then passed to three heads to estimate the heatmap, the object center offset, and the object size of the bounding box. Specifically, in each head the feature map is passed through a $3 \times 3$ convolution followed by a $1 \times 1$ convolution to obtain the desired output. The heatmap head is used to locate the object's center. To determine the ground truth during training, we compute the center map from the image of size $H \times W$, in which the center of each object is assigned a value of 1. The box offset and size heads are used to calculate the exact position of the object relative to the center coordinates detected by the heatmap head.
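As an illustration, a minimal sketch of the three anchor-free heads described above follows; the 256 intermediate channels, the ReLU, and the 64-channel extractor output follow the common FairMOT/CenterNet convention and are assumptions on our part.

```python
import torch.nn as nn

def make_head(in_ch: int, out_ch: int, mid_ch: int = 256) -> nn.Sequential:
    """One detection head: a 3x3 convolution followed by a 1x1 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, 1),
    )

heatmap_head = make_head(64, 1)  # object-center confidence (1 pedestrian class)
offset_head = make_head(64, 2)   # sub-pixel (dx, dy) center offset
size_head = make_head(64, 2)     # bounding-box width and height
```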

C. Re-ID branch

On the other hand, we use a channel attention module to improve the identity embedding feature in the re-ID branch. This attention module is inspired by CSTrack [17], which uses a spatial-channel attention module to enhance the representation of objects in the re-ID head: the spatial attention module supports appearance learning by suppressing background noise, while the channel attention module focuses on improving the ID embedding. The re-ID branch receives the feature maps from the previous output and extracts the identity embeddings for the objects, as illustrated in Fig. 3a. During training, the re-ID branch learns identity embeddings to classify objects according to the ground truth. We denote the image size as $H_{image} \times W_{image}$ and the output of the Feature Extractor as $E \in \mathbb{R}^{C \times H \times W}$, where $H = \lfloor H_{image}/4 \rfloor$ and $W = \lfloor W_{image}/4 \rfloor$. To achieve the final output, we pass the feature map $E$ through the Channel Attention Module (CAM), illustrated in Fig. 3b, to obtain the attended feature map $E_c \in \mathbb{R}^{256 \times H \times W}$:

$$E_c = E' \otimes M_c \oplus E' \qquad (3)$$

where $E' \in \mathbb{R}^{256 \times H \times W}$ is the 256-channel projection of $E$ defined below, $M_c \in \mathbb{R}^{256 \times 1 \times 1}$ is the channel attention feature map, $\otimes$ denotes element-wise multiplication, and $\oplus$ denotes element-wise addition.

Channel Attention Module (CAM). We use a $3 \times 3$ convolution layer $f^{3 \times 3}$ to convert the feature map from the Feature Extractor into 256 channels, $E' = f^{3 \times 3}(E) \in \mathbb{R}^{256 \times H \times W}$, and build a channel attention module (CAM) on top of it.

[Table I: Effect of CAM in our framework. On the MOT17 half-train split, NUM 3 (without CAM) reaches 69.8 MOTA and NUM 4 (with CAM) reaches 71.1 MOTA.]

Table II: Effect of SAM in our framework

NUM | SAM (Feature Extraction) | SAM (Detection Head) | CAM (Re-ID Head) | MOTA | IDF1 | IDs
1 | x | | | 82.8 | 82.5 | 602
2 | x | x | | 83.0 | 81.2 | 542
3 | x | | x | 82.6 | 80.5 | 573
4 | x | x | x | 82.1 | 80.6 | 675

The CAM is built by passing $E'$ through average-pooling and max-pooling operations to obtain $E^c_{avg}$ and $E^c_{max} \in \mathbb{R}^{256 \times 1 \times 1}$. Each pooling output is processed by a shared multi-layer perceptron (MLP), which consists of a $1 \times 1$ convolution and a fully connected layer. The two outputs are finally combined by element-wise summation to obtain $M_c \in \mathbb{R}^{256 \times 1 \times 1}$. The 1D channel attention map is computed as follows:

$$E' = f^{3 \times 3}(E)$$
$$M_c = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(E')) \oplus \mathrm{MLP}(\mathrm{MaxPool}(E'))) = \sigma(\mathrm{MLP}(E^c_{avg}) \oplus \mathrm{MLP}(E^c_{max})) = \sigma(f^c(\gamma(E^c_{avg})) \oplus f^c(\gamma(E^c_{max}))) \qquad (4)$$

where $\sigma$ denotes the sigmoid activation function, $\gamma$ denotes the ReLU activation function, and $f^c$ is a shared-weight fully connected layer.
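A matching PyTorch sketch of the CAM in Eqs. (3)-(4) follows. The paper specifies a shared MLP built from a $1 \times 1$ convolution and a fully connected layer; on a $1 \times 1$ spatial map a $1 \times 1$ convolution acts as a fully connected layer, so the sketch uses two $1 \times 1$ convolutions, and the reduction ratio of 16 is our assumption.

```python
import torch
import torch.nn as nn

class ChannelAttentionModule(nn.Module):
    """CAM, Eq. (4): shared MLP over average- and max-pooled channel descriptors."""
    def __init__(self, channels: int = 256, reduction: int = 16):
        super().__init__()
        # Shared MLP f^c with the ReLU gamma in between (reduction is assumed).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),  # gamma in Eq. (4)
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, e_prime: torch.Tensor) -> torch.Tensor:
        # e_prime: E' in R^{256 x H x W} (batched), the 3x3-conv projection of E.
        e_avg = e_prime.mean(dim=(2, 3), keepdim=True)           # E^c_avg: (N, 256, 1, 1)
        e_max = e_prime.amax(dim=(2, 3), keepdim=True)           # E^c_max: (N, 256, 1, 1)
        m_c = torch.sigmoid(self.mlp(e_avg) + self.mlp(e_max))   # M_c in Eq. (4)
        return e_prime * m_c + e_prime  # Eq. (3): E_c = E' (x) M_c (+) E'
```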

Finally, we apply a convolution layer with 128 kernels to extract the identity embedding feature $E_{reID} \in \mathbb{R}^{128 \times H \times W}$ for each location of the heatmap. The re-ID loss definition and training method are the same as those of JDT [5] and FairMOT [4].

IV. EXPERIMENT

In this section, we introduce the datasets and implementation details in IV-A. The ablation study is presented in IV-B. Finally, we evaluate the output of our model against the state of the art in IV-C.

A. Datasets and Implementation Settings

MOT: In our experiments, we only use the datasets provided by the MOT challenge [22]. Particularly, they are MOT16/MOT17 and MOT20 [23], which have been annotated with high accuracy, strictly following a well-defined protocol. The MOT datasets choose pedestrians as the primary target; MOT17 provides seven sequences for training and seven sequences for testing, while MOT20 has only four training sequences and four test sequences. We evaluate the experiments either on the MOT16 training set with ground truth, or on the MOT17 and MOT20 test sets via the MOT challenge server. Following the MOT challenge benchmarks, we use multi-object tracking accuracy (MOTA), multi-object tracking precision (MOTP), identity F1 score (IDF1), identity switches (IDs), Mostly Tracked (MT), and Mostly Lost (ML) as metrics to evaluate our methods.
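For reference, these metrics can be computed with the py-motmetrics package; the sketch below assumes a hypothetical `sequence` iterable that yields per-frame ground-truth and predicted IDs with (x, y, w, h) boxes.

```python
import motmetrics as mm

acc = mm.MOTAccumulator(auto_id=True)
for frame in sequence:  # hypothetical per-frame ground truth and predictions
    # Pairs with IoU below 0.5 are treated as impossible matches.
    dists = mm.distances.iou_matrix(frame.gt_boxes, frame.pred_boxes, max_iou=0.5)
    acc.update(frame.gt_ids, frame.pred_ids, dists)

mh = mm.metrics.create()
summary = mh.compute(acc, name='MOT16-train', metrics=[
    'mota', 'motp', 'idf1', 'num_switches', 'mostly_tracked', 'mostly_lost'])
print(mm.io.render_summary(summary))
```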

Implementation Settings: Our method is implemented in PyTorch, a popular open-source machine learning framework. We use DLA-34 as the backbone for feature extraction and design our framework with the JDT [5] structure, following FairMOT [4]. We train our model on the MOT17/MOT20 training data for 30 epochs with a batch size of 8. We use the Adam optimizer with an initial learning rate of $10^{-4}$, decayed to $10^{-5}$ at epoch 20. The input images are enriched by data augmentation, including scaling, rotation, and color jittering. For training we use Google Colab Pro, a popular hosted Jupyter Notebook service, with NVIDIA P100 or Tesla T4 GPUs and 25 GB of RAM. This step takes 10 hours for the MOT17 dataset and 20 hours for the MOT20 dataset.
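The schedule above corresponds to the following PyTorch sketch, where `model` and `train_loader` are hypothetical stand-ins for the ACSMOT network and the augmented MOT data loader.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Multiply the learning rate by 0.1 at epoch 20: 1e-4 -> 1e-5.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)

for epoch in range(30):
    for images, targets in train_loader:  # batch size 8
        loss = model(images, targets)     # joint detection + re-ID loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```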

B. Ablation Studies

In this section, we study the effect of each attention module, CAM and SAM, on the components of our tracking model.

Effect of CAM: First, we experiment with the CAM module. In Table I, we show a comparison between our proposal with CAM in the Re-ID head and FairMOT [4]. We use the full MOT17 training set for training and the MOT16 training set for validation (NUM 1 vs. NUM 2). In addition, we split the MOT17 training data into a first half for training and the second half for validation (NUM 3 vs. NUM 4), which ensures that nothing overlaps and the comparison is fair. The new change leads to a clear improvement: NUM 4 achieves a 1.3-point increase compared with NUM 3.
