
Toward data efficient multiple object tracking



DOCUMENT INFORMATION

Basic information

Title Toward Data-Efficient Multiple Object Tracking
Authors Phan Xuan Thanh Lam, Tran Ho Minh Thong
Advisor Dr. Nguyen Duc Dung
University Ho Chi Minh National University
Major Computer Science
Type Bachelor Degree Thesis
Year of publication 2021
City Ho Chi Minh City
Format
Pages 95
File size 1.86 MB

Content

HO CHI MINH NATIONAL UNIVERSITY
HCM CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

BACHELOR DEGREE THESIS

TOWARD DATA-EFFICIENT MULTIPLE OBJECT TRACKING

Council: Computer Science
Thesis Advisor: Dr. Nguyen Duc Dung
Reviewer: Dr. Nguyen Hua Phung

Students:
Phan Xuan Thanh Lam 1710163
Tran Ho Minh Thong 1710314

HO CHI MINH CITY, 07/2021

Declaration

We declare that the thesis entitled “TOWARD DATA-EFFICIENT MULTIPLE OBJECT TRACKING” is our own work under the supervision of Dr. Nguyen Duc Dung. We declare that the information reported here is the result of our own work, except where references are made. The thesis has not been accepted for any degree and is not concurrently submitted to any candidature for any other degree or diploma.

Abstract

Multiple object tracking (MOT) is the task of estimating the trajectories of several objects as they move around a scene. MOT is an open and attractive research field with a broad range of categories and applications, such as surveillance, sports analysis, human-computer interfaces, and biology. The difficulty of the problem lies in several challenges, such as frequent occlusions and intra-class and inter-class variations. Recently, deep learning MOT methods have confronted these challenges effectively and led to groundbreaking results. Therefore, these methods are used in almost all state-of-the-art MOT algorithms.

Despite their successes, deep learning MOT algorithms, like other deep learning-based algorithms, are data-hungry and require a large amount of labeled data to work. At the same time, annotating MOT data usually consists of manually labeling the positions of objects in every video frame (with bounding boxes or segmentation) and assigning each object a single identity (ID), such that different objects have different IDs and the same object in different frames has the same ID. This makes annotating MOT data a very time-consuming task. To solve the data problem in deep learning MOT
algorithms, in this thesis we propose a method that needs only the annotations of object positions. Experiments show that our method is comparable to the current state-of-the-art method despite the lack of object ID labels. In addition, we found that current annotation tools, such as the Computer Vision Annotation Tool [77] and SuperAnnotate [78], are not well integrated with MOT models and lack features that MOT problems need. Therefore, in this thesis we also develop a new annotation tool. It supports automatic labeling via our proposed MOT model. Moreover, our tool provides many convenient features that increase the automation of the labeling process, control the accuracy and rationality of the results, and improve the user experience.

To sum up, our main contributions in this thesis are twofold:
• Our first major contribution is an MOT algorithm that is comparable to state-of-the-art algorithms without needing object ID labels. This can significantly reduce the cost of manually labeling data.
• As a second contribution, we build an annotation tool. Our tool supports automatic annotation and many features that help speed up the labeling of MOT data.

Acknowledgments

We would like to thank Mr. Nguyen Duc Dung for guiding us to important publications and for the stimulating questions on artificial intelligence. The meetings and conversations were vital in inspiring us to think outside the box and from multiple perspectives, and to form a comprehensive and objective critique.

Contents

1 Introduction
1.1 The Multiple Object Tracking Problem
1.2 Introduction to MOT algorithms
1.3 Labelling tool for Multiple Object Tracking
1.4 Objective
1.5 Thesis outline

2 Contrastive Learning & Object Detection
2.1 Contrastive Learning
2.1.1 Self-Supervised Representation Learning
2.1.2 Contrastive Representation Learning
2.1.2.1 Framework of contrastive learning
2.1.2.2 SimCLR
2.1.2.3 MoCo
2.1.2.4 SwAV
2.1.2.5 Barlow Twins
2.2 Object Detection
2.2.1 Two-stage Detectors
2.2.2 One-stage Detectors

3 Related Work
3.1 DeepSort
3.2 JDE
3.3 FairMOT
3.4 CSTrack
3.5 MOT Metric
3.5.1 Classical metrics
3.5.2 CLEAR MOT metrics
3.5.2.1 ID scores

4 Proposed Algorithm
4.1 Proposed architecture and algorithm
4.2 Proposed augmentation

5 Experiment
5.1 Overview
5.1.1 Experimental environment
5.1.2 Datasets and Metrics
5.1.3 Ablative Studies
5.1.3.1 Robustness of the proposed augmentation
5.1.3.2 Results of different contrastive objectives
5.1.3.3 Results of different batch sizes on SimCLR
5.1.3.4 Results on MOTChallenge

6 Analysis of the annotation tool
6.1 System objectives
6.2 An analysis of system objectives
6.3 System requirements
6.4 Use case diagram

7 Design of the annotation tool
7.1 Overall architecture
7.2 Deployment
7.3 Entity relationship
7.3.1 Entity relationship diagram
7.3.2 Entities explanation
7.3.3 Deletion mechanism

8 Used technologies of the annotation tool
8.1 Front-end module
8.1.1 Angular
8.1.2 Ag-grid
8.2 Back-end module
8.2.1 Django
8.2.2 REST framework
8.2.3 Simple JWT
8.2.4 OpenCV
8.2.5 Request
8.2.6 Django cleanup
8.3 Database module
8.3.1 PostgreSQL
8.4 File-storage module
8.4.1 S3 service
8.5 Artificial-intelligence module
8.5.1 Google Colaboratory
8.6 Docker

9 Implementation of the annotation tool
9.1 Front-end module
9.1.1 Boxes rendering problem
9.1.2 Implementation of dynamic annotations
9.1.3 Implementation of the custom event managers
9.1.4 Implementation of the drawing new annotation feature
9.1.5 Implementation of the interpolating feature
9.1.6 Implementation of the filtering feature
9.2 Back-end module
9.2.1 Implementation of the multiple objects tracking feature
9.2.2 Implementation of the single object tracking feature
9.2.3 Implementation of commission related features

10 System evaluation
10.1 Evaluation
10.1.1 Strengths
10.1.2 Weaknesses
10.2 Comparison to CVAT
10.3 Development strategy
10.4 Contribution

11 Conclusion
11.1 Achievement
11.2 Future Work

List of Tables

5.1 Training data statistics
5.2 Embedding performance of different augmentations
5.3 IDF1 and MOTA with and without our augmentation, best performance shown in bold
5.4 Tracking accuracy of different contrastive methods as well as the supervised method
5.5 True positive rate at different false accepted rates of different embedding supervised method
5.6 Performance of SimCLR at different batch sizes
5.7 Comparison of our method to the state-of-the-art method
10.1 Comparison with CVAT

List of Figures

1.1 An illustration of the output of a MOT algorithm [1]
1.2 Some applications of MOT
1.3 Workflow of MOT
2.1 Self-supervised approach of BERT [39]
2.2 Self-supervised learning by rotating the entire input images [45]
2.3 Positive and negative sampling pipeline of contrastive learning
2.4 Accuracy at different random resized crops
2.5 Performance of [51] on different batch sizes
2.6 A framework of SimCLR [51]
2.7 SimCLR [51] algorithm
2.8 MoCo [53] algorithm
2.9 A framework of Barlow Twins [54]
2.10 The architecture of R-CNN [61]
2.11 The architecture of Fast R-CNN [62]
2.12 An illustration of the Faster R-CNN model [56]
2.13 Workflow of YOLO [57]
2.14 Model architecture of SSD [58]
2.15 FPN fusion strategy [36]
2.16 Comparison between CenterNet and anchor-based methods
3.1 Overview of the CNN architecture used to extract the embedding feature of DeepSort [6]
3.2 DeepSort [6] matching algorithm
3.3 Architecture of the model used in JDE [7]
3.4 Feature sampling strategy of FairMOT
3.5 FairMOT architecture
3.6 Diagram of the cross-correlation network [9]
3.7 The details of the scale-aware attention network [9]
4.1 Proposed joint detection and contrastive
learning
4.2 Copy paste augmentation
4.3 Some mask image results from running FasterRCNN on CrowdHuman
4.4 Mosaic augmentation
4.5 Some augmentations from the CrowdHuman and MOT17 datasets
5.1 Visualization of the discriminative ability of the embedding features of different contrastive methods
5.2 Example tracking results of our method on the test set of MOT17

IMPLEMENTATION OF THE ANNOTATION TOOL

Moreover, all newly drawn annotations must also be static, as this prevents the user’s actions from affecting them. When the user wants to stop drawing, all current static annotations are replaced with dynamic ones, which can be resized and moved. Up to now, the drawing process looks like this:
1. Replace all dynamic annotations with static annotations.
2. Some drawing steps here.
3. The user escapes drawing mode.
4. Replace all static annotations with dynamic annotations.

To comply with the second assumption, we stop the video on entering drawing mode and escape drawing mode on playing the video. This prevents any chance of the video playing while the application is in drawing mode. However, jumping to another frame while the video is paused is still acceptable. For example, we are drawing something in frame 1000 (the video is paused) and we then jump to frame 2000. This action does not break the rule, as it only changes the current frame without playing the video. In this case, the system removes all existing annotations and replaces them with the static annotations belonging to the destination frame.

We now explain how a new annotation is actually drawn. This process is driven by a series of events:
1. When the mouse down event is triggered inside the video screen: A new static box with the size of × is added to the screen at the position where the event is triggered, and a new annotation data object is created. This static box is displayed with the default configuration.
2. When the mouse move event is triggered inside the video screen: Depending on the
position of the pointer after moving (the new point) relative to the mouse down position (the old point), there are four cases:
• If the new point is to the left of and above the old point, it becomes the top-left corner of the static annotation while the old point becomes the bottom-right corner.
• If the new point is to the right of and above the old point, it becomes the top-right corner of the static annotation while the old point becomes the bottom-left corner.
• If the new point is to the left of and below the old point, it becomes the bottom-left corner of the static annotation while the old point becomes the top-right corner.
• If the new point is to the right of and below the old point, it becomes the bottom-right corner of the static annotation while the old point becomes the top-left corner.
Then, the static annotation is adjusted according to these corners.
3. When the mouse up event is triggered inside the video screen or the mouse leave event is triggered: The drawing process of this newly created annotation is finished. If the static annotation has a positive area (S > 0), a popup is shown to let the user choose an object to assign to it. Then, this static annotation is added to the list of existing static annotations, which is used to remove all of them when the user jumps to another frame or escapes drawing mode. Moreover, its data is also pushed into the list of annotation data and the list of newly created annotations, which later tells the server to create the corresponding new records in the database.

9.1.5 Implementation of the interpolating feature

Interpolating is another way to create new annotations. In our application, it is a pure front-end feature, which means the process does not need extra help from the back-end module. This process receives a list of frames, which are the frames that contain the target object. We call them keyframes. For each range of
frames between a pair of two nearest keyframes, the system generates a new annotation. Its x positions, y positions, widths, and heights are calculated through linear interpolation, using the corresponding attributes of the two annotations from the two keyframes.

9.1.6 Implementation of the filtering feature

In this system, filtering only changes the current representation of the data, not the data themselves. We implemented this feature in an intersection manner. In detail, there can be many criteria for filtering an annotation, and an annotation must satisfy all of them to be rendered. If it does not satisfy all of them, it is hidden. The result of the filtering process is only temporary. It changes when the filtering criteria change. Moreover, because the filtering process does not affect the data, its result is lost on refreshing the view or switching the annotation set.

9.2 Back-end module

9.2.1 Implementation of the multiple objects tracking feature

In this application, multiple objects tracking means automatically tracking all objects without the user’s interference. The user’s only responsibilities are selecting the video on which to perform this operation and naming the resulting annotation set. The rest is handled by the system. This also means that the multiple objects tracking operation cannot be applied to an existing annotation set, as it only creates a new set. We designed the system this way to simplify data processing in the back-end module. However, if the user wants to track some objects from an existing set, our single object tracking feature is a good solution.

The multiple objects tracking feature is implemented by sending a request to the application programming interface provided by the artificial-intelligence module and then processing the response data. The data processing stage can be split into the following steps:
1. Create a new set to store new objects and new annotations.
2. Loop through the response data,
create new annotation data objects and new object data objects.
3. Use Django’s bulk create method to save all annotation data objects to the database at once, and likewise save all object data objects to the database at once.
4. Extract and transform the data into a format that is usable by the front-end module and then send them there.

In step 3, we use the bulk create method instead of the create method, which saves each data object to the database separately, because it significantly reduces processing time by cutting down the number of database accesses. This is crucial for improving the users’ experience.

9.2.2 Implementation of the single object tracking feature

In reality, users are sometimes interested only in some particular objects in the video. Therefore, it is unnecessary to label everything there. Moreover, they may upload an existing annotation set and expect the system to help them track the omitted objects. These are some use cases of the single object tracking feature.

In this project, we do not implement a separate single object tracking model in the artificial-intelligence module but reuse the same multiple objects tracking model (MOT model) to perform this task. Our strategy is to map the tracked object in a frame of the current set to the most similar object in the same frame of a set generated by the MOT model. In detail, if we are editing annotation set A and want to track the object with identity (ID) 2000 from frame 100 to frame 120, the process contains the following steps:
1. Generate a hidden annotation set B from the MOT model: This set is used only for single object tracking purposes and is inaccessible to the end-user.
2. Load set A from the database.
3. Loop from frame 100 to frame 120:
(a) Update key object in set B: If frame k in set A contains an annotation assigned to the object with ID 2000, the system must find the annotation in frame k in set B that is most similar to it.
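The matching in step (a) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis code: boxes are assumed to be `(x, y, width, height)` tuples with a top-left origin, overlap is measured by intersection over union (IoU), and the names `iou`, `most_similar`, and `min_iou` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Assumed box layout (not taken from the thesis): (x, y, width, height).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def most_similar(reference: Box, candidates: Sequence[Box],
                 min_iou: float = 0.0) -> Optional[int]:
    """Index of the candidate with the highest IoU against `reference`.

    Returns None when no candidate overlaps the reference at all, or when
    no candidate reaches the hypothetical `min_iou` cut-off.
    """
    best_index, best_score = None, 0.0
    for index, box in enumerate(candidates):
        score = iou(reference, box)
        if score >= min_iou and score > best_score:
            best_index, best_score = index, score
    return best_index


# One annotation of set A compared with three annotations of set B in the same frame.
tracked = (0.0, 0.0, 10.0, 10.0)
frame_b = [(20.0, 20.0, 5.0, 5.0), (5.0, 0.0, 10.0, 10.0), (1.0, 0.0, 10.0, 10.0)]
key_index = most_similar(tracked, frame_b)
```

With `min_iou` left at 0.0 the function follows a "greatest fraction wins" rule, except that a candidate with no overlap at all is never chosen; that exception is a choice made for this sketch. Passing a larger `min_iou` discards weak matches.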
The object assigned to that annotation in set B is called the key object. Here, the similarity between two annotations is measured by the intersection over union (IoU) fraction.
(b) Clone the key object into set A: If frame k in set A does not contain any annotation assigned to the object with ID 2000 and frame k in set B contains an annotation assigned to the key object, the system clones that annotation and adds the copy to frame k of set A.
4. Return the list of newly cloned annotations to the front-end module.

Here, three major problems arise:
• Time consumption: Each time the user requests a single object tracking operation, a new hidden annotation set is created. This wastes time, because the model has to process the video from the first frame to the last frame, and the time for generating the hidden set is the same for every request on the same video, regardless of the range of frames or the number of tracked objects.
• Unawareness of the current state of the tracked annotation set: Back to the previous example, if set A has unsaved changes, such as newly created or deleted annotations, the tracking result may be incorrect, as it depends on the outdated data of set A stored in the database.
• Incorrect key object: The current strategy for choosing the key object always leads to an object being chosen, even when it is quite different. For example, suppose that in frame 115 of the hidden set B there are four annotations whose IoU fractions are 0.1, 0.1, 0.2, and 0.03, respectively. It is clear that all four annotations have little or no connection to the annotation assigned to the tracked object in set A. However, because the greatest fraction is always chosen, the object assigned to the third annotation is selected.

To fix the time-consumption problem, caching is a good practice. We cache the whole hidden set in the database. Whenever the back-end module receives a single object tracking request, it first checks if
a hidden set belonging to the tracked video already exists in the database. If it does, it is loaded. If this is the first time this kind of request is made for this video, a new hidden annotation set is generated from the MOT model. Moreover, to make the set inaccessible to the user and distinguishable from the normal sets of the same video, a boolean flag is added. It is true if the set is the hidden set and false otherwise.

In terms of data integrity, the front-end module adds the list of created annotations, the list of updated annotations, the list of deleted annotations, and the list of updated objects to the request. The back-end then uses these lists to update the data loaded from the database.

Finally, the incorrect key object problem can be solved by adding a threshold. All annotations in the hidden set with an IoU fraction below this threshold are discarded. Therefore, the system can return a more rational result.

Up to now, everything works fine as long as each request concerns only one object. However, we extended the algorithm so that a request can contain multiple objects, by allowing multiple key objects in each frame. Here, if we are editing annotation set A and want to track the objects with IDs 2000, 2010, and 3000 from frame 100 to frame 120, the process becomes:
1. If the hidden set B is already in the database, load it. Otherwise, generate it from the MOT model.
2. Load set A from the database.
3. Update set A from the list of created annotations, the list of updated annotations, the list of deleted annotations, and the list of updated objects (those lists are extracted from the request).
4. Loop from frame 100 to frame 120, and within each frame loop over each tracked object:
(a) Update the key object of object h in set B: If frame k in set A contains an annotation assigned to the object with ID h, the system must find an annotation in frame k in set B that is most similar to it and
has an IoU fraction greater than or equal to the threshold. The object assigned to that annotation in set B is called the key object of object h.
(b) Clone the key object of object h into set A: If frame k in set A does not contain any annotation assigned to the object with ID h and frame k in set B contains an annotation assigned to the key object of object h, the system clones that annotation and adds the copy to frame k of set A.
5. Return the list of newly cloned annotations to the front-end module.

To sum up, the single object tracking feature allows users to track a collection of objects that they define themselves, limited to a range of frames, while the multiple object tracking feature tracks all objects from the first frame to the last frame. Both are powered by the MOT model, directly or indirectly.

9.2.3 Implementation of commission related features

The system must check the privilege of a user over a particular annotation set. The following cases are possible:
• The set is not commissioned: In this case, only its owner can access it.
• The set is commissioned but not yet accepted by the collaborator: In this case, neither the owner nor the collaborator can access it. However, the collaborator can decide to accept or reject the commission.
• The created commission has already been accepted: In this case, only the collaborator can access the set.
• The created commission has just been finished or recalled: In this case, the commission is deleted and control of the corresponding set is returned to the owner.

Chapter 10 System evaluation

In this chapter, we evaluate the annotation tool by pointing out its strengths and weaknesses and comparing it to a popular existing annotation tool. After that, we make a plan for future development. Finally, all of our contributions are listed.

10.1 Evaluation

10.1.1 Strengths

• This system is well integrated with our multiple objects tracking model. Currently, systems that can work with this kind of tracking
model are quite rare.
• This system utilizes the power of our developed model. In detail, the system can generate annotations for a video automatically using the model. Moreover, it combines the model with some database techniques to implement the regular single object tracking feature in a new way, instead of requiring another single object tracking model.
• The result generated by our model is cached in the database. Therefore, tracking operations take less time.
• Grid-like views are provided. These views contain tables displaying data such as projects, videos, annotation sets, annotations, and objects. They help represent entities in a clean and well-structured way. Moreover, they allow filtering, which helps users get the data they care about quickly and accurately.
• Multiple label sets are allowed for each video. Therefore, the end-user does not have to upload a video again to create a new result set. Moreover, switching between sets is fast and straightforward.
• Constraints are added to provide data integrity and rationality. In detail, the system allows an object to appear only once in each frame. Therefore, whenever the user is about to choose an object to assign to an annotation in a frame, the system shows only objects that do not yet exist in that frame. Moreover, if an object is deleted, all annotations associated with it are also removed.
• Tracing backward and forward on an object is allowed. From an annotation associated with a particular object, the end-user can trace back to the annotation in the previous frame that is associated with the same object, and likewise trace forward. This feature helps end-users assess the accuracy of their labeling results.
• The system preserves the smoothness of the video while keeping all annotations accurate and consistent. It means that the playing process of the video does not make annotations
inconsistent with the frames, and the rendering process does not slow the video down.
• The system provides the linking score for each annotation generated by our model. This score is a hint for end-users when checking the accuracy of the annotation set.
• Annotations can be edited directly on the screen.
• End-users can filter their annotations efficiently.
• The system supports the regular features of a typical annotation tool well, such as fitting to the screen, interpolating, filtering, and drawing.
• The system supports cooperation by commissioning annotation sets to other trusted users. This helps the owner of the dataset split the tasks and scale the work horizontally.

10.1.2 Weaknesses

• Switching between modes, such as drawing mode, interpolating mode, and normal mode, is still slow.
• The system currently allows end-users to import their videos only by uploading them. In contrast, other available tools may also accept videos from a URL or from online storage services such as Google Drive, GitHub, and Dropbox.
• There are only two formats for exporting results: the text format (.txt files) and the comma-separated format (.csv files). It would be better if the system also allowed exporting in video-like formats.
• Undo and redo operations are not yet supported.
• Changes in the status of commissions are not updated in real time. If the owner recalls the commission, the collaborator notices it only after refreshing the page.

10.2 Comparison to CVAT

In this section, we compare our annotation tool to CVAT, a well-known video labeling tool. We start with a short introduction to CVAT. CVAT is an abbreviation of “Computer Vision Annotation Tool”. CVAT is a free, online, interactive video and image annotation tool for computer vision. It is being developed and used by Intel to annotate millions of objects with different properties. Many user interface and user experience decisions are based on feedback from the professional data annotation team. Its
user interface is shown in Figure 10.1.

Figure 10.1: The CVAT tool

In Table 10.1, we compare our annotation tool with CVAT on several criteria.

Criteria | This tool | CVAT
Data type | Only video | Video and image
Purpose | Specialized for pedestrian data | Multiple purposes
Annotate type | Only bounding box | Bounding box, polygon, polyline, etc.
Importing video | Only from device | From the device, from URL, from cloud storage, etc.
Data hierarchy | Project → Video → Set | Project → Task → Job
Multiple label sets for a video | Yes | No
Integrated with MOT model | Yes | No
Cooperation mechanism | Create multiple annotation sets for a video and commission some of them to other end-users | Split the video into packs of consecutive frames called “Job” and assign some of them to other end-users
Allow annotation without associated object | Yes | No
Allow change associated object of an annotation | Yes | No
Merge objects | No | Yes
Views with a table for modifying data | Yes | No
Trace backward or forward an object | Yes for all objects | Not supported for manually drawn objects
Undo and redo | No | Yes
Zoom in or out the video | Yes | Yes
Allow drawing new box | Yes | Yes
Allow interpolating | Yes | Yes
Allow descriptions for objects and boxes | Yes | No
Allow noting uncertain boxes and filtering them later | Yes | No

Table 10.1: Comparison with CVAT

10.3 Development strategy

Having pointed out the weaknesses and compared our system with an available annotation tool (CVAT), we realized that it needs some improvements, such as:
• Speed up the system: Currently, switching between modes of the system is still slow. This time should be reduced to make the system smoother.
• Enhance the graphical interface: The user interface is not attractive enough, especially the annotated view. Here, the on-screen representation of annotations needs innovation. We will study the graphical user interfaces of some available tools and improve ours to enhance users’ experience.
• Support multiple ways of importing videos: The system does not support importing videos via URL or from cloud storage systems such as Google Drive and Dropbox. For that reason, we will implement those new ways of importing videos to give users more flexibility.
• Support more formats for exported result files: In detail, we will allow the system to export data as a video. A cloned version of the original video is created and all annotations are drawn permanently on it. This cloned version is downloaded to the end-user’s device. Moreover, we also plan to let end-users trim their video, so that they can export just the part of the video they care about.
• Support undo and redo operations to improve users’ experience.
• Support real-time notifications and updates for commission-related features.

10.4 Contribution

To sum up, we made the following contributions when implementing this tool:
• Utilise the multiple objects tracking model.
• Increase automation for labeling and tracking tasks.
• Create a new tool specialized for pedestrian data.
• Enhance users’ experience.
• Provide accurately labeled data for further research and commercial use.

Chapter 11 Conclusion

11.1 Achievement

In this thesis, we have proposed an MOT algorithm that learns the object embedding in an unsupervised manner, based on a contrastive learning framework. We have also implemented an annotation tool that supports automatic annotation and many novel features for annotating MOT datasets. To summarize, our main results are:
• Research and survey the current algorithms and approaches in Multiple Object Tracking.
• Propose and implement a novel architecture that leverages the contrastive learning framework, combined with the state-of-the-art FairMOT, to obtain a model that jointly learns detection and embedding without the need to annotate object identities.
• Propose two novel augmentations that help improve the performance of contrastive learning when applied to the MOT problem.
• The
proposed method achieves an IDF1 of 71.9 %, which is only 0.4 % below its baseline, which uses all the annotations.
• Use the proposed model for the automatic annotation feature of the annotation tool.
• Implement an annotation tool that utilizes the power of our model and provides a user-friendly interface for end-users.

11.2 Future Work

Because of time limitations and resource constraints, we trained our MOT model only on pedestrian tracking datasets, because they are the standard datasets for evaluating different MOT algorithms. Since our method needs only detection labels, we can train it on other detection datasets, such as vehicle detection for traffic management. Integrating such diverse tracking categories could make our annotation tool stronger and support end-users better, not only those interested in pedestrian tracking. We leave this as future work.

We tested our proposed method on three different contrastive learning objectives. Recently, contrastive learning has improved greatly, and various new contrastive learning models have been introduced. We leave testing on such new contrastive learning models as future work.

Although we achieve such good results without object ID labels, we still use the detections on all frames to train our model. In a video with a high frame rate, two adjacent frames are very similar. We can leverage optical flow and other techniques to obtain the detections at adjacent frames automatically, namely, given the annotation at frame 1, auto-generating the annotations for the following frames up to frame 10. We also leave the automatic annotation of adjacent frames as future work.

In terms of the annotation tool, we will overcome its current weaknesses as well as develop new features. Moreover, we will diversify our tool by making it support more kinds of data, instead of only pedestrian data. Currently, half of the five modules of our tool are encapsulated in a Docker container and run locally. In future work, we will deploy them on a real server
for the end-users’ convenience.
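The adjacent-frame auto-annotation idea from the future work discussion (Section 11.2) could look roughly like the sketch below. It is only an illustration under stated assumptions, not the method of this thesis: the dense flow field is assumed to be given as a nested list `flow[y][x] = (dx, dy)` produced by some external optical-flow estimator, and the helper names `propagate_box` and `propagate_annotations` are hypothetical. Each box is shifted by the median flow vector inside it, which is one simple way to carry an annotation from one frame to the next.

```python
from statistics import median


def propagate_box(box, flow):
    """Shift a box (x1, y1, x2, y2) by the median flow vector inside it.

    `flow[y][x]` is an assumed dense flow field holding one (dx, dy) pair
    per pixel, mapping pixels of the current frame to the next frame.
    """
    x1, y1, x2, y2 = box
    height, width = len(flow), len(flow[0])
    # Clamp the sampled region to the image so indexing stays valid.
    xs = range(max(0, int(x1)), min(width, int(x2)))
    ys = range(max(0, int(y1)), min(height, int(y2)))
    vectors = [flow[y][x] for y in ys for x in xs]
    if not vectors:
        return box  # the box lies outside the image; leave it unchanged
    dx = median(v[0] for v in vectors)
    dy = median(v[1] for v in vectors)
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


def propagate_annotations(boxes, flows):
    """Carry the boxes of one annotated frame through consecutive flow fields.

    Returns one list of boxes per frame, starting with the annotated frame.
    """
    frames = [list(boxes)]
    for flow in flows:
        frames.append([propagate_box(b, flow) for b in frames[-1]])
    return frames
```

With nine flow fields (frame 1 to frame 10), `propagate_annotations` would return ten lists of boxes. Errors accumulate over frames, so in an annotation tool the propagated boxes would presumably still be shown to the annotator for correction.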

Date posted: 03/06/2022, 16:10
