OVERVIEW
Goal
This project aims to develop a cross-camera tracking system capable of identifying unique individuals captured by the cameras and tracking their movements in recordings. Additionally, it features a function that allows users to locate individuals based on their color appearance within the frame. To enhance the user experience, all functionalities are presented visually through a user-friendly graphic interface.
Limitations
The person identification system operates through a non-online approach, meaning it cannot function in real time; it analyzes each frame of the input video sequentially to generate tracking results. Furthermore, the proposed system has only been evaluated using synthetic datasets and has not been tested on real-world scenarios, and the evaluated scenarios contain only 3 people. In addition, only predefined colors can be used for color-based person searches.
Research method
This study employs a quantitative research method centered on experimental results to evaluate and compare various components of a tracking system. Initially, several experiments utilized different versions of the YOLO object detection framework to identify the best-performing version for detecting and localizing individuals, assessed through the mean Average Precision (mAP) metric. Following this, various experiments were conducted on person re-identification backbone models to determine the most effective model based on mAP, which reflects the backbone's feature extraction capability. The research also included an evaluation of a new alternative to the DeepSORT architecture, focusing on its IDF1 score to assess its effectiveness in accurately associating information across multiple cameras.
Research problems
To address this problem, the researcher has broken the work down into the following tasks:
PROBLEM 1: Look for relevant documents, research related subjects, and then propose solutions
PROBLEM 2: Create a block diagram, describe how the blocks work, and write a description of the project's requirements
PROBLEM 3: Research and select methods to solve requirements for each block
PROBLEM 4: Research and select suitable models and algorithms for object detection, feature extraction, and data linkage between cameras
PROBLEM 5: Research and analyze possible scenarios when a person appears in the frame to serve the design of the dataset
PROBLEM 6: Research to design the interface for the system
PROBLEM 7: Build the system after creating the detailed components and main program flow chart
PROBLEM 8: Test the system experimentally and gather data to evaluate and adjust its operation
PROBLEM 9: Write a thesis and report.
Thesis outline
This chapter introduces the topic, the objectives, the limitations, the related works of the research, and the layout of this thesis.
LITERATURE REVIEW
Object detection
Deep learning has gained significant attention over the past few decades, leading to extensive research aimed at enhancing its methodologies. Recent advancements have produced remarkable outcomes, transforming how deep learning impacts daily life. Notably, deep learning-based object detection techniques address various challenges in fields such as medical image analysis, autonomous vehicles, business analytics, and facial recognition. These object detectors are categorized into one-stage detectors, exemplified by the YOLO series, and two-stage detectors, including R-CNN and Faster R-CNN.
Two-stage detectors split the object detection process into two main phases: proposal generation and prediction. In the proposal generation phase, the detector identifies potential object regions within an image, aiming for high recall to ensure that all objects are represented in at least one proposed region. The second stage employs a deep learning model to classify these proposals, assigning them appropriate labels, which can indicate either background or specific object classes. Additionally, the model has the capability to refine the initial localization provided by the proposal generator.
R-CNN (shown in Fig 2.1 (a)) is a ground-breaking two-stage, deep learning-based object detector proposed in 2014. The R-CNN pipeline is separated into three parts: (1) proposal generation, (2) feature extraction, and (3) region classification. R-CNN produces a sparse collection of roughly 2000 proposals for each image using Selective Search, which excludes regions easily recognized as background. These proposals are then fed to a CNN, which performs classification and bounding box regression. Because the features of each proposal are extracted individually by deep convolutional networks, the computation is highly repetitive. As a result, R-CNN requires a significant amount of time for training and testing [13].
The Faster R-CNN architecture introduces a shared-feature approach across its stages, significantly improving efficiency. By utilizing a convolutional backbone network such as VGG or ResNet, Faster R-CNN generates global feature maps that are shared between the Region Proposal Network (RPN) and the detection network, effectively removing the cost of external proposal generation. Subsequent research has focused on enhancing detection accuracy through improved backbone networks that provide richer feature representations.
Figure 2.1: Overview of different two-stage detection frameworks for generic object detection: (a) RCNN and (b) Faster RCNN [13]
Feature Pyramid Networks (FPN) have been developed to extract Region of Interest (RoI) features from different layers depending on scale. Subsequent advancements, including ResNeXt with grouped convolutions and Res2Net, aimed to enhance the internal connections of residual networks to more effectively leverage multi-scale features from convolutional maps.
One-stage object detectors streamline detection into a single framework consisting of a backbone, a neck, and a head. The backbone is crucial for feature representation and strongly influences inference efficiency, as it accounts for a large portion of the computational cost. The neck integrates low-level physical information with high-level semantic features, creating pyramid feature maps across all levels. The head, composed of multiple convolutional layers, predicts the final detection results based on the multi-level features provided by the neck. Notably, the Single Shot MultiBox Detector (SSD) and YOLO (You Only Look Once) pioneered this unified design, eliminating the need for pre-proposal computations.
Figure 2.2: Overview of different one-stage detection frameworks for generic object detection: (a) YOLO and (b) SSD [13].
YOLO (shown in Fig 2.2 (a)) approached object detection as a regression problem and split the entire image spatially into a predefined number of grid cells (e.g., a 7 × 7 grid). Each cell was treated as a proposal for detecting the presence of one or more objects. The initial implementation allowed up to two objects per cell, generating predictions that included the presence of an object, bounding box coordinates, dimensions (width and height), and the object's class. Nevertheless, YOLO faced certain performance challenges: detecting at most two objects per cell makes it difficult to identify small or cluttered items, and relying solely on the last feature map for predictions proves inadequate for accurately recognizing objects of differing sizes and aspect ratios.
The Single-Shot MultiBox Detector (SSD) (shown in Fig 2.2 (b)) addresses YOLO's limitations. SSD comprises two parts: a backbone model and an SSD head. The backbone, typically a pre-trained image classification network such as ResNet or MobileNet with the last fully connected layer removed, serves as a feature extractor. On top of this backbone, the SSD head adds additional convolutional layers. The resulting outputs are interpreted as bounding boxes and object classifications based on the spatial locations of the final layer activations.
SSD employs multiple feature maps, each tailored to detect objects of specific sizes based on their receptive fields. To enhance the detection of larger objects, additional convolutional feature maps were added to the original backbone architecture. The network is trained end-to-end, optimizing a combined loss function that incorporates both localization and classification losses across all prediction maps. Final predictions are generated by aggregating detection scores from these feature maps. To improve training efficiency, hard negative mining is applied, limiting the influence of negative proposals on gradient updates. Additionally, extensive data augmentation techniques are employed to enhance detection accuracy. As a result, SSD achieves detection performance comparable to Faster R-CNN while enabling real-time inference.
The updated versions of YOLO have significantly enhanced performance while preserving real-time inference speed. YOLOv2 introduced improved anchor priors derived from the training data through k-means clustering, which reduced localization optimization issues. By incorporating Batch Normalization layers and multi-scale training techniques, YOLOv2 achieved strongly competitive results.
YOLOv3 introduced a new feature extractor called Darknet-53, replacing the Darknet-19 used in YOLOv2. This backbone consists of 53 convolutional layers and incorporates ResNet-style residual connections, significantly improving accuracy while maintaining a speed advantage: Darknet-53 matches or exceeds ResNet-101 and ResNet-152 while being roughly 1.5 times and 2 times faster, respectively.
YOLOv4 restructured its detection framework into three main components: backbone, neck, and head. It effectively applied bag-of-freebies and bag-of-specials techniques, optimizing the framework for training on a single GPU.
One-stage object detectors generate output with a single CNN pass, while two-stage detectors take high-score region proposals from the first-stage CNN and process them with a second-stage CNN for the final predictions. The inference time therefore differs between one-stage and two-stage object detectors, which affects their suitability for real-time applications.
The inference time of one-stage detectors is constant, represented as \( T_{one} = T_{1st} \), while for two-stage detectors it varies with the number \( m \) of region proposals whose confidence scores exceed a specific threshold, expressed as \( T_{two} = T_{1st} + m \cdot T_{2nd} \). Consequently, to achieve real-time performance, object detection systems should prioritize one-stage detectors.
Object tracking
Object tracking is a crucial task in computer vision focused on detecting and monitoring objects across image sequences. Its applications span various fields, including traffic surveillance, accident detection, robotics, autonomous vehicle navigation, medical diagnostics, and human activity recognition.
Object tracking faces several significant challenges that can affect its accuracy and reliability. Key issues include illumination variation, where changes in lighting make it hard to differentiate the target from its background. Background clutter also complicates tracking, as similar colors or textures can lead to confusion and errors. Low resolution is problematic when the target occupies only a few pixels within the bounding box, while scale variation adds complexity when the size of the target changes beyond expected limits. Additionally, occlusion can hinder tracking when the target is partially or fully obscured.
Effective object tracking also has to cope with reduced visibility, target position changes due to rotation and deformation, and the complexities introduced by fast motion. These factors necessitate robust tracking mechanisms to ensure accurate tracking across frames, and overcoming them is essential for reliable object tracking in real-world applications.
Figure 2.13: Challenges object tracking algorithms face in the real world: a) illumination variation; b) background clutter; c) low resolution; d) scale variation; e) change of the target position; f) occlusion; g) fast motion
Object tracking can be categorized into two main types: single object tracking and multi-object tracking. Single object tracking focuses on monitoring a specific target throughout a video, beginning with its identification in the first frame and requiring consistent detection in subsequent frames. Effective single object trackers must be capable of tracking any given object, regardless of prior classification model training. In contrast, multi-object tracking involves tracking multiple objects simultaneously; the tracker must first identify the number of objects present in each frame and then maintain the identity of each object across frames. Tracking approaches are developed from basic methods divided into different categories, including feature-based, segmentation-based, estimation-based, and learning-based methods [25].
Figure 2.14: Single object tracking and Multiple object tracking
Feature-based tracking is one of the simplest approaches, involving the extraction of features such as color, texture, and optical flow to distinguish objects in feature space. To identify objects reliably in subsequent frames, these features must be both specific and stable. The process begins with extracting distinguishable features, followed by using a similarity criterion to match the object with the most similar features in the next frame, as illustrated in Fig 2.15. However, a significant challenge lies in the extraction stage, where accurate and trustworthy features must be obtained to differentiate the target object from others.
Figure 2.15: Calculate optical flow in frames [2]
The tracking problem can be transformed into an estimation problem, where an object is represented by a state vector that encapsulates its dynamic behavior, such as position and speed. The foundation of dynamic state estimation is rooted in Bayesian approaches, allowing continuous updates of the target's position based on the latest sensor data through Bayesian filters. This recursive process involves two key steps: prediction and updating. During the updating phase, the target's position is refined using the current observations and the observation model, while the prediction phase employs the state model to forecast the target's next location. Each video frame undergoes these prediction and updating procedures, using estimation methods such as the Particle filter and the Kalman filter.
The particle filter is a probabilistic filtering technique for state estimation in dynamic systems, utilizing a set of particles that are sequentially updated based on observations and system dynamics. This method effectively represents the posterior distribution of the system state, making it particularly valuable for applications such as object tracking and robot localization. Its ability to handle nonlinearities and uncertainties is crucial, since most tracking problems involve non-linear scenarios. Consequently, the particle filter has emerged as a preferred solution for these complex tracking issues.
The Kalman filter is a mathematical technique for state estimation in linear dynamic systems. It produces accurate estimates of the current state by integrating previous estimates with real-time measurements. The filter operates in two main phases: prediction and update. During the prediction phase, it uses the dynamic model of the system to forecast the current state from prior estimates and system inputs, while also providing an uncertainty estimate. In the update phase, the filter adjusts the state estimate based on the precision of the newly received data; by calculating the Kalman gain, it merges the available information to deliver the most accurate estimate possible. The Kalman filter is used extensively in navigation, robotics, finance, and control systems, owing to its robustness in handling noisy observations and uncertainties, making it an essential tool for many real-time applications.
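To make the two phases concrete, the following minimal sketch implements a constant-velocity Kalman filter with NumPy. The state layout, noise values, and measurements are illustrative assumptions for this sketch, not part of the thesis implementation.

```python
import numpy as np

# State x = [position, velocity]; z is a noisy position measurement (dt = 1).
F = np.array([[1.0, 1.0],   # state transition: position += velocity * dt
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])  # observation model: only the position is measured
Q = np.eye(2) * 1e-2        # process noise covariance
R = np.array([[1e-1]])      # measurement noise covariance

def predict(x, P):
    """Prediction step: propagate the state and its uncertainty with the dynamic model."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def update(x, P, z):
    """Update step: correct the prediction with the new measurement via the Kalman gain."""
    y = z - H @ x                      # innovation
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.array([0.0, 1.0]), np.eye(2)
for z in [np.array([1.1]), np.array([2.0]), np.array([2.9])]:
    x, P = predict(x, P)
    x, P = update(x, P, z)
```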
The integration of convolutional neural networks with the Kalman filter has significantly enhanced pedestrian detection within Multiple Object Tracking (MOT). The Simple Online and Realtime Tracking (SORT) method uses Faster R-CNN detections and employs a straightforward approach that predicts object motion with the Kalman filter while associating these detections across frames.
The Hungarian algorithm operates on a cost matrix built from intersection-over-union (IoU) distances to efficiently match detections with the predicted object states. This combination proved highly effective, earning SORT the title of top open-source algorithm on the MOT15 dataset.
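A minimal sketch of this association step is shown below, assuming SciPy's linear_sum_assignment as the Hungarian solver and an illustrative IoU threshold; it is not the exact SORT implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Match predicted track boxes to detections by minimizing the (1 - IoU) cost."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_threshold]     # discard weak matches

tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]
detections = [[21, 19, 31, 29], [1, 1, 11, 11]]
print(associate(tracks, detections))  # e.g. [(0, 1), (1, 0)]
```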
Despite its success, the SORT algorithm faces significant challenges, particularly identity switches that can lead to incorrect track assignments and fragmentation in crowded or occluded environments. Its reliance on a linear motion model limits its effectiveness for complex object movements, resulting in poor performance when objects change direction abruptly. Additionally, maintaining track continuity during occlusions proves difficult, since distinguishing between occluded objects and new entrants to the scene is problematic. These limitations highlight the need for further research and improved algorithms that can handle the complexities of object tracking in diverse and dynamic settings.
Figure 2.18: Diagram of SORT algorithm [30]
Learning-based approaches detect and track specific targets by utilizing features and appearances learned from previous frames. During testing, these methods apply the acquired knowledge to identify and follow objects in subsequent frames. Learning-based techniques can be classified into three categories: generative, discriminative, and reinforcement learning.
Discriminative trackers treat tracking as a classification problem, aiming to differentiate the target from the background. This approach is divided into shallow learning, which relies on handcrafted features and traditional classifiers, and deep learning, which employs deep neural networks to automatically learn discriminative features for precise target predictions. Deep learning algorithms excel at extracting high-level representations from raw data, leading to impressive results in various tracking applications. By leveraging discriminative learning techniques, trackers can significantly improve their ability to distinguish targets from backgrounds.
Shallow learning in object tracking involves traditional methods that depend on handcrafted features and standard classifiers to execute tracking tasks. This technique extracts various features from the target and its background, including color, texture, and shape, which are then used to train classifiers such as support vector machines (SVM) or random forests. Shallow learning-based trackers require human feature engineering and classifier selection, which can be time-consuming and may limit performance in complex scenarios, but they remain effective in applications where the target's appearance and motion are relatively stable. However, with the advancement of deep learning, shallow learning approaches are gradually being succeeded by more powerful and adaptive algorithms capable of automatically learning discriminative features and capturing complex object dynamics [31].
In tracking applications, support vector machines (SVMs) can be used to generate a confidence map of the search region. This involves defining the object region, the search region, and the context region, which are essential for accurate tracking. A linear SVM can be illustrated with filled circles and rectangles representing the support vectors that guide the tracking process.
Person re-identification
Person re-identification (ReID) is a major challenge in computer vision, aiming to match individuals across non-overlapping camera views. Traditional methods relied on manually crafted features and distance metrics, struggling to adapt to variations in appearance, pose, and lighting. The advent of deep learning, particularly Convolutional Neural Networks (CNNs), has transformed person ReID, enabling more accurate and robust matching of individuals across different camera angles. Deep learning models are proficient at automatically extracting high-level representations from image data, enabling more discriminative features and improved ReID performance.
Re-identification involves identifying and linking an individual of interest across multiple cameras using their visual representations from various images or videos. This process tackles challenges such as varying illumination, different perspectives, occlusions, and changes in appearance to ensure accurate matches. ReID plays a vital role in applications such as video surveillance, public safety, and multi-camera tracking systems. The continued development of robust ReID algorithms is crucial for improving automated video analysis and understanding in complex real-world situations.
As deep learning advances in person re-identification, researchers are exploring various techniques to improve matching accuracy across cameras. The two primary approaches are transformer-based and CNN-based methods. Transformer-based techniques leverage attention mechanisms to capture long-range dependencies in feature representations, improving spatial and temporal interaction modeling. CNN-based methods, in contrast, use convolutional neural networks to extract distinctive information from images, focusing on the physical and structural features of individuals. Although both methodologies aim to boost ReID precision, they employ different architectures and strategies to address challenges such as varying lighting conditions, viewing angles, and occlusions.
Before the rise of deep learning in the ReID research community, handcrafted algorithms focused on learning local features. CNN-based techniques have since taken precedence, leading to significant advances in the field. These methods leverage CNNs to extract distinctive features from images of individuals, enabling accurate matching and identification.
Recent work in ReID has therefore seen a surge in CNN-based approaches, which excel at extracting intricate visual patterns and creating hierarchical representations from the input data, and which have become a common choice for systems working with diverse camera setups.
ResNet (Residual Network) is a CNN-based architecture that has gained prominence in ReID due to its residual connections, which allow information to flow across many layers without loss. This design addresses the difficulty of training very deep networks, mitigating the vanishing gradient problem and enabling networks with hundreds of layers. ResNet excels at extracting discriminative features from person images, capturing both low-level and high-level visual cues that are crucial for accurate person matching, and it encodes complex patterns and variations in appearance. Variants such as ResNet-50 and ResNet-101 are widely used in ReID tasks, typically pre-trained on large-scale image classification datasets like ImageNet and fine-tuned on person re-identification datasets. Despite its effectiveness, ResNet struggles to optimize the learned embeddings for the similarity measures essential to person re-identification, an issue addressed by Siamese networks. Siamese networks learn similarity directly from image pairs, yielding more discriminative embeddings that account for natural variations in appearance and pose, and they benefit from transfer learning with pre-trained models such as ResNet.
A Siamese network can leverage ResNet's feature extraction capabilities by using a pre-trained ResNet as its backbone while it learns a similarity metric. By combining ResNet's ability to capture fine-grained details with the representation learning of a Siamese network, this integration significantly improves both performance and generalization.
HRNet (High-Resolution Network) is a deep learning architecture designed to balance spatial resolution and receptive field size in computer vision tasks. It excels at processing high-resolution inputs, capturing intricate details while retaining a comprehensive global context. HRNet is widely used in applications including person re-identification, semantic segmentation, and object recognition. Its use of parallel branches with different spatial resolutions preserves high-resolution representations throughout the network. By integrating multi-scale features at every level, HRNet maintains high-resolution information, in contrast to traditional models that downsample feature maps early on. Consequently, HRNet is particularly effective for tasks requiring precise spatial information, capturing both local details and global context.
HRNet maintains high-resolution representations throughout the whole process. The network starts with a high-resolution convolution stream and progressively adds high-to-low resolution convolution streams in parallel. The design has four stages, with each stage comprising multiple streams corresponding to different resolutions. Through repeated multi-resolution fusions, the network exchanges information across these parallel streams, improving overall performance and adaptability.
HRNet thus produces high-resolution representations that are both semantically strong and spatially precise. This is achieved by connecting high-to-low resolution convolution streams in parallel, so that high resolution is maintained rather than recovered from low resolution. Unlike traditional fusion methods that combine upsampled low-resolution high-level representations with high-resolution low-level ones, HRNet continuously fuses multiple resolutions to strengthen both the low- and high-resolution representations. As a result, the learned representations are significantly stronger semantically.
HRNet serves as a strong backbone architecture in ReID, matching individuals across camera views by extracting distinctive appearance features. Its ability to capture fine-grained details while maintaining high-resolution information has made it a popular choice in ReID research. A key advantage of HRNet for person re-identification is its capacity to capture fine-grained spatial information through multi-resolution branches, enabling the extraction of detailed features at multiple scales, such as clothing patterns, body parts, and accessories. This is essential for distinguishing people who look alike and for handling difficult situations where small details matter.
2.3.1.3 Challenges of CNN-based approach
CNN-based methods for person re-identification face challenges in processing information beyond local neighborhoods, often losing important details during convolution and downsampling. These limitations restrict CNN models' ability to capture the contextual information and fine-grained features essential for accurate ReID.
Transformer-based methods address the limitations of CNN-based approaches in ReID by leveraging the attention mechanism and the ability to model long-range dependencies. This framework aims to improve the accuracy and robustness of ReID, providing an alternative to traditional CNN methods.
TransReID is a transformer-based object ReID framework designed to improve the robustness of feature representations. The framework establishes a strong baseline through adjustments to a pure transformer architecture. To strengthen feature learning and long-range relationships, a Jigsaw Patches Module (JPM) is introduced, which reorganizes patch embeddings via shift and shuffle operations. This module operates alongside a global branch without these operations, enabling the network to extract robust, perturbation-invariant features while maintaining a global context. Additionally, a Side Information Embedding (SIE) is incorporated to further refine the learning of robust features.
To mitigate data bias from cameras and viewpoints, a unified framework is introduced that effectively incorporates non-visual cues through learnable embeddings, replacing the complex designs typically employed in CNN-based methods.
SYSTEM DESIGN
System requirements specification
To build a system that supports cross-camera person tracking, the researcher established the following specifications:
• Take a video as the system's input and analyze it to obtain tracking results, with an IDF1 score over 90%.
• Allow users to retrieve a person given a color condition, based on the raw tracking results, with an AUC-PR score over 80%.
• Provide a user-friendly, interactive graphic user interface.
The system's architecture consists of several key components: preprocessing, object detection, person re-identification, single-camera tracking, multi-camera matching, storage, person searching, and a graphic user interface. Each of these blocks plays a crucial role in processing and analyzing data, as illustrated in the block diagram in Figure 3.1.
(1) Preprocessing block: Takes the surveillance system's videos as input, then extracts all of the frames from those videos and compresses the image quality.
(2) Object detection block: Processes the output of the preprocessing stage to identify and locate individuals within each image. It generates bounding boxes around detected people, which are then provided to the person re-identification block for further analysis.
(3) Person re-identification block: Uses the bounding boxes generated by the object detection block to extract distinctive features that differentiate individuals. The result is a set of unique features associated with each bounding box, enabling effective identification of people.
(4) Single-camera tracking block: The tracking algorithm assigns a unique identification to each person appearing in the surveillance system after collecting the unique features for each bounding box. The tracking results within a single camera are the output.
(5) Multi-camera matching block: Integrates the tracking results of the individual cameras, produced by the single-camera tracking block, into a comprehensive system result. This output is subsequently forwarded to the graphic user interface block for further processing.
(6) Storage block: Used to store the cross-camera tracking results.
(7) Color-based searching block: Contains the algorithms that find people by color in the system, based on the tracking results.
(8) Graphic user interface block: Processes the system's final output and makes it easier for users to work with through a graphic user interface.
Figure 3.1: Overall block diagram of the system
This system streamlines the process of tracking individuals in videos, saving users significant time and effort. By analyzing input videos, it generates detailed tracking information for each person. Additionally, to enhance search capabilities, the system includes a feature that enables users to filter and retrieve individuals based on color.
Preprocessing
The preprocessing block extracts the surveillance footage into frames and compresses them to optimize storage and speed up the system while maintaining performance, as illustrated in Table 3.1.
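A minimal sketch of this preprocessing step is shown below, assuming OpenCV is used to read the videos and write JPEG frames; the function name and paths are illustrative, and the quality value of 50 mirrors the 50% setting used in this project.

```python
import cv2
from pathlib import Path

def extract_and_compress(video_path: str, out_dir: str, jpeg_quality: int = 50) -> int:
    """Read every frame of a surveillance video and save it as a JPEG at reduced quality."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(f"{out_dir}/{index:06d}.jpg", frame,
                    [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
        index += 1
    cap.release()
    return index  # number of frames written

# extract_and_compress("camera_01.mp4", "frames/camera_01", jpeg_quality=50)
```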
Figure 3.2: Several samples in the validation dataset of AI CITY Challenge 2023
The dataset for this project is divided into training and validation sets, featuring 59 cameras in the training dataset and 28 in the validation dataset. A total of 18,010 images were collected from each camera, resulting in an overall dataset of 1,566,870 images. When image quality is preserved at 100%, each image requires between 600 and 700 KB of storage; if the quality is reduced to 50%, the requirement drops to 180 to 200 KB per image. Table 3.1 provides a detailed breakdown of the storage space required for the entire dataset under these settings.
Non-compressed images: ~1018 GB of storage, with scores of 0.932, 94.43, 93.03, and 88.69 (Tracking, HOTA↑).

Table 3.1: Storage space and tracking accuracy comparison, noting that an uncompressed image requires 700 KB of space and a compressed image 200 KB of space.
Object detection
The proposed cross-camera tracking system prioritizes accuracy over speed, motivating extensive evaluations to identify the most effective object detection model. This study benchmarks recent models in the YOLO family, specifically YOLOv5, YOLOv6, YOLOv7, and YOLOv8, using the validation dataset from the AI CITY Challenge 2023 (samples shown in Figure 3.2).
Table 3.2: Comparison of three YOLO versions in the detection task from another project [44]
Variations of the YOLO model, including YOLOv5, YOLOv6, YOLOv7, and YOLOv8, offer different configurations such as YOLOv5n, YOLOv7-tiny, and YOLOv8s. The speed and accuracy of these models are influenced by the number of parameters: fewer parameters result in faster performance but lower accuracy, while more parameters enhance accuracy at the cost of speed. This study focuses on the models that prioritize accuracy, specifically those with a higher number of parameters, including YOLOv5x6, YOLOv6l, YOLOv7x, and YOLOv8x. A comparison of their accuracy (based on the AP50-95 metric) and speed (based on the FPS score of the detection task in the research [44]) is presented in Table 3.2.
Among the four YOLO versions, the largest variants, YOLOv5x6, YOLOv6l, YOLOv7x, and YOLOv8x, demonstrate superior accuracy compared to their smaller counterparts. Consequently, the project used these four larger variants for experiments on the validation dataset, with results presented in Table 3.3 and Fig 3.5. The findings indicate that YOLOv5x6, YOLOv6l, and YOLOv7x perform comparably, achieving accuracy scores of 92.32, 92.36, and 92.43, respectively.
YOLOv8x demonstrates the best performance among the tested models, achieving 93.28 AP50-95 while keeping a relatively low parameter count of 68.2 million, which benefits both speed and accuracy. Consequently, the researcher chose YOLOv8 to implement the object detection block in this study.
Model       Params     Input size    AP50-95↑ (%)
YOLOv5x6    140.7 M    1280          92.32
YOLOv6l     59.6 M     1280          92.36
YOLOv7x     70.8 M     1280          92.43
YOLOv8x     68.2 M     1280          93.28

Table 3.3: Comparison of YOLOv5x6, YOLOv6l, YOLOv7x and YOLOv8x on the validation dataset
The author trained the four YOLO variants, YOLOv5x6, YOLOv6l, YOLOv7x, and YOLOv8x, starting from models pre-trained on the COCO dataset, to address the significant domain gap between synthetic datasets and real-world imagery. By fine-tuning these models on the synthetic dataset, the researcher adapted them to the synthetic domain, improving their effectiveness for person detection. Several data augmentation techniques, detailed in Table 3.4, were also employed to further improve model performance.
Mix-up    0.15    Mix-up (probability)

Table 3.4: Data augmentation for training the model
YOLOv8 was chosen for the object detection block due to its high accuracy, outperforming YOLOv5. Key improvements include modifications to the model's backbone and head, such as the C2f module, which replaces C3 and combines the outputs of multiple bottlenecks. Additionally, the initial 6x6 convolution in the stem was changed to a 3x3 convolution. YOLOv8 also uses an anchor-free mechanism in its head, eliminating the need for anchor boxes and streamlining the Non-Maximum Suppression (NMS) step, ultimately leading to faster and more efficient detection.
Figure 3.3: YOLOv8 head with two branches including box branch (upper) and class branch (lower)
The YOLOv8 training process utilizes a loss function that comprises three key components: CIoU Loss, Distribution Focal Loss (DFL) for bounding box parameter prediction, and Binary Cross-Entropy Loss for object label prediction.
\( L_{total} = L_{CIoU} + \alpha L_{DFL} + \beta L_{BCE} \)   (3.1)
where α and β are constants that balance the importance of the loss terms and are chosen experimentally; currently, \( \alpha = \beta = 1 \) gives the best results.
In addition, the CIoU loss is calculated according to Equation 3.2:
\( L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b,\, b^{gt})}{c^{2}} + \gamma v \)   (3.2)
which optimizes the intersection over union (IoU) between a bounding box and its corresponding ground-truth box. Here, \( b \) and \( b^{gt} \) denote the center coordinates of the bounding box and the ground-truth box, respectively, and \( \rho(\cdot) \) is the Euclidean distance. The variable \( c \) is the diagonal length of the smallest box enclosing both the bounding box and the ground-truth box. Additionally, \( v \) measures the consistency of the aspect ratio between the two boxes, as defined in Equation 3.3, and the coefficient \( \gamma \) balances the IoU loss and the aspect ratio consistency \( v \), with its value determined by Equation 3.4 [46].
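For reference, the standard CIoU definitions of the aspect-ratio term and its weight, assumed here to correspond to Equations 3.3 and 3.4, are:

\[ v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2} \quad (3.3) \qquad \gamma = \frac{v}{(1 - IoU) + v} \quad (3.4) \]

where \( w, h \) and \( w^{gt}, h^{gt} \) are the widths and heights of the predicted and ground-truth boxes.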
The CIoU loss function aims to align the predicted bounding box with the ground truth, while the DFL loss function encourages the network to learn a General Distribution for the bounding box, incorporating the uncertainty in object boundaries caused by potential labeling errors. DFL uses the distances from the box center to its four edges as the regression targets. For instance, if the distance from the center to the left edge is represented as \( y_b \in [y_{b_0}, y_{b_n}] \), the DFL network discretizes \( y_b \) into a series of equally spaced points \( y_{b_0}, y_{b_1}, \ldots, y_{b_n} \), corresponding to the number of discrete values predicted by the CNN. In Figure 3.3, the box branch therefore produces 4 × reg_max outputs, one set of reg_max values for each of the four distances from the center to the edges. Using the property of a discrete distribution, where the probabilities sum to one \( (\sum_{i=0}^{n} P(y_{b_i}) = 1) \), the predicted value \( \hat{y}_b \) is determined using Equation 3.5 [46]:
\( \hat{y}_b = \sum_{i=0}^{n} P(y_{b_i})\, y_{b_i} \)   (3.5)
The output of the box branch in Figure 3.3 represents the discrete values of the distribution \( P(y_{b_i}) \). To ensure that the probabilities sum to one \( (\sum_{i=0}^{n} P(y_{b_i}) = 1) \), the softmax function is applied, producing outputs \( S_i \) that correspond to \( P(y_{b_i}) \). Although there is some uncertainty, the predicted bounding box value should lie close to the ground-truth value. Consequently, DFL emphasizes values near the ground-truth \( y_b \) by increasing the probabilities \( S_i \) and \( S_{i+1} \) of the adjacent left value \( y_{b_i} \) and the adjacent right value \( y_{b_{i+1}} \). From there, the formula of DFL is shown in Equation 3.6 [46].
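A standard form of the Distribution Focal Loss, consistent with the description above and assumed here to match Equation 3.6, is:

\[ L_{DFL}(S_i, S_{i+1}) = -\big((y_{b_{i+1}} - y_b)\log S_i + (y_b - y_{b_i})\log S_{i+1}\big) \quad (3.6) \]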
Figure 3.4: The demonstration of general distribution using DFL
The Binary Cross-Entropy loss is calculated according to Equation 3.7.
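A standard binary cross-entropy form, assumed here to correspond to Equation 3.7, where \( y_i \) is the target label and \( \hat{y}_i \) the predicted probability over N predictions, is:

\[ L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\big(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big) \quad (3.7) \]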
In addition, YOLOv8 uses Task-aligned One-stage Object Detection (TOOD) [47] to create an information connection between the classification task and the box regression task. The model's head consists of two branches: one predicting bounding boxes and another producing object labels. TOOD introduces a variable \( t = c^{\alpha} \times IoU^{\beta} \) to assess the alignment between these two tasks, where \( c \) is the classification output and \( IoU \) denotes the Intersection over Union score from the box branch; α and β are balancing constants determined experimentally. This \( t \) value is then normalized to \( \hat{t} \) so that its maximum aligns with the highest IoU value in each image, and N denotes the number of classes involved.
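As a rough illustration of this alignment metric and its normalization, one common form consistent with the description above (the exact normalization in the original is assumed, not reproduced) is:

\[ t = c^{\alpha} \times IoU^{\beta}, \qquad \hat{t} = \frac{t}{\max(t)} \cdot \max(IoU) \]

where the maxima are taken over the candidate predictions associated with each image (or ground-truth object).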
Figure 3.5: The detection results of YOLOv8 on the benchmark dataset
Person re-identification
Re-identification plays a vital role in multi-camera tracking, focusing on obtaining reliable and distinctive appearance features for each individual. The researcher investigates two primary deep feature extraction approaches: transformer-based and CNN-based models. While CNN-based models are popular for person re-identification due to their proficiency in extracting discriminative features and capturing local patterns, they often struggle to model long-range dependencies and fine-grained details. This limitation can reduce their effectiveness in complex situations where individuals' appearances vary significantly across camera views.
Transformer-based models have become a promising solution for person re-identification, utilizing self-attention mechanisms to capture global dependencies and effectively model long-range interactions.
Transformer-based models excel at analyzing image regions by modeling the relationships between different parts of an image, making them adept at handling variations in pose, viewpoint, and occlusion. Their advantage is particularly evident in complex backgrounds and crowded scenes. However, these models generally demand more computational resources and memory than CNN-based models, which can lead to high training costs and limit their deployment in resource-constrained environments.
The researcher compared a transformer-based method, TransReID, with CNN-based models including ResNet and HRNet. This study used the synthetic validation dataset and the mAP metric to determine the better model between ResNet and HRNet, with results presented in Table 3.6. Additionally, Table 3.5 shows that the TransReID model significantly outperforms three other methods on the Market-1501-C dataset.
Table 3.5: Comparison of person re-identification methods on the public Market-1501-C dataset [48]
In the training and feature extraction stages, the input image was resized to 256 × 128 pixels. Several data augmentations were applied, including random horizontal flips, random erasing, and random padding. During feature extraction, a global feature with a dimension of 2048 was generated before batch normalization and used as the final output for the input image.
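A minimal sketch of this preprocessing and augmentation pipeline with torchvision is shown below; the probability values and the pad-then-crop realization of "random padding" are illustrative assumptions, not the exact training configuration.

```python
import torchvision.transforms as T

# Training-time pipeline for 256 x 128 person crops (values are illustrative).
train_transform = T.Compose([
    T.Resize((256, 128)),              # resize person crops to 256 x 128
    T.RandomHorizontalFlip(p=0.5),     # random horizontal flip
    T.Pad(10),                         # "random padding" realized as pad + random crop
    T.RandomCrop((256, 128)),
    T.ToTensor(),
    T.RandomErasing(p=0.5),            # random erasing on the tensor image
])

# Evaluation-time pipeline: resize only.
eval_transform = T.Compose([T.Resize((256, 128)), T.ToTensor()])
```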
The experimental results in Table 3.6 show that the transformer-based model TransReID outperformed the CNN-based model HRNet, with a 1.76% higher mAP score. To improve performance further, the features of HRNet and TransReID are combined into a new feature with a dimension of 4096, which yields a score of 96.12%. Consequently, to replace the wide neural network in DeepSORT, the system's feature extractor uses the ensemble of features derived from both TransReID and HRNet.
Model                 mAP↑ (%)
TransReID + HRNet     96.12

Table 3.6: The comparison of feature extractors for the ReID task
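A minimal sketch of the feature ensemble described above, assuming each backbone yields a 2048-dimensional global feature per person crop and that the vectors are L2-normalized before and after concatenation (the normalization choice is an assumption):

```python
import torch
import torch.nn.functional as F

def ensemble_features(f_transreid: torch.Tensor, f_hrnet: torch.Tensor) -> torch.Tensor:
    """Concatenate per-person features from the two backbones into one 4096-d descriptor."""
    f_transreid = F.normalize(f_transreid, dim=1)   # (N, 2048)
    f_hrnet = F.normalize(f_hrnet, dim=1)           # (N, 2048)
    return F.normalize(torch.cat([f_transreid, f_hrnet], dim=1), dim=1)

fused = ensemble_features(torch.randn(4, 2048), torch.randn(4, 2048))
print(fused.shape)  # torch.Size([4, 4096])
```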
Figure 3.6: The custom ReID dataset to train person re-identification model
Single-camera tracking
Person tracking aims to maintain an individual's identity across multiple video frames by linking detected bounding boxes of people, thereby creating a trajectory for each tracked individual. To ensure accurate and reliable tracking, the system must handle challenges such as occlusions, appearance variations, positional changes, and camera movements. Performance comparisons of several methods, including EAMTT, POI, SORT, and DeepSORT, show that POI achieves the highest MOTA score, indicating superior accuracy in ID assignment, while SORT excels in MOTP score, demonstrating strong localization reliability. DeepSORT has the lowest IDSW score of 781, reflecting the stability of its ID assignments. Ultimately, effective cross-camera tracking depends on consistently assigning unique IDs to individuals, even as they move between different camera views.
In this study, the performance of the DeepSORT tracker, whose MOTA and MOTP scores were 4.7% and 0.5% lower than those of POI and SORT, respectively, was enhanced by substituting its feature extractor, the "Deep Appearance Descriptor," with the features produced by the person re-identification block.
The proposed system utilizes a Gaussian Mixture Model (GMM) to detect and separate ID-switching tracklets in person re-identification scenarios. This approach addresses the challenges posed by individuals walking in unpredictable patterns, such as moving in straight lines, pausing, turning, or passing closely by one another.
Method           MOTA↑    MOTP↑    IDSW↓
DeepSORT [35]    61.4     79.1     781

Table 3.7: Tracking results on the MOT16 [54] challenge
DeepSORT is a highly effective tracking algorithm known for its performance in multi-object tracking, particularly in challenging scenarios such as crowded environments and occlusions. By integrating the SORT algorithm with deep learning-based appearance features, DeepSORT addresses these tracking difficulties. Its appearance network is a wide residual network comprising two convolutional layers followed by six residual blocks, which computes a global feature of dimensionality 128. A final batch and ℓ2 normalization make the features compatible with the cosine appearance metric.
Because DeepSORT was designed for real-time applications, its wide neural network extracts weaker, less distinctive features of individuals than the HRNet and TransReID models used in the person re-identification block. Consequently, the proposed system uses the features obtained from the person re-identification block for appearance association instead of the wide neural network, as demonstrated by the results in Table 4.1.
Figure 3.8: Overview of the CNN architecture [35]
3.5.2 ID-switching detection and ID-switching splitting
The study employed identity-switch detection and identity-switch splitting mechanisms based on Gaussian Mixture Models (GMM) to reduce the number of identity switches. These techniques identify tracklets containing identity switches and separate such tracklets into two distinct segments corresponding to the respective individuals.
A tracklet \( \mathcal{T} \) is defined as \( \mathcal{T} = \{(d_1, f_1), (d_2, f_2), \ldots, (d_k, f_k)\} \), consisting of k detections, where \( f_i \) is the feature vector of the i-th detection. Each tracklet should correspond to a single identity, so its feature distribution is expected to follow a Gaussian model with mean \( \mu \) and covariance \( \Sigma \) that captures the tracklet's overall characteristics. The proposed system first applies ID-switching detection to find instances of ID-switching within a tracklet, and then applies ID-switching splitting to divide the tracklet into two tracklets by modeling its feature set with a Gaussian Mixture Model (GMM). Specifically, the feature set is fitted as a GMM via Expectation Maximization, which optimizes the likelihood outlined in Equation 3.8, where \( \theta = (\pi, \mu, \Sigma) \) are the mixture model parameters, \( \mathcal{N}(f_i \mid \mu_j, \Sigma_j) \) is the Gaussian probability density function with mean \( \mu_j \) and covariance matrix \( \Sigma_j \), and \( \pi_j \) is the mixture coefficient of the j-th Gaussian component.
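The standard GMM log-likelihood over the tracklet features, assumed here to be the objective referred to as Equation 3.8 and optimized by Expectation Maximization, is:

\[ \log L(\theta) = \sum_{i=1}^{k} \log \sum_{j=1}^{2} \pi_j\, \mathcal{N}(f_i \mid \mu_j, \Sigma_j) \quad (3.8) \]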
The proposed system uses two Gaussian distributions to detect identity switches, with a predefined threshold of 0.4. By computing the mean of each distribution and measuring the Euclidean distance between the two means, the system identifies when an identity switch has occurred. This improves tracking precision by ensuring consistent identification of objects in multi-object tracking scenarios.
To resolve identity switches and split the tracklet, the GMM provides two Gaussian components, \( \theta_1 = (\pi_1, \mu_1, \Sigma_1) \) and \( \theta_2 = (\pi_2, \mu_2, \Sigma_2) \). The procedure finds the longest subsequence of consecutive bounding boxes, starting from the first box assigned to the higher-weighted Gaussian component. The boxes in this subsequence are assigned to the first individual, while the remaining boxes are allocated to the second individual.
Specifically, if the weight of \( \theta_1 \) (represented by \( \pi_1 \)) is greater than the weight of \( \theta_2 \), the start and end indices \( s \) and \( e \) of the longest subsequence of consecutive bounding boxes clustered to \( \theta_1 \) are determined. The bounding boxes in \( \{(d_s, f_s), (d_{s+1}, f_{s+1}), \ldots, (d_e, f_e)\} \) are assigned to the first individual, while the remaining boxes are allocated to the second individual. The same splitting strategy is applied, with the roles of the components swapped, when \( \pi_1 \) is less than or equal to \( \pi_2 \), effectively minimizing identity switches throughout the tracklet.
This method significantly reduces identity switches and enhances the tracking system's performance. By accurately assigning bounding boxes to individuals according to the weights of the Gaussian distributions, the tracking algorithm maintains consistency and accuracy in identifying and tracking individuals, improving the overall quality of multi-object tracking.
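The sketch below illustrates both mechanisms with scikit-learn's GaussianMixture. The diagonal covariance type and the interpretation of the 0.4 threshold as a Euclidean distance between component means in the (possibly normalized) feature space are assumptions made for this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def detect_and_split(features: np.ndarray, threshold: float = 0.4):
    """GMM-based ID-switch handling for one tracklet.

    features: (k, d) array, one appearance feature per detection.
    Returns two index lists (person 1, person 2); the second is empty if no switch.
    """
    gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
    labels = gmm.fit_predict(features)

    # Detection: a large distance between the two component means signals an ID switch.
    if np.linalg.norm(gmm.means_[0] - gmm.means_[1]) < threshold:
        return list(range(len(features))), []

    # Splitting: take the run of consecutive boxes assigned to the higher-weight
    # component, starting from its first occurrence.
    major = int(np.argmax(gmm.weights_))
    start = int(np.argmax(labels == major))   # first index of the major component
    end = start
    while end + 1 < len(labels) and labels[end + 1] == major:
        end += 1
    first = list(range(start, end + 1))
    second = [i for i in range(len(labels)) if i not in first]
    return first, second
```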
3.5.3 Post-check of matching algorithm
To improve tracking consistency, especially when individuals move out of the camera's view and later reappear, the proposed system adds a single-camera matching step after the DeepSORT matching process. This post-check stage improves tracking performance by clustering tracklets according to their appearance features.
To associate individuals captured by a camera, the proposed system uses agglomerative clustering to group tracklets that correspond to the same person based on their similarity. Applying the function ƒ(∙) to a tracklet extracts the average feature vector over all of its features, and the similarity between two tracklets \( t_i \) and \( t_j \) is then assessed using the cosine distance.
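A standard cosine-distance form consistent with this description, computed on the averaged tracklet features, is:

\[ d(t_i, t_j) = 1 - \frac{f(t_i) \cdot f(t_j)}{\lVert f(t_i) \rVert\, \lVert f(t_j) \rVert} \]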
To cluster the tracklets and identify distinct individuals, a single-camera distance matrix is built from all pairs of tracklets produced by DeepSORT. This matrix quantifies the similarity between tracklets through the angle between their feature vectors. Using this distance matrix, agglomerative clustering iteratively groups similar tracklets, ultimately forming clusters that represent individual persons. The final result is a collection of clusters, each representing a unique individual in the video.
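A minimal sketch of this clustering step with scikit-learn is shown below; the distance threshold value is illustrative, not the thesis setting.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering

def cluster_tracklets(tracklet_features, distance_threshold: float = 0.3):
    """Group tracklets of the same person via cosine distances between their average features.

    tracklet_features: list of (k_i, d) arrays, one per tracklet.
    Returns a cluster id per tracklet, interpreted as a person id within the camera.
    """
    means = np.stack([f.mean(axis=0) for f in tracklet_features])   # f(t_i)
    dist = cdist(means, means, metric="cosine")                     # pairwise cosine distance
    clusterer = AgglomerativeClustering(
        n_clusters=None,
        metric="precomputed",        # 'affinity' in scikit-learn versions before 1.2
        linkage="average",
        distance_threshold=distance_threshold,
    )
    return clusterer.fit_predict(dist)
```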
Multi-camera matching
Multi-camera tracklet matching links the identities of individuals recorded by different cameras. In this approach, each cluster produced within a camera acts as a representation of a person's identity. The method employs agglomerative clustering together with cluster feature calculations to improve the accuracy of identity associations across camera feeds.
The clustering method allows the system to assess the relationship between two clusters by measuring the distance between them, as defined in Equation 3.10. Using these distance calculations, tracklets can be matched across multiple cameras, ensuring consistent identities for individuals captured from different viewpoints.
where \( \mathbb{C}_v \) and \( \mathbb{C}_u \) are two separate clusters, and the function ƒ(∙) applied to a cluster extracts the average feature vector across all of its features.
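Assuming Equation 3.10 mirrors the single-camera cosine distance but operates on cluster-level average features, it can be written as:

\[ d(\mathbb{C}_v, \mathbb{C}_u) = 1 - \frac{f(\mathbb{C}_v) \cdot f(\mathbb{C}_u)}{\lVert f(\mathbb{C}_v) \rVert\, \lVert f(\mathbb{C}_u) \rVert} \quad (3.10) \]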
Figure 3.9: Several samples in the tiny colorful person dataset and annotation (a) The annotation according to its color; (b) Several samples in the tiny colorful person dataset
Color-based person searching
In order to evaluate the effectiveness of the color-based person searching module, the Tiny Colorful Person (TCP) dataset was created. It consists of 150 images, each annotated with the colors of the person's shirt and pants. The images are cropped from the synthetic validation dataset described in Section 4.2. Each sample is assigned a two-color annotation together with a corresponding numerical code for evaluation, as demonstrated in the provided figures.
The color-based person identification function is built on KMeans clustering and the Lab color space. The overall design of the module is illustrated in Fig 3.10. The implementation uses methods from the scikit-image library, specifically rgb2lab and deltaE_cie76.
KMeans clustering is applied to the RGB input image by fitting the model to the image pixels and predicting a cluster assignment for each pixel. This produces a segmentation of the image by color similarity, as illustrated in Fig 3.11.
Figure 3.10: Block diagram of color-based searching person module
To compare colors accurately, hex or RGB values must be converted into a perceptually uniform color space. This is done with the rgb2lab function, which transforms the values into the Lab color space in the "convert image color" block. The Lab color space provides a perceptually uniform representation of colors, allowing for meaningful color comparisons.
In the Lab color space, the deltaE_cie76 method is used to calculate the color difference between a user-selected color and the top k colors in an image. This method employs the CIE76 formula to assess perceptual color similarity. A potential match is indicated if any of the top k colors has a difference below the specified threshold of 0.5. Ultimately, the module outputs n images that contain the color selected by the user.
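A minimal sketch of this module using scikit-learn's KMeans together with scikit-image's rgb2lab and deltaE_cie76 is shown below. The helper names and the delta-E threshold of 20 are illustrative assumptions; the thesis applies its own threshold of 0.5 on its scale.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.color import rgb2lab, deltaE_cie76

def dominant_colors(rgb_image: np.ndarray, k: int = 3) -> np.ndarray:
    """Cluster the pixels with KMeans and return the k dominant RGB cluster centers."""
    pixels = rgb_image.reshape(-1, 3).astype(float)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    return kmeans.cluster_centers_                       # (k, 3), values in [0, 255]

def matches_color(rgb_image: np.ndarray, query_rgb, k: int = 3, threshold: float = 20.0) -> bool:
    """Compare the query color with the top-k colors of the crop in Lab space (CIE76)."""
    centers = dominant_colors(rgb_image, k)
    centers_lab = rgb2lab((centers / 255.0).reshape(1, -1, 3)).reshape(-1, 3)
    query_lab = rgb2lab(np.array(query_rgb, dtype=float).reshape(1, 1, 3) / 255.0).reshape(3)
    diffs = deltaE_cie76(query_lab, centers_lab)         # perceptual differences
    return bool(np.min(diffs) < threshold)

# person_crop = ...  # (H, W, 3) uint8 RGB crop of a tracked person
# matches_color(person_crop, query_rgb=(200, 30, 30))  # is the person wearing red?
```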
Figure 3.11: Cluster three colors in a person image using KMeans.
RESULT
Environments
The proposed system is developed and evaluated in the environment of the AI City Challenge 2023, track 1, which focuses on multi-camera multi-target person tracking. An RTX 3090 GPU serves as the core hardware, providing the computational resources required to train and run the deep learning models for person tracking. The software stack is built on Python 3.8, PyTorch 2.0, CUDA 11.6, and Qt Creator 2.16.
Datasets
The dataset used in this challenge, generated with the NVIDIA Omniverse Platform, consists of synthetic data amounting to 2,607,781 frames captured from 129 distinct cameras. Among these, 59 cameras are allocated for training, containing 4,375,736 bounding boxes and 71 unique person IDs, while 28 cameras are designated for validation, containing 1,950,917 bounding boxes and 35 unique person IDs. Although a testing dataset is available, the researcher uses only the validation set to evaluate the system's performance. To optimize storage, the quality of the images extracted from the input videos was reduced to 50% of the original. Furthermore, to ensure consistency across experiments, the extracted images were resized to 1280x1280 pixels. Details are given in Section 3.2.
Evaluation metric
In this study, the researcher assesses the effectiveness of the multi-camera tracking system primarily through the IDF1 score. Two supplementary metrics, MOTA and MOTP, are also used, with detailed evaluations presented in Sections 4.3.1 and 4.3.2.
MOTA, or Multiple Object Tracking Accuracy, evaluates tracking accuracy by accounting for false positives (FP), false negatives (FN), and identity switches (IDSW), as defined in Equation 4.1. The metric counts missed objects (FN) and incorrectly tracked objects (FP), and penalizes identity switches where the algorithm misassigns identities. A lower MOTA score reflects inferior tracking performance, whereas a higher score signifies greater accuracy and reliability of the tracking system.
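The standard CLEAR-MOT definition, assumed here to match Equation 4.1, where \( GT_t \) is the number of ground-truth objects at time t, is:

\[ MOTA = 1 - \frac{\sum_{t}\big(FN_t + FP_t + IDSW_t\big)}{\sum_{t} GT_t} \quad (4.1) \]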
The proposed system utilizes MOTA to assess the stability of an individual's identity across multiple frames, indicating that a higher MOTA score signifies a more consistent identity over time.
Multiple Object Tracking Precision (MOTP) measures the average positional accuracy of tracked objects by assessing the distance between predicted and ground-truth bounding boxes. It is calculated by dividing the total overlap of all predicted bounding boxes with their corresponding ground-truth boxes by the total number of matched bounding-box pairs. A higher MOTP score signifies better localization precision and more accurate tracking, as illustrated by the localization-error example in Fig. 4.1.
Here, $c_t$ denotes the total number of matches found at time $t$, and $d_t^i$ denotes the distance between the ground-truth location of object $i$ and its corresponding detection output.
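The standard MOTP formula corresponding to this description is

$$\mathrm{MOTP} = \frac{\sum_{i,t} d_t^i}{\sum_t c_t}$$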
Figure 4.1: A scenario illustrating why MOTP is essential
Combined with MOTA, MOTP shows whether the bounding box around a person is correctly localized; in this respect the metric plays a role similar to mAP in the evaluation of object detection models.
IDF1, or Identification F1 score, is a key metric for assessing the accuracy of object identification in multiple object tracking tasks. It combines identification precision and recall into a single comprehensive measure. As given in Equation 4.3, IDF1 is the ratio of correctly identified detections to the average number of ground-truth and computed detections, thereby accounting for both false positives (incorrectly identified objects) and false negatives (objects missed during identification).
IDF1 therefore offers a thorough evaluation of identification accuracy, rewarding correct identifications while penalizing false ones. A higher IDF1 score reflects better identification performance, with a perfect score of 1 signifying flawless identification.
Here, IDTP denotes identity true positives, IDFP identity false positives, and IDFN identity false negatives. In the proposed system, the IDF1 score indicates how effective detection and association are: the higher the IDF1 score, the better the detection and association.
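For reference, the standard IDF1 definition that Equation 4.3 refers to, expressed with the quantities above, is

$$\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}$$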
A precision-recall curve is a graphical tool that depicts the trade-off between precision and recall in classification models, and it is particularly useful for evaluating machine learning performance in binary classification tasks. Precision is the ratio of true positive predictions to all predicted positives, reflecting the accuracy of positive predictions. Recall, or sensitivity, is the proportion of true positives among all actual positive instances, indicating the model's effectiveness at identifying positive cases.
Figure 4.2: Theoretical precision-recall curves [59]
The precision-recall curve, illustrated in Fig. 4.2, displays precision on the y-axis and recall on the x-axis. The curve is generated by sweeping the model's classification threshold and computing the corresponding precision and recall at each value; each point on the curve therefore corresponds to a distinct threshold, while the gray dotted line indicates the baseline for comparison.
A baseline classifier that predicts every instance as positive achieves a precision equal to the proportion of positive observations in the dataset. In contrast, an ideal classifier, depicted by the purple line, achieves perfect precision and recall across all thresholds. Most real-world classifiers fall between these two extremes, offering better predictions than the baseline without reaching perfection.
The Area Under the Precision-Recall Curve (AUC-PR) summarizes a model's performance across all thresholds as a single scalar value. Higher AUC-PR values indicate better performance, reflecting both high precision and high recall. An ideal classifier achieves an AUC-PR of 1, while the baseline classifier's AUC-PR depends on the proportion of positive observations, typically equalling 0.5 on balanced binary classification datasets. Classifiers that offer predictive value score between the baseline and the perfect classifier.
Using the curve together with the AUC-PR metric makes it possible to identify the optimal threshold for the color-based person search module and to assess its overall performance.
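A minimal sketch of how the curve and AUC-PR can be computed, assuming scikit-learn; the array names y_true (ground-truth match labels) and scores (the module's per-image match scores) are illustrative and not taken from the author's code.

import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def pr_curve_and_auc(y_true, scores):
    # y_true: 1 if the image truly contains the selected color, 0 otherwise
    # scores: match score produced by the module (higher = more likely a match)
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    auc_pr = auc(recall, precision)   # area under the precision-recall curve
    return precision, recall, thresholds, auc_pr

# Toy usage with made-up labels and scores
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.4, 0.7, 0.8, 0.3, 0.6, 0.55, 0.2])
precision, recall, thresholds, auc_pr = pr_curve_and_auc(y_true, scores)
print(f"AUC-PR = {auc_pr:.2f}")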
Results
The multi-camera matching results, illustrated in Figures 4.3, 4.4, and 4.5, show that the model tracks individuals effectively when their full bodies are visible and even when they are partially occluded. In Fig. 4.3, IDs 4, 6, and 7 are consistently captured by all of the cameras, while ID 1 is visible only in camera c018 and only as a head, yet the system identifies all of these individuals without misidentification. A more challenging case is ID 5: the man in yellow appears as ID 5 in c014 but is incorrectly labeled as ID 2 in c015, where he is only partially visible and occluded. This highlights the complexity of tracking in multi-camera environments; the researcher assessed that the ID switch occurs in the person tracking block, whereas the object detection block performed well because it detected the right person at that position.
Figure 4.3: Tracking results in three cameras c014, c016, and c019
Fig. 4.4 shows an easier scene: no obstacles block the people, and most of them appear with their entire body visible. As a result, even though ID 3 is recorded from a variety of angles, from far away to up close, the tracking results in scenes like this are very good. The same holds for ID 4: although the person is far from camera c076 and partially occluded in c078, the system still confirms that it is the same person.
Figure 4.4: Tracking results in four cameras c076, c077, c078, and c081
Similarly, Fig. 4.5 shows another large location in which people usually appear with their complete body visible. IDs 0, 2, 4, and 5 are tracked effectively across all four cameras, whereas IDs 1 and 3 are not detected in camera c119. This happens when the object detection block fails to recognize individuals in the image, so the person tracking block overlooks them as well; the resulting false negatives lower the MOTA and MOTP scores during system evaluation.
Method    IDF1     MOTA     MOTP
Ours      96.91    94.43    93.07
Table 4.1: Comparison of the proposed tracker and DeepSORT on the synthetic validation dataset
The overall system performance is evaluated on the synthetic validation dataset using the IDF1, MOTA, and MOTP scores, as detailed in Table 4.1. The proposed system outperforms DeepSORT on all three metrics.
This is significant because images of partially occluded persons can now be re-identified more easily by the updated feature extractor.
Figure 4.5: Tracking results in four cameras c118, c119, c122, and c123
The author treats color-based search as a binary classification problem, labeling a result True if it matches the user's selected color and False otherwise. To assess the module's effectiveness and determine the optimal threshold, the author uses the TCP dataset described in Section 3.7.1 together with the precision-recall curve. The module demonstrates strong performance, as illustrated in Fig. 4.8.
Figure 4.6: The precision-recall curve of the color-based person search module at multiple thresholds. The circled point marks the optimal threshold for the module
To determine the optimal threshold, the precision-recall curve was analyzed to find the point that maximizes both precision and recall, indicating the best trade-off. As illustrated in Fig. 4.6, the marked optimal point achieves a precision of approximately 0.81 and a recall of 0.76, with an AUC-PR of 0.78. To identify a threshold that closely matches specific precision and recall values, the author used another plot derived from the precision-recall curve function, shown in Fig. 4.7, which delineates the optimal threshold more clearly.
Figure 4.7: The precision-recall trade-off with the best threshold
Figure 4.8: Results of color-based person search: (a) Yellow; (b) Blue; (c) Green
Figure 4.7 indicates that the optimal threshold exceeds 0.45 and approaches 0.5, achieving a precision of 83.2% and a recall of 88.9%. Consequently, a threshold of 0.5 is selected as the final value for the color-based person search module.
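One way to pick such a threshold from the same curve, assuming that "maximizing both precision and recall" is approximated by maximizing their harmonic mean (F1), is sketched below; this is an illustration rather than the author's exact selection procedure.

import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold(y_true, scores):
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall have one more element than thresholds; drop the final point
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)   # harmonic mean, guarded against division by zero
    i = int(np.argmax(f1))
    return thresholds[i], p[i], r[i]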
CONCLUSION AND FUTURE WORK
CONCLUSION
This study presents a cross-camera tracking system that integrates three deep learning components: YOLOv8 for object detection, HRNet + TransReID for person re-identification, and DeepSORT for object tracking, complemented by an intuitive graphical user interface. By fine-tuning the training process, the research improved YOLOv8's performance, achieving a mean Average Precision at an IoU threshold of 0.5 (mAP50) of 93.28% on the synthetic validation dataset. Additionally, an improved DeepSORT was developed by replacing the original wide residual network feature extractor with an ensemble of TransReID and HRNet, yielding a mAP50 of 96.12% for person re-identification. The system's capabilities, including the color-based person search feature, represent a significant step toward smarter monitoring technology. A detailed analysis of the system's strengths and weaknesses is given in Table 5.1.
FUTURE WORK
To enhance the cross-camera tracking system, future efforts should prioritize real-world deployment with actual individuals, moving beyond synthetic datasets. Investigating alternative person re-identification models that offer faster inference without compromising accuracy is also essential, as this will improve the system's speed and responsiveness for real-time applications. Lastly, expanding the system's search capabilities will be crucial to its overall functionality and effectiveness.
Incorporating additional search conditions for identifying individuals, such as emotion recognition, gender identification, and age estimation, as well as detection of specific items like masks, personal protective equipment (PPE), or weapons, would enable more detailed and context-aware insights about individuals, supporting advanced analytics and improving security and surveillance applications.
Strengths:
- The system can handle a large volume of video input, making it suitable for big surveillance systems.
- The system recognizes almost all persons, regardless of their appearance, with reliable accuracy.
- Thanks to the user-friendly, intuitive interface, users can set up a variety of scenarios by selecting particular regions, cameras, or persons by color.
Weaknesses:
- The hardware and the amount of video input have a significant impact on how quickly the tracking results are produced.
- The system's precision on real-world footage of actual persons has not been examined.
- The graphical user interface's person search function lacks diversity; options such as searching by emotion, or by the presence of masks, protective gear, or weapons, still need to be added.
Table 5.1: Strengths and weaknesses of the system
REFERENCES
[1] Milind Naphade, Shuo Wang, David C. Anastasiu, Zheng Tang, et al., "The 7th AI City Challenge," 2023
[2] Dong-Sun Kim and Jinsan Kwon, "Moving Object Detection on a Vehicle Mounted"
[3] M Fotouhi, A R Gholami, and S Kasaei, "Particle Filter-Based Object Tracking Using Adaptive Histogram," 2006
[4] David Bařina, "Gabor Wavelets in Image Processing," 2016
[6] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," 1960
[7] Lijun Zhou and Jianlin Zhang, "Combined Kalman Filter and Multifeature Fusion," 2019
[8] Joseph Redmon, Ali Farhadi, "YOLOv3: An Incremental Improvement," 2018
[9] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," 2022
[10] Chuyi Li, Lulu Li, Hongliang Jiang, Kaiheng Weng, Yifei Geng, Liang Li, Zaidan Ke, Qingyuan Li, Meng Cheng, Weiqiang Nie, Yiduo Li, Bo Zhang, Yufei Liang, Linyuan Zhou, Xiaoming Xu, Xiangxiang Chu, Xiaoming Wei, Xiaolin Wei, "YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications," 2022
[11] Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," 2020
[12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"
[13] Xiongwei Wu, Doyen Sahoo, Steven C. H. Hoi, "Recent Advances in Deep Learning for Object Detection," 2019
[14] Glenn Jocher, Alex Stoken, Jirka Borovec, NanoCode012, ChristopherSTAN, Liu Changyu, Laughing, tkianai, Adam Hogan, lorenzomammana, yxNONG, AlexWang1900, Laurentiu Diaconu, Marc, wanghaoyang0106, ml5ah, Doug, Francisco Ingham, Frederik, Guilhen, Hatovix, "ultralytics/yolov5: v3.1 - Bug Fixes and Performance Improvements," 2020
[15] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, "CSPNet: A New Backbone That Can Enhance Learning Capability of CNN," 2019
[16] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya Jia, "Path Aggregation Network for Instance Segmentation," 2018
[17] Renjie Xu, Haifeng Lin, Kangjie Lu, Lin Cao and Yunfei Liu, "A Forest Fire Detection System Based on Ensemble Learning," 2017
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," 2015
[19] Guo et al., "Improved YOLOv4-CSP Algorithm for Detection of Bamboo Surface Sliver Defects with Extreme Aspect Ratio," 2022
[20] WZMIAOMIAO, "YOLOv5 (6.0/6.1) brief summary," [Online] Available: https://github.com/ultralytics/yolov5/issues/6998
[21] Glenn Jocher, Ayush Chaurasia, and Jing Qiu, "YOLO by Ultralytics (Version 8.0.0)," Ultralytics, 2023 [Online] Available: https://github.com/ultralytics/ultralytics
[22] RangeKing, "Brief summary of YOLOv8 model structure," [Online] Available: https://github.com/ultralytics/ultralytics/issues/189
[23] Xiang Li et al., "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection"
[24] Zahra Soleimanitaleb, Mohammad Ali Keyvanrad, "Single Object Tracking: A Survey of Methods, Datasets, and Evaluation Metrics"
[25] Verma, Rachna, "A Review of Object Detection and Tracking Methods," 2017
[26] "Kalman Filter - Wikipedia," [Online] Available: https://en.wikipedia.org/wiki/Kalman_filter
[27] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, Ben Upcroft, "Simple Online and Realtime Tracking," 2017
[28] H. W. Kuhn, "The Hungarian Method for the Assignment Problem"
[29] Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, Konrad Schindler, "MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking," 2015
[30] Bui Tien Tung, "SORT - Deep SORT: A Perspective on Object Tracking (Part 2)," Viblo [Online]
[31] Gioele Ciaparrone, Francisco Luque Sánchez, Siham Tabik, "Deep Learning in Video Multi-Object Tracking: A Survey," 2019
[32] Min Tian, Weiwei Zhang, and Fuqiang Liu, "On-Line Ensemble SVM for Robust Object Tracking," 2007
[33] Minyoung Kim, Stefano Alletto, Luca Rigazio, "Similarity Mapping with Enhanced Siamese Network," 2017
[34] Sergey Zagoruyko, Nikos Komodakis, "Wide Residual Networks," 2017
[35] Nicolai Wojke, Alex Bewley, Dietrich Paulus, "Simple Online and Realtime Tracking with a Deep Association Metric," 2017
[36] Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, Wei Jiang, "TransReID: Transformer-based Object Re-Identification," 2021
[37] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, Shengjin Wang, "Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)," 2018
[38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition," 2015
[39] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," 2009
[40] Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, "Siamese Neural Networks for One-shot Image Recognition"
[41] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, "Deep High-Resolution Representation Learning for Visual Recognition," 2020
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, "Attention Is All You Need," 2017
[43] Sean Benhur, "Hierarchical Clustering: Agglomerative + Divisive Clustering," [Online] Available: https://builtin.com/machine-learning/agglomerative-clustering
[44] "Performance Benchmark of YOLO v5, v7 and v8," [Online] Available: https://www.stereolabs.com/blog/performance-of-yolo-v5-v7-and-v8/
[45] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang, "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection"
[46] Luu Thanh Tung, Tran Vu Hoang, "Design of an intelligent traffic management system (Thiết kế hệ thống quản lý giao thông thông minh)," UTE, 2023
[47] Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R. Scott, Weilin Huang, "TOOD: Task-aligned One-stage Object Detection," 2021
[48] Papers with Code, "Person Re-Identification on Market-1501-C," [Online] Available: https://paperswithcode.com/sota/person-re-identification-on-market-1501-c
[49] Tao Ruan, Ting Liu, Zilong Huang, Yunchao Wei, Shikui Wei, Yao Zhao, Thomas Huang , "Devil in the Details: Towards Accurate Single and Multiple Human Parsing," 2018
[50] Fabian Herzog, Xunbo Ji, Torben Teepe, Stefan Hörmann, Johannes Gilg, Gerhard Rigoll, "Lightweight Multi-Branch Network for Person Re-Identification," 2023
[51] Lingxiao He, Xingyu Liao, Wu Liu, Xinchen Liu, Peng Cheng, Tao Mei, "FastReID: A PyTorch Toolbox for General Instance Re-identification," 2023
[52] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro, "Online multi-target tracking with strong and weak detections," 2016
[53] F Yu, W Li, Q Li, Y Liu, X Shi, and J Yan, "Poi: Multiple object tracking with high performance detection and appearance feature," 2016
[54] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, Konrad Schindler, "MOT16: A Benchmark for Multi-Object Tracking," 2016
[55] Quang Qui-Vinh Nguyen et al., "Multi-camera People Tracking With Mixture of Realistic and Synthetic Knowledge," 2023
[56] Xin Jin, Jiawei Han, "K-Means Clustering," 2011
[57] "scikit-image," [Online] Available: https://scikit-image.org/
[58] NVIDIA, "NVIDIA Omniverse," [Online] Available: https://www.nvidia.com/en-us/omniverse/
[59] Steen, Doug, "Precision-Recall Curves," [Online] Available: https://medium.com/@douglaspsteen/precision-recall-curves-d32e5b290248