INTRODUCTION
An overview of drone racing
An unmanned aerial vehicle (UAV) or uncrewed aerial vehicle, commonly known as a drone, is an aircraft without any human pilot, crew, or passengers on board.
In the early 2010s, enthusiasts started enhancing remote-controlled quadcopters for increased speed and agility, leading to informal gatherings where hobbyists raced their custom-built drones through makeshift courses.
Figure 1-1 Demonstration of Drone Model
The first formal drone racing event can be traced back to 2014, when the "Aerial Grand Prix" was held in the United States. This event marked the beginning of organized and competitive drone racing, drawing attention from both participants and spectators.
In 2015, the Drone Racing League (DRL) was founded and played a crucial role in shaping modern drone racing. DRL introduced standardized rules and regulations akin to traditional motorsports, leading to the emergence of professional racing circuits. Concurrently, various organizations and leagues worldwide have started hosting drone racing competitions. Notable events like the World Drone Prix in Dubai and the MultiGP National Championship have garnered international acclaim, significantly boosting the sport's popularity.
Figure 1-2 Drone Racing League gate for racing
The University of Zurich, a leading institution in drone technology, recently hosted a Drone Racing challenge focused on navigating high-performance racing drones through a series of gates in the shortest time possible. This competition features two main challenges: state-based drone racing, which benchmarks algorithms for planning and control regarding time-optimal flight and robustness against unknown factors like aerodynamics, and vision-based drone racing. Participants will utilize the university's in-house simulator, Flightmare, to compete in this event.
Figure 1-3 Drone Racing gate from University of Zurich Challenge
Autonomous drone racing is an exciting challenge in the robotics field, focusing on creating agile UAVs that can match or exceed the skills of expert human pilots By achieving higher speeds, these drones not only complete tasks more quickly but also enhance the range of potential missions, proving especially beneficial for practical applications such as search-and-rescue operations.
The surge in drone racing popularity has been significantly driven by technological advancements and innovation, resulting in the creation of faster, more agile, and reliable drones that enhance the excitement of competitions. Additionally, the emergence of artificial intelligence and autonomous systems has broadened the scope of drone racing to include AI-powered challenges, where the focus lies on developing sophisticated algorithms and AI agents that can autonomously navigate racing courses, highlighting the remarkable progress in AI and robotics.
AI development in Drone Racing Challenge
Drone racing challenges in AI focus on creating autonomous systems that enable unmanned aerial vehicles (UAVs) to navigate intricate racing courses with speed and accuracy. The primary objective is to develop AI-driven drones that can surpass human pilots in competitive racing. A common strategy to tackle this challenge involves utilizing a modular software architecture, which divides the problem into specific sub-modules such as perception, motion planning, and control, facilitating effective control and planning solutions.
Innovative techniques for agile robots have emerged, demonstrating that drones can surpass human pilots in specific scenarios by achieving time-optimal trajectories. To enhance perception tasks, deep learning methods are employed, effectively managing visual degradation. These techniques extract features from RGB images of racing gates and predict their position and orientation, facilitating the development and continuous updating of a 3D map for improved planning.
The rapid development of AI-powered robots is driven by advanced deep learning algorithms. In the realm of vision-based drone racing, however, these robots face unique challenges, including extreme speeds and accelerations, as well as perceptual constraints like a narrow field of view, motion blur, and limited sensing range. These factors necessitate quick reaction times, creating substantial hurdles for current technological methods.
An innovative approach to autonomous drone racing employs an end-to-end method that infers control inputs directly from raw images, significantly reducing system latency by encapsulating the entire process within a single deep neural network. Research utilizing imitation learning and a variant of ResNet has shown that drones can learn to follow trajectories and navigate gates smoothly by deriving desired velocities from RGB images, demonstrating strong generalization in new environments through domain randomization. Further advancements include the integration of attention maps for enhanced robustness in tracking high-speed trajectories and a two-stage learning framework that associates image features with a privileged expert policy using ground-truth data. These methods highlight the effectiveness of learning control policies directly from raw images, continuing the trend of end-to-end models. However, such policies remain black-box modules within the system, which makes it difficult to verify and debug their output actions. This aspect should be taken into consideration when applying these approaches in practical applications.
Algorithms encode diverse experiences related to geometry, shapes, colors, and interactions into high-dimensional features, enabling robust data-driven approaches. However, this feature extraction process incurs significant computational costs, resulting in increased system latency. To mitigate this issue, traditional feature extraction methods, such as oriented FAST and rotated BRIEF (ORB), are utilized for their computational efficiency. These methods have been rigorously tested and are widely adopted in challenging conditions for mobile robots.
The ORB-SLAM3 algorithm excels in robot state estimation by leveraging ORB features, but it often discards these extracted features after their initial application, resulting in inefficient use of computational resources. This study investigates the possibility of repurposing these computed features for enhancing robot navigation, aiming to optimize resource utilization and improve overall performance.
The proposed method integrates features with a lightweight convolutional neural network (CNN) to develop an end-to-end motion policy for controlling an agile aerial robot in autonomous drone racing This policy enables the robot to navigate a racing track autonomously, similar to human-piloted drone racing sports, by safely maneuvering through a series of unknown gates using only an RGB camera.
Figure 1-4 Overview of the proposed method
Rectified RGB images captured by the onboard monocular camera are analyzed to generate specific input representations These representations include (1) ORB, which focuses solely on the locations of ORB features, and (2) Mask, which consists of ORB-feature guided masks that encompass local patches around the identified ORB feature locations in the original image.
A third representation, (3) RGBO, integrates the ORB-feature locations as an extra channel in a standard RGB image. The processed input is then analyzed by a lightweight CNN that predicts command velocities v_x, v_y, v_z, and v_ψ, enabling the quadrotor to navigate through the next gate. The navigation policy is refined through imitation learning, utilizing desired smooth minimum-snap trajectories for optimal performance.
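As a concrete illustration, the fourth-channel construction could look like the following sketch, which uses OpenCV's ORB detector to mark keypoint locations in a binary mask and stacks it onto the RGB image; the function name build_rgbo and the feature budget are illustrative choices and not the thesis' exact implementation.

```python
import cv2
import numpy as np

def build_rgbo(rgb_image, n_features=200):
    """Append a binary ORB-feature-location channel to an RGB image (illustrative sketch)."""
    orb = cv2.ORB_create(nfeatures=n_features)
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2GRAY)
    keypoints = orb.detect(gray, None)

    # Mark each detected keypoint location in a single-channel mask.
    feature_channel = np.zeros(gray.shape, dtype=np.uint8)
    for kp in keypoints:
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        feature_channel[y, x] = 255

    # Stack the mask as a fourth channel: H x W x 4 (R, G, B, ORB).
    return np.dstack([rgb_image, feature_channel])
```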
Our innovative approach addresses the high costs associated with gathering real-world training data by exclusively employing imitation learning within a simulation framework. This strategy enables trained robots to function effectively in both simulated and real-world scenarios through zero-shot sim-to-real transfer learning. As illustrated in Figure 1-4, extensive experiments conducted in both environments evaluate the performance of our method. The findings highlight the advantages of leveraging computed features for end-to-end motion planning in autonomous racing quadrotors. By integrating ORB feature positions with a lightweight CNN backbone, our method significantly outperforms traditional baseline approaches that depend solely on RGB images.
The contributions of this work can be summarized as follows:
(1) Introducing a novel DL-based method called ORB-Net, which effectively learns an end-to-end motion policy for controlling agile aerial robots by incorporating ORB feature positions;
(2) Demonstrating the effectiveness of the proposed method in autonomous drone racing through experiments in both simulated and real-world scenarios.
This paper is structured to provide a comprehensive overview of the autonomous drone racing problem in Section II, followed by an in-depth discussion of the proposed end-to-end motion planning method in Section III. Section IV offers experimental evaluations that compare the new approach with state-of-the-art baselines, while Section V concludes the study.
Significance of the thesis
In the near future, UAVs are poised to revolutionize goods transportation. Achieving agile and efficient performance in autonomous drone navigation is crucial for real-world applications. This thesis presents "ORB-Net," an innovative end-to-end motion planning system that integrates traditional ORB features with deep learning, enhancing the capabilities of aerial robots across diverse scenarios.
Figure 1-5 Illustration of an autonomous drone in a real-world application
This research significantly advances autonomous drone racing by tackling two major challenges: computational efficiency and robust perception. To mitigate the computational demands of deep learning methods, we utilize ORB features, recognized for their efficiency, as a foundational element in the motion planning system. By seamlessly integrating these features with RGB images, ORB-Net minimizes system latency, facilitating swift and real-time navigation.
Our research highlights the practical significance of ORB-Net through extensive real-world experimentation, demonstrating its superior performance in autonomous drone racing tasks compared to state-of-the-art methods. This end-to-end motion planning system enables aerial robots to navigate racing tracks with remarkable speed, precision, and adaptability, closely matching the skills of expert human pilots.
The research on ORB-Net, which combines ORB features with deep learning, has significant implications for real-world applications beyond drone racing. Its potential to enhance aerial robot navigation can greatly benefit search-and-rescue missions, surveillance operations, and precision agriculture, where quick and accurate decision-making is essential. By addressing challenges related to computational efficiency and robust perception, this innovative approach marks a substantial advancement in autonomous navigation across various fields.
The thesis significantly enhances the research presented in the paper "ORB-Net: End-to-End Planning Using Feature-Based Imitation Learning for Autonomous Drone Racing," showcased at the 56th ISR Europe 2023 conference.
This thesis introduces an innovative method that merges traditional feature extraction with deep learning for autonomous drone navigation. By combining ORB features with deep learning in ORB-Net, the research enhances the efficiency, agility, and adaptability of autonomous drones, positioning them as essential tools for tackling real-world challenges and advancing aerial robotics across diverse applications.
METHODOLOGY
Understanding Neural Networks and Artificial Intelligence
A Convolutional Neural Network (CNN) is a deep learning algorithm designed to process input images by assigning learnable weights and biases to various aspects and objects in the image, enabling effective distinction between them. Unlike traditional classification methods, CNNs require minimal preprocessing. Their architecture is modeled after the visual cortex, reflecting the connectivity patterns of neurons in the human brain, where individual neurons respond to stimuli within a specific area known as the "receptive field." CNNs are widely acknowledged as the leading solution for image classification, object recognition, and even pixel-level tasks such as semantic segmentation.
Convolutional Neural Networks (CNNs) are gaining traction across various fields in computer vision due to their ability to learn spatial information hierarchically. Comprising convolution layers, pooling layers, and fully connected layers, CNNs utilize backpropagation to automatically extract and transform features into final outputs, such as classifications. The convolution layer plays a vital role, employing mathematical operations to analyze pixel values stored in a two-dimensional grid. By applying an optimized kernel at each image position, CNNs efficiently detect features that may appear anywhere in the image, allowing for increasingly complex feature extraction as data progresses through the network.
The training process of a neural network involves fine-tuning parameters, such as kernels, to reduce the gap between predicted outputs and actual labels. This is achieved through optimization techniques, including backpropagation and gradient descent.
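To make this concrete, the following is a minimal PyTorch sketch of one such update step on a toy model and a dummy batch; the layer sizes, loss function, and learning rate are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# A tiny illustrative model: one convolution followed by a linear regression head.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 4),                         # e.g. four regression outputs
)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

images = torch.randn(16, 3, 120, 160)        # dummy batch of RGB images
targets = torch.randn(16, 4)                 # dummy labels

optimizer.zero_grad()
loss = criterion(model(images), targets)     # gap between predictions and labels
loss.backward()                              # backpropagation computes the gradients
optimizer.step()                             # gradient descent updates the kernels/weights
```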
Figure 2-1 Demonstration of neural network architecture
A Convolutional Neural Network (CNN) is structured with an input layer, multiple hidden layers, and an output layer. In feedforward neural networks, hidden layers are termed so because their inputs and outputs are obscured by activation functions and the final convolution process. These hidden layers primarily execute convolution operations between the convolutional kernel and the input matrix.
A ConvNet architecture is constructed using three key types of layers: convolution layers, pooling layers, and fully connected layers, similar to traditional neural networks. These layers are systematically stacked to transform the initial image data, starting from pixel values and culminating in class scores at the final layer.
In a convolutional layer, the input tensor is structured as (batch size) x (input height) x (input width) x (input channels). Upon processing through this layer, the input transforms into a feature map, or activation map, with the dimensions of (batch size) x (feature map height) x (feature map width) x (output channels). Typically, a convolutional layer in a Convolutional Neural Network (CNN) exhibits the following properties that govern feature extraction from the input data (a short code sketch follows this list):
1 Filters/Convolution Kernels: Defined by width and height (hyperparameters)
2 Number of Input Channels and Output Channels: Hyperparameters. The number of input channels of a layer must equal the channel depth of its input, while the number of output channels determines the depth of the resulting feature map
3 Hyperparameters of the Convolution Operation: Such as padding, stride, and dilation rate
4 Convolutional layers perform convolution operations on the input and pass the results to the next layer. This is similar to the response of a neuron in the visual cortex to a specific stimulus
5 Convolution reduces the number of free parameters, allowing the network to have greater depth
6 Furthermore, Convolutional Neural Networks are well-suited for grid-like data formats (such as images) because the spatial relationships between individual features are considered during the convolution and/or pooling process
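A minimal PyTorch sketch of these hyperparameters is shown below; note that PyTorch orders tensors as (batch, channels, height, width) rather than the channels-last layout described above, and the specific filter counts are arbitrary.

```python
import torch
import torch.nn as nn

# Kernel size, channel counts, stride, padding, and dilation are the
# hyperparameters listed above.
conv = nn.Conv2d(in_channels=3, out_channels=16,
                 kernel_size=5, stride=2, padding=2, dilation=1)

x = torch.randn(1, 3, 120, 160)        # (batch, input channels, height, width)
feature_map = conv(x)
print(feature_map.shape)               # torch.Size([1, 16, 60, 80])
```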
Convolutional networks incorporate both local and global pooling layers in addition to traditional convolutional layers, effectively reducing data size by merging outputs from groups of neurons into single neurons in subsequent layers. Local pooling typically operates on small clusters, such as 2 x 2 regions, while global pooling processes all neurons within the feature map. The two prevalent types of pooling are max pooling, which selects the maximum value from each region, and average pooling, which calculates the average value. Pooling layers help reduce the input size, decrease the number of parameters and computations, and create spatial invariance for features.
The fully connected layer is a crucial component in feedforward neural networks, where every neuron in the previous layer connects to all neurons in the current layer. This layer plays a vital role in generating the final output, particularly for classification tasks. In convolutional neural networks (ConvNets), the fully connected layer is employed after the convolutional and pooling layers to transform the extracted features into the ultimate classification result.
The ConvNet architecture consists of key components, including convolutional layers, pooling layers, and fully connected layers. Convolutional Neural Networks (CNNs) have proven to be exceptionally effective in a range of computer vision and image processing tasks.
To determine the marker position, two processes are required, as illustrated in Figure 2-1. Initially, pre-processing is conducted once for each camera sensor to obtain the camera parameters, which are essential for projecting real-world points onto pixel images. Using a chessboard image as input, the camera calibration block computes and outputs the necessary camera parameters.
The next step involves processing to determine the marker's position, which is updated regularly. The pose estimate block calculates the marker's position in the camera frame using the marker's position in the image as input. Additionally, the image processing block provides the marker's position in the captured real-world image. By integrating the camera parameters with the marker's position in the image, the pose estimate block ultimately computes the marker's position in the camera frame, serving as input for the control process.
The ResNet (Residual Network) architecture has significantly transformed deep learning by changing how deep neural networks are designed and trained. Introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in their 2015 paper "Deep Residual Learning for Image Recognition," this framework addresses the challenges of training very deep networks, enabling improved performance and efficiency in various applications. ResNet has since emerged as a foundational model in computer vision, setting new benchmarks in accuracy and generalization across numerous visual recognition tasks.
Reinforcement Learning in Robotics
Reinforcement learning, a subset of machine learning, focuses on selecting optimal actions to maximize rewards in various situations. This technique is utilized by software and machines to determine the best behavior or path to take. Unlike supervised learning, which relies on labeled training data, reinforcement learning operates without predefined answers, allowing the agent to learn from its own experiences to accomplish tasks effectively.
Figure 2-3 Reinforcement Learning is an area of machine learning
Reinforcement Learning (RL) focuses on optimizing decision-making by learning the best actions to take in an environment to maximize rewards. Unlike supervised or unsupervised machine learning, RL relies on a trial-and-error approach, accumulating experience through interaction with the environment to develop optimal behaviors.
Reinforcement learning employs algorithms that learn from outcomes to determine subsequent actions. After each action, these algorithms receive feedback to assess the accuracy of their choices, categorizing them as correct, neutral, or incorrect. This approach is particularly effective for automated systems that must make numerous small decisions independently, without human intervention.
Reinforcement Learning (RL) has its roots in early AI and psychology, drawing inspiration from the trial-and-error learning processes seen in both animals and humans. In recent decades, significant progress has been made in RL, enabling its successful application in various domains such as robotics, game playing, finance, recommendation systems, and more.
The Components of Reinforcement Learning
Reinforcement Learning involves several key components that together form the framework for training an RL agent. These components are:
Figure 2-4 The concepts of Reinforcement Learning
An agent is an entity that engages with its environment, learns from experiences, and makes informed decisions based on the knowledge it acquires This can include computer programs, robots, or any systems specifically designed to execute particular tasks.
The environment encompasses the external world that an agent interacts with, serving as the context for its operations It is within this setting that the agent learns and adapts by receiving feedback, which can manifest as rewards or punishments.
The state reflects the current environment and encompasses all essential information for decision-making by the agent While some situations allow for a fully observable state, where the agent has complete insight into the environment, many real-world scenarios present a partially observable state, necessitating the agent to infer hidden information.
An action refers to the choice made by an agent to impact its surroundings This decision is influenced by the agent's current state and the insights gained from previous experiences.
In reinforcement learning, the reward is the immediate feedback an agent receives from the environment following an action This feedback acts as a crucial signal, guiding the agent toward favorable outcomes and assessing the quality of its decisions.
A policy is a strategic framework or a collection of guidelines that an agent utilizes to identify the most effective action in any given situation The main goal for the agent is to develop an optimal policy that enhances the total reward accrued over time.
The value function plays a vital role in reinforcement learning by estimating the expected cumulative reward an agent can obtain from a specific state while adhering to a particular policy This function is essential for assessing the desirability of various states, ultimately guiding the agent towards making more informed decisions.
In certain reinforcement learning (RL) algorithms, a model component is utilized to predict the environment's response to the agent's actions without direct interaction This modeling capability enhances the agent's ability to explore and plan more effectively, leading to improved efficiency in learning and decision-making.
Reinforcement learning is a process where an agent interacts with its environment across various time steps. During each step, the agent observes the current state, chooses an action according to its policy, and receives a reward from the environment. The primary objective of the agent is to learn from this feedback and improve its policy to maximize the cumulative reward it receives over time.
The typical RL process can be described in the following steps (a minimal code sketch follows this list):
1 Initialization: The agent initializes its policy, value function, and other necessary parameters
2 Observation: The agent observes the current state of the environment
3 Action Selection: The agent selects an action based on its current policy and strategy, such as exploration vs exploitation
4 Action Execution: The agent executes the chosen action in the environment
5 Feedback: The environment provides feedback to the agent in the form of a reward signal, indicating the immediate desirability of the action
6 Update: The agent updates its value function and policy based on the received reward and the new state of the environment
7 Repeat: Steps 2 to 6 are repeated for multiple iterations or episodes, allowing the agent to improve its policy through learning from experience
8 Convergence: The agent continues to learn and update its policy until it converges to an optimal or near-optimal solution
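The sketch below condenses steps 2 to 7 into a generic Python loop; the env and agent objects and their methods (reset, step, sample_action, best_action, update) are hypothetical interfaces introduced only for illustration, and epsilon-greedy action selection stands in for the exploration-versus-exploitation strategy mentioned in step 3.

```python
import random

def train(env, agent, episodes=500, epsilon=0.1):
    """Generic RL interaction loop (illustrative sketch, not a specific algorithm)."""
    for _ in range(episodes):
        state = env.reset()                                  # step 2: observe the initial state
        done = False
        while not done:
            if random.random() < epsilon:                    # step 3: explore ...
                action = env.sample_action()
            else:                                            # ... or exploit current knowledge
                action = agent.best_action(state)
            next_state, reward, done = env.step(action)      # steps 4-5: execute, get feedback
            agent.update(state, action, reward, next_state)  # step 6: update value/policy
            state = next_state                               # step 7: repeat until the episode ends
```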
A key challenge in reinforcement learning lies in balancing exploration and exploitation, where the agent must navigate between trying new actions and leveraging existing knowledge to optimize its cumulative rewards.
Exploration is crucial for acquiring new information about the environment and identifying improved strategies. This process is particularly important during the initial phases of learning, when the agent's policy remains undefined.
Imitation Learning in Robotics
Reinforcement Learning (RL) is an intriguing area of machine learning where an agent engages with an environment to learn an optimal strategy through a policy and reward system. The main goal of RL is to develop a policy that maximizes cumulative rewards over time. Despite the impressive achievements of RL algorithms, they often encounter difficulties in environments with sparse rewards and when the design of reward functions becomes intricate.
Imitation Learning (IL) effectively addresses challenges in reinforcement learning by utilizing expert demonstrations from humans or skilled agents, rather than depending solely on sparse rewards or manually designed reward functions. This method enables the agent to learn by imitating the demonstrated behaviors of the expert, making it especially beneficial in scenarios where reward signals are scarce or when a direct reward function is unavailable, as in the case of teaching a self-driving vehicle.
Imitation Learning (IL) allows agents to leverage the expertise of demonstrators, resulting in enhanced learning efficiency and improved performance. This approach is particularly beneficial in situations where expert knowledge is accessible or when the complexity of the environment makes it challenging to create a suitable reward function manually.
Key Concepts in Imitation Learning
Imitation learning is particularly advantageous when an expert can easily demonstrate the desired behavior instead of defining a complex reward function or directly learning the policy. At the core of imitation learning is the environment, which functions as a Markov Decision Process (MDP).
Figure 2-5 The life cycle of an imitation learning training process
Imitation Learning fundamentally relies on expert demonstrations, which are sequences of actions executed by domain experts to achieve specific tasks. These expert demonstrations act as the training dataset for the learning algorithm. Experts may be human operators, experienced robots, or any entity capable of performing the task proficiently.
Imitation Learning aims to develop a behavioral cloning policy that translates input states into actions, allowing an agent to replicate an expert's behavior. The primary objective is to closely mirror the expert's decision-making process to achieve comparable task performance.
Imitation Learning is effectively a supervised learning approach that utilizes a dataset of state-action pairs derived from expert demonstrations. The algorithm aims to establish a mapping from states to actions by minimizing the difference between the actions predicted by the imitation policy and those demonstrated by the expert.
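The sketch below shows this supervised formulation in PyTorch; the policy network, the mean-squared-error loss, and the format of the demonstration batches are illustrative assumptions rather than a prescribed recipe.

```python
import torch
import torch.nn as nn

def imitate(policy, demonstrations, epochs=10, lr=1e-3):
    """Fit a policy to expert state-action pairs by plain supervised regression (sketch).

    `demonstrations` is assumed to be an iterable of (states, expert_actions) tensor batches.
    """
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for states, expert_actions in demonstrations:
            optimizer.zero_grad()
            loss = loss_fn(policy(states), expert_actions)   # distance to the expert's actions
            loss.backward()
            optimizer.step()
    return policy
```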
Inverse Reinforcement Learning (IRL) is closely linked to Imitation Learning, focusing on deducing the reward function that drives an expert's actions. By analyzing the expert's preferences, IRL seeks to generalize the expert's behavior to unfamiliar scenarios.
Several methods have been proposed for Imitation Learning, each with its strengths, weaknesses, and use cases. Let's explore some of the prominent Imitation Learning techniques:
Behavioral Cloning is a fundamental type of Imitation Learning that involves a learning algorithm mimicking the actions of an expert. By utilizing a dataset of expert demonstrations, the algorithm develops a policy aimed at reducing the discrepancy between the expert's actions and its own predicted actions.
Behavioral Cloning is easy to implement and can be effective for low-dimensional action spaces.
However, Behavioral Cloning suffers from the problem of compounding errors. Small inaccuracies in the learned policy can lead to significant deviations from the expert's behavior, especially in complex environments.
Figure 2-6 Behavioural cloning can fail if the agent makes a mistake
Figure 2-7 DAgger solves the problem with an online training procedure
To mitigate compounding errors in Behavioral Cloning, the Dataset Aggregation (DAgger) algorithm was developed. DAgger enhances the training process by iteratively gathering new data with the current imitation policy while consulting an expert for the correct actions in the visited states. This newly acquired dataset is integrated with the existing data, allowing for the retraining of the policy.
DAgger effectively minimizes the distributional shift between the training data and the states the learned policy actually encounters, enhancing the performance of imitation policies. It does, however, require that the expert remain available to label the newly visited states during training.
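A compact sketch of the DAgger loop is given below; expert_act, fit, and rollout are assumed user-supplied callables (the expert's action for a state, a behavioral-cloning training routine such as the one sketched earlier, and a function returning the states visited by a policy), so the code illustrates the aggregation logic rather than a complete pipeline.

```python
def dagger(expert_act, fit, rollout, initial_dataset, iterations=5):
    """DAgger sketch: roll out the learner, query the expert, aggregate, retrain."""
    dataset = list(initial_dataset)        # expert demonstrations to start from
    policy = fit(dataset)                  # behavioral cloning on the initial data
    for _ in range(iterations):
        visited = rollout(policy)                            # learner's own state distribution
        dataset += [(s, expert_act(s)) for s in visited]     # expert labels those states
        policy = fit(dataset)                                # retrain on the aggregated dataset
    return policy
```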
Adversarial Imitation Learning, or Generative Adversarial Imitation Learning (GAIL), integrates concepts from Generative Adversarial Networks (GANs) into the Imitation Learning paradigm. In this framework, a discriminator is employed alongside the imitation policy, tasked with differentiating between actions taken by an expert and those produced by the imitation policy.
The imitation policy aims to produce actions that deceive the discriminator, which is concurrently trained to enhance its ability to differentiate between actions from the expert and those from the imitation policy. This adversarial training approach fosters the development of a more robust imitation policy, proving effective in complex, high-dimensional, and continuous action environments.
While Imitation Learning has shown promise in various applications, it comes with its share of challenges:
The effectiveness of the imitation policy is heavily impacted by the quality and variety of expert demonstrations. A biased or limited training dataset can hinder the policy's ability to generalize to new situations, resulting in subpar performance in real-world applications.
Covariate shift occurs when there is a discrepancy between the data distribution used to train an imitation policy and the actual data distribution in real-world environments. This mismatch can lead to decreased robustness and suboptimal performance of the learned policy when applied outside of the training context.
In some cases, obtaining expert demonstrations can be expensive, time-consuming, or even impossible. Access to domain experts may be limited, especially in safety-critical or specialized domains.
Input representations in computer vision and the ORB feature
Humans possess an innate ability to effortlessly perceive the three-dimensional world, such as distinguishing the intricate shapes and translucency of a flower vase while separating the flowers from their background.
Computer vision seeks to enable computers to interpret their surroundings similarly to humans. It emphasizes analyzing images and videos to reconstruct essential properties such as object shapes, intensity, and color distributions.
Input representation is essential in transforming data for machine analysis and decision-making. This section examines three key input representations in computer vision: RGB (Red, Green, Blue), depth, and ORB (Oriented FAST and Rotated BRIEF) features. We will analyze the characteristics, applications, and importance of each representation.
RGB representation is one of the most common and fundamental ways to represent visual data. It is based on the three primary colors: red, green, and blue.
The RGB color model combines red, green, and blue to create the color of each pixel in an image, enabling vibrant color displays on screens and monitors. This representation is effective for capturing and analyzing the visual characteristics of various objects and scenes.
Figure 2-8 RGB pixel representation in an image
Color Information: RGB representation encodes color information, allowing for a rich and vivid depiction of the visual world
Visual Realism: RGB images closely resemble how humans perceive the world, making it intuitive for both humans and machines to interpret
Limitations: While RGB is excellent for capturing surface appearances, it lacks depth information and can struggle with scenes where color alone is insufficient to distinguish objects
Image Classification: RGB data is widely used for training deep learning models to classify objects in images
Object Detection: Many object detection algorithms leverage the color information from RGB images to identify and locate objects within a scene
Semantic Segmentation: RGB-based semantic segmentation assigns each pixel in an image to a particular class, enabling the identification of different objects and their boundaries
Depth representation captures the distance from the camera to different points in a scene, adding a crucial dimension of information beyond color. This technique enhances the understanding of a scene's geometry and structure. Various methods, including stereo vision, structured light, and time-of-flight cameras, can effectively represent depth.
3D Information: Depth representation adds a third dimension to the visual data, enabling the creation of 3D models and understanding of object distances
Scene Understanding: Depth helps in differentiating between objects at different depths, facilitating improved scene understanding
Challenges: Depth representation can be sensitive to lighting conditions and might struggle with transparent or reflective surfaces
Robotics: Depth data is essential for robots to navigate and interact with their environment. It helps them avoid obstacles and grasp objects accurately
Augmented Reality (AR): AR applications use depth information to overlay virtual objects seamlessly onto the real world
Gesture Recognition: Depth data enables the recognition of hand gestures and movements, which is crucial in applications like gaming and virtual reality
Oriented FAST and Rotated BRIEF (ORB) [9] was developed at OpenCV Labs by Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R. Bradski in 2011 as an efficient and viable alternative to SIFT and SURF. ORB was conceived mainly because SIFT and SURF are patented algorithms.
Figure 2-9 Demonstration of ORB Feature on RGB Image
The ORB (Oriented FAST and Rotated BRIEF) algorithm enhances the established FAST keypoint detector and BRIEF descriptor, both recognized for their efficiency and performance. By introducing key improvements, ORB significantly boosts the capabilities and overall effectiveness of these techniques:
• The addition of a fast and accurate orientation component to FAST
• The efficient computation of oriented BRIEF features
• Analysis of variance and correlation of oriented BRIEF features
• A learning method for decorrelating BRIEF features under rotational invariance, leading to better performance in nearest-neighbor applications
ORB is a powerful feature detection and description algorithm used in feature matching and object recognition. Unlike traditional methods that analyze the entire image, ORB targets localized keypoints that highlight unique patterns, making it ideal for situations where precise feature identification is essential.
2.4.1.5 FAST (Features from Accelerated Segment Test)
The FAST (Features from Accelerated Segment Test) algorithm is an integral part of the ORB (Oriented FAST and Rotated BRIEF) feature detection method. It evaluates the brightness of a central pixel in relation to 16 surrounding pixels positioned in a circular layout. Based on these brightness comparisons, the algorithm classifies the surrounding pixels as either lighter or darker, facilitating efficient feature extraction in image processing.
FAST keypoints are identified when a significant number of surrounding pixels are either lighter or darker than the central pixel, indicating their relevance. These keypoints are essential as they convey critical information about edge locations within an image.
A key limitation of the FAST algorithm is its inability to provide orientation information and multiscale capabilities. To overcome this, the ORB algorithm utilizes a multiscale image pyramid, which represents the image at various resolutions. This approach allows ORB to detect keypoints at multiple scales, offering partial scale invariance. By implementing the FAST algorithm at each level of the pyramid, ORB effectively identifies keypoints across different scales and orientations, thereby improving its robustness against variations in object size and orientation.
Figure 2-10 Each level in the pyramid contains a downsampled version of the image from the previous level
After locating keypoints, ORB assigns an orientation to each keypoint (for example, left- or right-facing) depending on how the intensity levels change around that keypoint.
2.4.1.6 BRIEF (Binary Robust Independent Elementary Features)
BRIEF takes the keypoints found by the FAST algorithm and converts them into binary feature vectors so that together they can represent an object. A binary feature vector, also known as a binary feature descriptor, is a feature vector that contains only 1s and 0s, as shown in Figure 2-11.
Figure 2-11 Each keypoint is described by a feature vector, which is a 128–512-bit string
The Binary Robust Independent Elementary Features (BRIEF) descriptor begins its process with Gaussian smoothing of the image to enhance resilience against high-frequency noise. This crucial initial step helps prevent the descriptor from being unduly affected by such noise. Following the smoothing, BRIEF selects a random pair of pixels from a defined neighborhood around the designated keypoint.
The designated neighborhood, known as a "patch," is a square area defined by specific width and height, centered around a keypoint of interest. The first pixel of a randomly selected pair is sampled from a Gaussian distribution, which is centered at the keypoint's location, with its spread determined by a parameter called "sigma."
The second pixel in a random pair is chosen from a separate Gaussian distribution, which is centered on the location of the first pixel. The spread of this distribution is set to twice the sigma value of the initial distribution, thereby creating a correlation between the two selected pixels.
The final step involves a binary assignment: if the brightness of the first pixel is greater than that of the second pixel, a value of 1 is assigned to the corresponding bit; otherwise, a value of 0 is assigned. This comparison is repeated 128 times for every keypoint, creating a 128-bit descriptor vector for each keypoint in the image. However, BRIEF lacks rotation invariance, which led to the development of rBRIEF (rotation-aware BRIEF), used by ORB to address this limitation.
BRIEF creates a compact binary descriptor by analyzing pixel pairs in a localized patch, effectively capturing unique keypoint characteristics. It remains robust against noise through Gaussian smoothing, with the relative brightness of pixel pairs forming the binary values that make up the descriptor.
Figure 2-12 BRIEF selects a random pixel pair and assigns a value to it
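In practice, the combined detector and descriptor are available through OpenCV, as in the short sketch below; the file name and parameter values are illustrative, and note that OpenCV's ORB implementation produces 256-bit (32-byte) descriptors rather than the 128-bit variant described above.

```python
import cv2

img = cv2.imread("gate.png", cv2.IMREAD_GRAYSCALE)   # any grayscale test image

# ORB = oriented FAST keypoints + rotation-aware BRIEF descriptors.
orb = cv2.ORB_create(nfeatures=500, nlevels=8)        # 8-level image pyramid
keypoints, descriptors = orb.detectAndCompute(img, None)

print(len(keypoints))         # number of detected keypoints
print(keypoints[0].angle)     # orientation assigned to the first keypoint (degrees)
print(descriptors.shape)      # (num_keypoints, 32): 32 bytes = 256 binary tests per keypoint
```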
2.4.1.7 Characteristics of ORB Feature Representation:
Keypoint Detection: ORB identifies keypoints in an image that have significant variations from their surroundings, making them suitable for matching across images
Invariant to Rotation and Scale: ORB features are designed to be invariant to changes in rotation and scale, making them robust for object recognition tasks
Limited Semantic Information: ORB features capture local patterns but lack high-level semantic understanding
2.4.1.8 Applications of ORB Feature Representation:
1 Object Tracking: ORB features are used to track objects across consecutive frames in videos or real-time camera feeds
2 Image Stitching: In panorama creation, ORB features are utilized to align and merge multiple images seamlessly
3 Robot Localization: ORB features can aid in localization tasks by matching observed features with a known map
4 Combining Representations for Enhanced Performance
Virtual Environment
Unreal Engine, first released by Epic Games in 1998, is a powerful game development engine originally designed for first-person shooter games. Today, it is widely used to create various genres, including fighting games, RPGs, stealth games, and MMORPGs, utilizing C++ coding. Known for its robust capabilities and extensive industry support, Unreal Engine is particularly favored for developing real-time 3D games.
A game engine is essential for creating engaging games and their interfaces, as it provides the necessary tools and features for developers to design and script their concepts. By utilizing a game engine, developers can freely customize their games according to their vision. Currently, Unreal Engine stands out among developers and game artists due to its extensive range of tools, ease of customization, and capabilities for creating high-definition AAA games across multiple platforms.
The platform offers users a comprehensive suite of tools and editors, enabling them to manage and customize their game properties effectively. Its core components, including the graphics engine, online module, physics engine, sound engine, input system, and Gameplay Framework, facilitate seamless game operation while providing essential resources for creating stunning artwork.
Given these advantages of Unreal Engine in the game development industry, we utilize it for the drone navigation task. This offers several significant benefits:
Unreal Engine's advanced graphics and physics simulation technology allows for the development of incredibly realistic environments, which is essential for testing and validating navigation algorithms in conditions that mirror real-world scenarios. Accurate simulation is vital for both automotive and drone navigation systems to guarantee dependable performance.
Scenario testing in Unreal Engine enables the creation of diverse environments, including urban streets and rural landscapes, to effectively simulate various driving and flying conditions. This capability allows for comprehensive testing of navigation systems against challenges like adverse weather, intricate road layouts, and dynamic obstacles.
Unreal Engine's versatility enables the creation of tailored environments that accurately reflect real-world locations. This capability is crucial for automotive navigation, allowing for the recreation of cityscapes and specific roadways for thorough testing. Additionally, for drone navigation, it facilitates the simulation of various terrains and structures, enhancing the evaluation of navigation algorithms.
Unreal Engine facilitates the integration of virtual sensors, such as cameras, LiDAR, and radar, essential for automotive and drone navigation systems. This capability allows for the effective testing of sensor fusion algorithms, object detection, and collision avoidance techniques within a controlled environment.
Unreal Engine's real-time capabilities enable interactive testing, allowing users to control vehicles or drones within simulations. This hands-on approach lets developers observe behaviors and make immediate adjustments to algorithms, significantly speeding up the iterative testing process and enhancing development efficiency.
AI and machine learning are becoming essential for navigation in both the automotive and drone industries. Unreal Engine serves as a powerful platform for training and testing these algorithms in intricate, dynamic environments, allowing AI systems to learn from a variety of scenarios.
Unreal Engine excels at generating extensive datasets for diverse navigation scenarios, which are crucial for training machine learning models and validating navigation algorithms. This approach significantly reduces the time and resources required compared to traditional real-world data collection methods.
Building physical test environments for automotive and drone navigation can be expensive and labor-intensive. Unreal Engine presents a cost-effective solution by enabling extensive testing in a virtual environment, eliminating the need for costly physical infrastructure.
Development Efficiency: Unreal Engine's visual scripting (Blueprints) allows developers to quickly prototype and experiment with navigation behaviors and algorithms. This reduces development time and fosters innovation.
Cross-Disciplinary Collaboration: Unreal Engine's user-friendly interface facilitates collaboration between different teams, including software developers, AI engineers, and domain experts. This is especially valuable in automotive and drone navigation projects, which require expertise from various fields.
Integrating Unreal Engine into automotive and drone navigation projects greatly improves the development and testing phases, leading to more dependable navigation systems. Its features, including realistic simulation, scenario testing, sensor integration, AI training, and cost-effective development, position it as the optimal solution for these applications.
AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles
In 2017, Microsoft Research launched AirSim, a simulation platform designed for AI research and experimentation. Over five years, AirSim has significantly contributed to the sharing of research code and the exploration of innovative ideas in aerial AI development. As technology has progressed, particularly in aerial mobility and autonomous systems, drone delivery has evolved from a science fiction idea into a viable business model, creating new demands in the industry.
Figure 2-14 AirSim quadrotor model flying in an Unreal Engine environment
AirSim is an open-source simulator for drones and cars, built on Unreal Engine, offering cross-platform capabilities and support for software-in-the-loop simulations with popular flight controllers like PX4 and ArduPilot. It also enables hardware-in-the-loop simulations with PX4, providing physically and visually realistic environments. Designed as a plugin for Unreal, AirSim allows seamless integration into any Unreal project. The development team aims to establish AirSim as a platform for AI research, facilitating experiments with deep learning, computer vision, and reinforcement learning algorithms for autonomous vehicles. Additionally, AirSim provides APIs for data retrieval and vehicle control in a platform-independent manner.
AirSim offers both manual driving and programmatic control for drones and cars. Users can manually operate drones with a remote control and drive cars using arrow keys. The platform provides APIs for programmatic interaction with vehicles, allowing users to retrieve images, access vehicle states, and control the vehicles. These APIs are available through RPC and support multiple programming languages, including C++, Python, C#, and Java. Additionally, they can be utilized as part of a cross-platform library for deployment on companion computers, enabling users to write and test code in the simulator before executing it on real vehicles. AirSim emphasizes transfer learning and related research by facilitating easy recording of training data for deep learning applications, thanks to full access to images, states, and control of the drones.
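As a brief illustration, the Python sketch below connects to a running AirSim multirotor simulation, takes off, and grabs one uncompressed camera frame; the camera name "0" and the image post-processing are common usage patterns rather than requirements of our pipeline, and the channel order of the returned image can vary between AirSim versions.

```python
import numpy as np
import airsim

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

# Request one uncompressed frame from camera "0" (the default front camera).
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene, pixels_as_float=False, compress=False)
])
frame = np.frombuffer(responses[0].image_data_uint8, dtype=np.uint8)
frame = frame.reshape(responses[0].height, responses[0].width, 3)   # H x W x 3 image
```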
In this project, we utilize a quadrotor model from the AirSim library to navigate various Unreal Engine environments, aiming to collect a substantial dataset for training an End-to-End drone racing model, as illustrated in Figure 2-14.
PROPOSED METHOD
Simulation environment
In this project, I utilized Unreal Engine to develop a simulation environment for data collection and deep learning model training. The quadrotor model was controlled using AirSim, which offers realistic sensor simulations for various autonomous system sensors, including RGB cameras, depth cameras, LiDAR, GPS, and IMU. This capability enables thorough testing and fine-tuning of our model with diverse input representations.
In this section, I will focus on how I effectively utilized various features to set up the training and testing environment for the neural network, rather than detailing the creation of drone models and virtual objects.
In our training environment, we developed a straightforward simulation featuring a Gate prototype and textured backgrounds Consistent with prior drone racing research, we utilized square-shaped gates adorned with checkerboard patterns at each corner, measuring 2.1 meters in outer diameter and 1.5 meters in inner diameter, for both simulated and real-world experiments.
The proposed models will utilize simulation environments created with the photo-realistic game engine Unreal Engine, interfaced through the open-source AirSim library This approach aims to minimize the disparity between simulated and real-world visuals by leveraging AirSim's capabilities to replicate complex ambient effects, closely resembling the appearance of actual environments.
To enhance success rates in sim-to-real transfer learning, domain randomization is utilized by varying backgrounds and lighting conditions during training This approach helps the model concentrate on the gate rather than the surrounding noise and perturbations, thereby improving its adaptability across various scenarios.
Figure 3-1 Domain randomization with different backgrounds
To bridge the gap between simulation and reality, I developed a collection of 31 texture backgrounds that were randomly altered during data collection, as illustrated in Figure 3-1. For training, we employed six gates to generate a single data collection trajectory, with each gate's position randomly initialized within our predefined constraints. The specifics of the training process incorporating domain randomization will be elaborated in Section 3.2.
Perception system setup
Figure 3-2 Overview of the proposed approach
This section addresses the challenge of achieving robust and agile flight for a quadrotor in dynamic environments by utilizing an end-to-end perception network for navigation and control. The proposed method analyzes images captured by an onboard camera, specifically a 300 × 200 pixel RGB image, to determine the desired flight direction and navigation speed. The network outputs a tuple {Vx, Vy, Vz, Yaw}, where Vx, Vy, and Vz represent the velocity vector along the x, y, and z axes towards a new goal in 3D space, while Yaw indicates the rotation around the quadrotor's vertical axis. Additionally, v ∈ [0,1] signifies the normalized desired speed for approaching the target.
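For intuition, a predicted tuple of this form could be forwarded to the simulated quadrotor roughly as sketched below; treating the yaw output as a yaw rate and the 50 ms command duration are assumptions made for illustration, not the thesis' exact control interface.

```python
import airsim

def apply_command(client, vx, vy, vz, yaw_rate, duration=0.05):
    """Send one predicted velocity/yaw command to the simulated quadrotor (sketch)."""
    client.moveByVelocityAsync(
        vx, vy, vz, duration,
        drivetrain=airsim.DrivetrainType.MaxDegreeOfFreedom,
        yaw_mode=airsim.YawMode(is_rate=True, yaw_or_rate=yaw_rate),
    )
```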
In this section, I describe how we developed our end-to-end system.
To develop an effective perception network, we begin by selecting the most relevant baseline models. After extensive research, we identify several compact models suitable for deployment on embedded devices. Our proposed methods are then evaluated against these established baselines.
DroNet is based on a modified ResNet-8 CNN, initially introduced by Loquercio et al. to enhance inference speed by reducing the number of filters and parameters in the fully connected layer. Kaufmann et al. later adapted DroNet with two MLPs to predict the mean and variance of a drone's pose for model predictive control in racing. This study also adjusts the output layers for compatibility while resizing the input layer to accommodate 160×120 images. Unlike the original DroNet, which only processes RGB images, this research explores various input representations to enhance the model's adaptability across different scenarios.
A variant of GateNet [20], the lightweight perception framework previously proposed by the research team I worked with on this project, is created by altering the output layer to produce compatible outputs. This baseline also takes RGB images as input to the network.
GateNet revolutionizes traditional object detection by integrating the predictions of gate center, orientation, and distance into a single deep neural network, unlike standard detectors that only identify gate locations. Utilizing a wide-FOV fish-eye RGB camera, GateNet processes scene images to accurately estimate the gate's center location, distance, and orientation relative to the drone's body frame, as illustrated in Figure 3-3.
Figure 3-3 An illustration of the working principles of our gate perception system
The data will be utilized to reconstruct the gate within a 3D environment and implement an extended Kalman filter for accurate gate pose estimation. The network's design and parameter count facilitate rapid inference speed, which is essential for successful autonomous drone racing.
The GateNet architecture utilizes a convolutional neural network (CNN) with a single fully connected layer at the end. The convolutional layers are responsible for extracting features from the input image, while the fully connected layer predicts confidence values, gate centers on the image plane, and the distance and orientation of the gates in relation to the drone.
The GateNet architecture, illustrated in Figure 3-4, incorporates conv2D layers followed by batch normalization and ReLU activation functions. To maintain the 4:3 aspect ratio of the input images, the dense layer is reshaped into a tensor with 3 rows and 4 columns. The output tensor has a depth of 5, reflecting the network's predictions, which include (i) a confidence value, (ii) x and (iii) y offsets, as well as (iv) relative distance and (v) orientation for each cell.
GateNet features six convolutional layers followed by one fully connected layer. The first five convolutional layers incorporate batch normalization, ReLU activation, and max pooling with a size of two, while the final convolutional layer includes batch normalization and ReLU without pooling. This architecture reduces the feature activations to a compact 3×5×16 shape, facilitating real-time inference by avoiding computational bottlenecks. The extracted features are then flattened into a 1D vector of size 240, allowing all hidden neurons from the 3D tensor to connect with those in the dense layer, similar to the YOLO architecture. We prefer the flatten operation over other dimensionality reduction techniques to prevent information loss, given the small dimension of the last convolutional layer. Finally, the flattened vector is linked to a fully connected layer, and no non-linearity is applied after this layer, producing an output vector reshaped to R × C × 5.
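A PyTorch sketch of this layout is given below for orientation; only the final 3×5×16 feature shape, the 240-unit flatten, and the R × C × 5 output grid are taken from the description above, while the per-layer filter counts and paddings are illustrative placeholders.

```python
import torch
import torch.nn as nn

class GateNetBackbone(nn.Module):
    """Sketch of a GateNet-style CNN: six conv blocks, flatten to 240, one dense layer."""

    def __init__(self, rows=3, cols=4, channels=(16, 16, 16, 16, 16, 16), in_ch=3):
        super().__init__()
        blocks, prev = [], in_ch
        for i, ch in enumerate(channels):
            blocks += [nn.Conv2d(prev, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU()]
            if i < 5:                                # max pooling on the first five blocks only
                blocks.append(nn.MaxPool2d(2))
            prev = ch
        self.features = nn.Sequential(*blocks)
        self.head = nn.Linear(3 * 5 * 16, rows * cols * 5)   # flatten(240) -> 60, no non-linearity
        self.rows, self.cols = rows, cols

    def forward(self, x):                            # x: (batch, 3, 120, 160)
        feats = self.features(x)                     # -> (batch, 16, 3, 5)
        out = self.head(feats.flatten(1))            # -> (batch, 60)
        return out.view(-1, self.rows, self.cols, 5)

# Example: a 160x120 input yields a (1, 3, 4, 5) prediction grid.
print(GateNetBackbone()(torch.randn(1, 3, 120, 160)).shape)
```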
The output layer consists of R rows and C columns, and while reshaping the output vector does not influence the network's forward pass during inference or the backward pass in backpropagation, it provides a more intuitive representation of the target vectors.
Each cell of this output grid stores five key components: (i) the confidence value, (ii) the x-axis offset of a gate's center within the image, (iii) the y-axis offset of a gate's center within the image, (iv) the orientation of the gate in relation to the drone, and (v) the distance from the gate to the drone.
The ground-truth creation process involves calculating the pixel coordinates of a gate center using a world-to-image transformation. Center offsets are determined based on the top-left corner of the grid cell that includes the gate center. Each cell in the grid is assigned a confidence value, center offsets, distance, and orientation. Importantly, the use of linear activations in the final fully connected layer allows for negative center offsets, enabling GateNet to predict gate centers for partially observed gates, even when their centers are not visible in the input image.
To prepare target samples for training, the input image is divided into R rows and C columns. The offsets x and y of each gate are computed as the differences between the gate's center and the top-left corner of the corresponding grid cell; this allows GateNet to predict gate centers for partially observed gates even when the center does not fall inside the image. The offsets are normalized by the cell size but, because no bounding non-linear activation is applied, they can take negative values when a gate center lies outside the input image. Finally, a target tensor of shape R × C × 5 stores the target variables (Figure 3-5c). The confidence value (c) is set to 1 if a gate center falls inside the corresponding cell, and the distance (d) and orientation (θ) are measured in meters and radians, respectively, relative to the gate. To match the 4:3 aspect ratio of the input images, we empirically set R = 3 and C = 4.
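The sketch below illustrates, under the assumptions of the text, how such a target tensor could be assembled once a gate center has been projected to pixel coordinates; the function name, the handling of out-of-image centers, and the normalization details are illustrative assumptions rather than the thesis's actual implementation.

```python
import numpy as np

R, C = 3, 4            # grid rows and columns (matches the 4:3 image aspect ratio)
W, H = 160, 120        # input image size in pixels

def make_target(gates):
    """gates: list of (u, v, distance_m, orientation_rad) for gates visible in the frame."""
    cell_w, cell_h = W / C, H / R
    target = np.zeros((R, C, 5), dtype=np.float32)       # (confidence, x, y, d, theta)
    for u, v, d, theta in gates:
        # Cell responsible for this gate; clip so partially visible gates whose
        # centers fall outside the image are still assigned to a border cell.
        col = int(np.clip(u // cell_w, 0, C - 1))
        row = int(np.clip(v // cell_h, 0, R - 1))
        # Offsets relative to the cell's top-left corner, normalized by the cell
        # size; they may be negative because no bounding activation is used.
        x_off = (u - col * cell_w) / cell_w
        y_off = (v - row * cell_h) / cell_h
        target[row, col] = (1.0, x_off, y_off, d, theta)
    return target

# Example: one gate centered at pixel (100, 40), 4.2 m away, rotated 0.1 rad.
# make_target([(100.0, 40.0, 4.2, 0.1)])
```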
ORB-Net: Lightweight drone navigation network with ORB features
ORB-Net uses a network architecture with a modified backbone, extending the original design from [17] with an extra dense layer and outputs adjusted for real-valued velocities. The additional dense layer introduces new parameters that allow the network to regress velocities more accurately. Furthermore, the input layer has been adapted to accommodate different input representations.
This work employs the lightweight GateNet CNN architecture as its backbone, which takes input images of 160×120 pixels; the input tensor is resized with zero padding to preserve the aspect ratio. The architecture consists of six 2D convolutional layers, each with multiple filters followed by batch normalization, a ReLU activation, and 2×2 max pooling. The final convolutional layer, with a 3×5×16 output, omits max pooling and is instead flattened so that the extracted features connect to the hidden neurons of the subsequent dense layers.
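As a rough illustration of these modifications, the sketch below reuses the GateNetSketch backbone from the earlier example and replaces the grid output with an extra dense layer that regresses real-valued velocity commands; the hidden width (64), the four-dimensional velocity output, and the channel counts for the different input representations are assumptions, not details taken from the thesis.

```python
import torch.nn as nn

class ORBNetSketch(nn.Module):
    """Sketch of an ORB-Net-style head on top of the GateNet-style backbone."""

    def __init__(self, in_channels=3, hidden=64, vel_dim=4):
        super().__init__()
        # Convolutional backbone reused from the GateNetSketch example above;
        # in_channels follows the input representation (e.g. 3 for RGB, 1 for an
        # ORB-location map, 4 for RGB stacked with an ORB channel).
        self.backbone = GateNetSketch(in_channels=in_channels).features
        self.head = nn.Sequential(
            nn.Flatten(),                      # 3x5x16 feature map -> 240 values
            nn.Linear(3 * 5 * 16, hidden),     # the extra dense layer added in ORB-Net
            nn.ReLU(inplace=True),
            nn.Linear(hidden, vel_dim),        # real-valued velocity commands
        )

    def forward(self, x):                      # x: (N, in_channels, 120, 160)
        return self.head(self.backbone(x))
```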
EXPERIMENTS AND RESULTS
Evaluation environment setup
We thoroughly assess the proposed method across various simulated and real-world scenarios that the robot has not encountered during training. In these novel environments, we position three gates and observe the drone's ability to navigate through them to complete a trajectory. The difficulty of the tracks is categorized into distinct levels, as illustrated in Figure 4-1.
Figure 4-1 Simulation scenarios. (a) Easy-level tracks, in which consecutive racing gates do not change orientation and only slightly vary their spatial locations. (b) Medium-level tracks, where gates slightly change both orientation and spatial location against a more cluttered background. (c) Difficult-level tracks, in which the illumination is more challenging.
We test our model in three different environments with respect to their levels of difficulty.
The self-built simulation environment depicted in Figure 4-1 is designed for testing drone racing projects and closely mirrors real scenarios in Open-AirLab. This virtual setup features strategically positioned gates and favorable lighting conditions that enhance visibility for the drone's camera. The gates are aligned roughly in a straight line, minimizing orientation differences and ensuring the drone can easily see the next gate as it passes through the current one. With only slight positional adjustments between gates, this experiment showcases the model's performance in an ideal environment.
Easy-level tracks feature bright backgrounds and consist of consecutive racing gates with slightly varying spatial locations and similar orientations, spaced approximately 5 meters apart. In contrast, medium-level and difficult-level tracks present gates that change in both orientation (±30°) and spatial location, set against a more cluttered backdrop. Additionally, difficult-level tracks use darker lighting conditions, which increase the visual challenge.
Table 4-1 Success rates (in percentage) for gate passing on racing tracks with different levels of difficulty in various simulation environments
The success rates in gate passing for each model are presented in Table 4-1. For each model, the simulated drone is tested in ten different runs with varying initial positions and heading angles, and the success rate is recorded as the number of gate passages without collision over the total number of possible passages. The model learning from ORB feature locations only, ORB-Net-O, does not perceive the environment accurately, resulting in poor performance on all tracks and environments. However, adding the ORB feature locations as an additional channel of the input tensor considerably increases the success rate on medium-level and difficult-level tracks; as a result, ORB-Net-RGBO outperforms all the baselines and the other proposed methods in all scenarios. The method using guided-masked input, ORB-Net-Mask, which masks patches around the ORB feature locations in the original RGB images, performs poorly in medium and difficult scenarios. This is possibly because the patch locations change frequently due to the inconsistency of the computed ORB features, which may negatively affect the updates of the associated weights in the CNN. It can also be noted that ORB-Net-RGB outperforms GateNet-RGB while using similar convolutional layers in its network architecture, primarily due to the additional dense layer, which introduces parameters that improve the accuracy of regressing command velocities.
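To make the RGB-O representation concrete, the sketch below renders OpenCV ORB keypoint locations into an extra channel and stacks it onto the resized RGB frame; the binary keypoint map, the feature budget, and the normalization are illustrative assumptions rather than details taken from the thesis.

```python
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=200)   # feature budget is an assumption

def rgbo_input(bgr_frame):
    """Build a 4-channel RGB-O tensor from a BGR camera frame."""
    frame = cv2.resize(bgr_frame, (160, 120))
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    keypoints = orb.detect(gray, None)

    # Render ORB feature locations into a binary channel.
    orb_channel = np.zeros((120, 160), dtype=np.float32)
    for kp in keypoints:
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if 0 <= u < 160 and 0 <= v < 120:
            orb_channel[v, u] = 1.0

    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return np.dstack([rgb, orb_channel])   # shape (120, 160, 4)
```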
Real-world experiment setup
We conducted further experiments in real-world settings at our lab, using a racing track with gates similar in size and appearance to those in the simulations. The end-to-end models were tested on a mini racing quadrotor equipped with an NVIDIA Jetson TX2 onboard computer, demonstrating the practical feasibility of our approach. The quadrotor carries a forward-facing ZED Mini stereo camera, of which only the right camera is used to keep a monocular setup. A Pixhawk-based flight controller translates the regressed velocities into motor speeds, while an external motion capture system provides the robot's state, minimizing errors from onboard estimation. The entire software framework is built on ROS.
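For context, a minimal rospy sketch of this command path is given below: the regressed velocities are forwarded as a geometry_msgs/Twist that a MAVROS/Pixhawk stack can turn into motor commands. The node and topic names are assumptions, since the thesis only states that a Pixhawk-based controller and ROS are used.

```python
import rospy
from geometry_msgs.msg import Twist

# Topic name is an assumption (MAVROS setpoint-velocity plugin); the thesis only
# states that regressed velocities are translated to motor speeds by a
# Pixhawk-based flight controller within a ROS framework.
rospy.init_node("orbnet_velocity_bridge")
cmd_pub = rospy.Publisher("/mavros/setpoint_velocity/cmd_vel_unstamped",
                          Twist, queue_size=1)

def publish_velocity(vx, vy, vz, yaw_rate):
    """Forward network-regressed body velocities as a ROS Twist command."""
    cmd = Twist()
    cmd.linear.x, cmd.linear.y, cmd.linear.z = vx, vy, vz
    cmd.angular.z = yaw_rate
    cmd_pub.publish(cmd)
```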
The racing quadrotor used in the real-world experiments features an NVIDIA Jetson TX2 onboard computer and a ZED Mini stereo camera; only the right RGB camera is used to maintain a monocular configuration.
In these experiments, models trained exclusively in simulations, without any retraining on real-world data, were assessed on a real racetrack under two different lighting conditions The findings for both the RGB-only model and the combined RGB and ORB feature model (RGB-O) are detailed in Table 4-2.
In a real-world experiment depicted in Figure 4-3, a drone successfully navigates a race track featuring two gates The red box highlights the drone's actual position after it has passed the first gate and is making an attempt to pass the second gate.
Table 4-2 Success rates (in percentage) of sim-to-real transferred models, measured as the gate-passing percentage on a real-world racing track under different lighting levels
As shown in Table 4-2, both models underperform in the real world compared to the simulation experiments, reflecting the negative effects of the sim-to-real gap. However, adding the ORB feature location channel improves the success rate for completing the racing track under low illumination. This finding supports our simulation-to-real approach: including ORB features in the input representation markedly improves the model's robustness in challenging lighting conditions.
CONCLUSION
The graduation thesis "ORB-Net: End-to-end Planning Using Feature-based Imitation Learning for Autonomous Drone Racing" enhanced my understanding of robotic automatic systems and vision-based control This research explores the use of ORB feature extraction for robot navigation by integrating these features into a deep learning approach that enables an agile aerial robot to autonomously complete a racing track Experimental results indicate that while ORB feature locations alone are insufficient for learning a navigational policy, combining them with RGB images significantly outperforms methods using only RGB images Additionally, methods trained solely in simulation showed poor performance in real-world scenarios, highlighting the need for improved sim-to-real transfer learning I would like to express my gratitude to Dr Nguyen Anh Quang and the IVSR lab and AirLab members for their invaluable support during this project.
[1] D. T. Tran*, D. D. Tran*, A. M. Nguyen*, Q. V. Pham, N. Shimada, J. H. Lee, and A. Q. Nguyen, "MonoIS3DLoc: Simulation to Reality Learning Based Monocular Instance Segmentation to 3D Objects Localization From Aerial View," IEEE Access, vol. 11, pp. 64170-64184, 2023, doi: 10.1109/ACCESS.2023.3288027.
[2] Huy Xuan Pham*, Micha Heiß*, Dung Tran*, Minh Anh Nguyen, Anh Quang Nguyen, and Erdal Kayacan, "ORB-Net: End-to-end Planning for Drones Using ORB Feature-based Imitation Learning," 56th International Symposium on Robotics (ISR Europe 2023), Germany, 2023.
[3] Cong Phuc Nguyen, Van Tuan Nguyen, Duc Dung Tran, Anh Minh Nguyen, Ngoc Phong Dao, Dinh Tuan Tran, Joohoo Lee, and Anh Quang Nguyen, multi-task deep learning for vehicle detection and tracking from UAV aerial views, in 2022 International Conference on Advanced Technologies for Communications (ATC), Hanoi, Vietnam, 2022, pp. 86-91, doi: 10.1109/ATC55345.2022.9942962.
[1] A. Loquercio, E. Kaufmann, R. Ranftl, M. Muller, V. Koltun, and D. Scaramuzza, "Learning high-speed flight in the wild," Science Robotics, vol. 6, no. 59, p. eabg5810, 2021.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[3] C. Pfeiffer, S. Wengeler, A. Loquercio, and D. Scaramuzza, "Visual attention prediction improves performance of autonomous drone racing agents," PLOS ONE, vol. 17, pp. 1-16, 2022.
[4] J. Fu, Y. Song, Y. Wu, F. Yu, and D. Scaramuzza, "Learning deep sensorimotor policies for vision-based autonomous drone racing," arXiv preprint arXiv:2210.14985, 2022.
[5] H. Nguyen, S. H. Fyhn, P. De Petris, and K. Alexis, "Motion primitives based navigation planning using deep collision prediction," in 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 9660-9667.
[6] H. X. Pham, H. I. Ugurlu, J. Le Fevre, D. Bardakci, and E. Kayacan, "Deep learning for vision-based navigation in autonomous drone racing," in Deep Learning for Robot Perception and Cognition, A. Iosifidis and A. Tefas, Eds., 2022, ch. 15, pp. 371-406. [Online]. Available via ScienceDirect.
[7] I. Bozcan, J. Le Fevre, H. X. Pham, and E. Kayacan, "GridNet: Image agnostic conditional anomaly detection for indoor surveillance," IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1638-1645, 2021.
[8] H. I. Ugurlu, X. H. Pham, and E. Kayacan, "Sim-to-real deep reinforcement learning for safe end-to-end planning of aerial robots," Robotics, vol. 11, no. 5, 2022. [Online]. Available: https://www.mdpi.com/2218-6581/11/5/109
[9] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in 2011 International Conference on Computer Vision, IEEE, 2011, pp. 2564-2571.
[10] C. Campos, R. Elvira, J. J. Gómez Rodríguez, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM," IEEE Transactions on Robotics, 2021. [Online]. Available: arXiv:2007.11898.
[11] O’Shea, K., & Nash, R (2015) An Introduction to Convolutional Neural Networks ArXiv, abs/1511.08458
[12] He, K., Zhang, X., Ren, S., & Sun, J (2015) Deep Residual Learning for Image Recognition 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778
[13] Torabi, F., Warnell, G., & Stone, P (2018) Behavioral Cloning from Observation International Joint Conference on Artificial Intelligence
In their 2011 paper presented at the fourteenth international conference on artificial intelligence and statistics, S Ross, G Gordon, and D Bagnell explore the relationship between imitation learning, structured prediction, and no-regret online learning They propose a reduction framework that connects these concepts, contributing to advancements in machine learning methodologies Their findings are documented in the JMLR Workshop and Conference Proceedings, pages 627 to 635.
[15] Suzuki, Satoshi & be, KeiichiA (1985) Topological structural analysis of digitized binary images by border following Computer Vision, Graphics, and Image Processing 30 32-46 10.1016/0734-189X(85)90016-7
[16] Ho, J., & Ermon, S (2016) Generative Adversarial Imitation Learning NIPS
[17] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., & Bengio, Y (2014) Generative Adversarial Nets NIPS
[18] Shah, S., Dey, D., Lovett, C., & Kapoor, A (2017) AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles International Symposium on Field and Service Robotics
[19] P Foehn, D Brescianini, E Kaufmann, T Cieslewski, M Gehrig, M Muglikar, and D Scaramuzza, “Alphapilot: Autonomous drone racing,” Autonomous Robots, vol 46, no 1, pp 307–320, 2022