
DEVELOPING A GESTURE RECOGNITION SYSTEM USING CAMERAS BASED ON LIGHTWEIGHT MACHINE LEARNING TECHNOLOGY (TINYML) OPERATING ON A MICROCONTROLLER

Student: Trần Mạch Tuấn Kiệt

Student ID: 19010214        Cohort: K13

Major: Control Engineering and Automation        Program: Full-time undergraduate

Supervisor: Dr. Lê Minh Huy

Hà Nội – 2024

Copies for internal use only in Phenikaa University


SOCIALIST REPUBLIC OF VIETNAM

Independence – Freedom – Happiness

EXPLANATION OF GRADUATION THESIS REVISIONS

- Faculty of Electrical and Electronic Engineering

Thesis author: Trần Mạch Tuấn Kiệt

Major: Control Engineering and Automation

Thesis defended on April 3, 2024

Topic: RESEARCH AND DEVELOPMENT OF A CAMERA-BASED GESTURE RECOGNITION SYSTEM USING LIGHTWEIGHT MACHINE LEARNING (TINYML) TECHNOLOGY OPERATING ON A MICROCONTROLLER

Supervisor: Dr. Lê Minh Huy

Following the Committee's comments and under the guidance of the supervisor, the thesis author has carefully incorporated the Committee's feedback and revised and supplemented the thesis in line with the Committee's conclusions. The revisions are detailed as follows:

1. The author revised and supplemented the thesis according to the Committee's comments:

Content | Old page | Revised to | New page
1. Condensed the related-work section | | |
2. Model training hyperparameter settings | | Summary table of the model training hyperparameter settings | 31
3. Definitions of the evaluation metrics | | Definitions of the evaluation metrics |


SUMMARY

This project introduces a new approach to camera-based hand gesture recognition on microcontrollers for control purposes. The proposed method uses a lightweight machine learning (TinyML) model designed specifically for deployment on resource-constrained platforms such as microcontrollers. Compared with previous methods, the proposal achieves competitive accuracy at a significantly smaller model size. The model is then deployed on an ARM Cortex-M7 processor and applied to control a magnetic scanner in a Non-Destructive Testing (NDT) system.

This project brings together applied knowledge of communication with and control of automation devices. In addition, it requires control programming techniques in both LabVIEW and Python to optimize the system's operation.

In the course of completing the project, the author gained solid knowledge, experience, and expertise in Deep Learning and TinyML, and successfully applied knowledge of device-to-device communication, automatic system control, and related topics.


DECLARATION

My name is: Trần Mạch Tuấn Kiệt

Major: Control Engineering and Automation

I carried out the graduation thesis titled: Research and development of a camera-based gesture recognition system using lightweight machine learning (TinyML) technology operating on a microcontroller.

I declare that this is my own research, carried out under the supervision of Dr. Lê Minh Huy.

The research content and results in this thesis are truthful and have not been published by other authors in any form. If any form of academic dishonesty is discovered, I accept full responsibility before the law.

SUPERVISOR


ACKNOWLEDGEMENTS

To complete this thesis, I would like to express my deep gratitude to everyone who supported and helped me throughout the project. The process of completing the thesis faced numerous obstacles and difficulties because of the novelty of the project. However, I received a great deal of help in every respect, in knowledge, materials, and encouragement, which allowed me to complete the graduation thesis with the best possible result.

First, I sincerely thank Dr. Lê Minh Huy, who directly supervised this thesis, gave me dedicated guidance, and helped me resolve the difficult problems encountered along the way.

Next, I would like to send my sincere thanks to all the lecturers of the Faculty of Electrical and Electronic Engineering, Phenikaa University, and especially to ICSLab (A4-705, Phenikaa University), for providing the most favorable conditions for me to complete this graduation thesis.

I look forward to receiving feedback from the lecturers so that I can address my shortcomings and continue to improve.

Sincerely, thank you!

Hà Nội, 2024
Student


Abstract

This project introduces a novel approach utilizing camera-based hand gesture recognition for control purposes. The proposed methodology employs lightweight machine learning models (TinyML) tailored for deployment on resource-constrained platforms such as microcontrollers. The proposal is compared with previous methods and yields competitive accuracy at a significantly smaller model size. Subsequently, the model is implemented on an ARM Cortex-M7 microprocessor and applied to govern a magnetic scanner within a Non-Destructive Testing (NDT) experimental system.

This project is an application of knowledge about communication with and control of automation devices. Furthermore, it integrates control programming techniques using both LabVIEW and Python to optimize the system's operation. In the course of completing the project, the author gained substantial knowledge, experience, and expertise in Deep Learning and TinyML, and successfully applied knowledge of communication between devices, automatic system control, and related topics.


CHAPTER III. EXPERIMENTAL RESULTS & BENCHMARK

3.1 Experimental setup & dataset

CHAPTER IV. BUILDING A CONTROL SYSTEM USING HAND GESTURES

4.4.1 Deploy model into OpenMV Cam H7

4.4.2 Control programming on LabVIEW

Table of Figures

Figure 1.2 Workflow of the entire research for hand gesture recognition (HGR). Part 1 involves the process of building a Deep Learning model. After obtaining the model and successfully deploying it on OpenMV, Part 2 describes the process of building a control system to integrate the HGR model into real applications.

Figure 2.1 Block diagram of the proposed method for hand gesture recognition.

Figure 2.2 Data augmentation processing.

Figure 2.3 Some samples after data augmentation with Random Zoom, Random Rotation, and Random Brightness.

Figure 2.4 Basic Convolutional Neural Network. A typical deep learning model can take an input and transform it into corresponding outputs; raw image data can be fed into the model without going through any traditional feature-extraction techniques [30].

Figure 2.5 Comparison between systems with TinyML (a) and without TinyML (b). TinyML is embedded directly on the microprocessors on the sensor to process data before seeking help from edge AI or cloud AI [31].

Figure 2.6 Kernel product operation: the process of multiplying the kernel with each submatrix taken from the input, moving along the length and width of the input.

Figure 2.7 Forward propagation and backward propagation of the convolution operation. Both processes follow the chain rule.

Figure 2.8 Operation of a standard convolutional layer (a) replaced by depthwise separable convolution with two separate layers: a depthwise layer (b) and a pointwise layer (c) [9].

Figure 2.9 Squeeze-and-Excitation block [24].

Figure 2.10 Proposed model architecture as the base model for hand gesture recognition. We use the first four stages as the feature extractor and the final stage as a lightweight classifier.

Figure 2.11 Detailed structure of each SE block used in the micro-bottleneck block: a) the SE Conv Block, which helps reduce the spatial dimension before entering b) the SE Residual Block, which better extracts features.

Figure 3.1 Some samples from the ASL dataset.

Figure 3.2 Some samples from the OUHANDS dataset.

Figure 3.3 Experimental setup.

Figure 3.4 Learning curves when training the model on ASL with expansion factor t = 3.0. The curves show the decrease in the loss function and the increase in accuracy; the model achieves its best results after more than 50 epochs, with nearly 100% accuracy on both the training and validation sets.

Figure 3.5 Accuracy and number of parameters when using two different expansion factors, t = 0.25 and t = 3.0.

Figure 3.6 Confusion matrices of the proposed method with a) t = 0.25 and b) t = 3.0 when applying transfer learning on the OUHANDS dataset.

Figure 3.7 Comparison of model size before and after applying the quantization technique for each of the following models: the proposed model with micro-bottleneck, Lightweight CNN [34], CrossFeat [33], and ExtriDeNet [16].

Figure 4.1 System design block diagram.

Figure 4.2 The control system in practice.

Figure 4.3 Motion controller PMC-1HS-USB.

Figure 4.4 Driver MD5-HD14.

Figure 4.5 Step motor A8K-M566.

Figure 4.6 OpenMV Cam H7.

Figure 4.7 OpenMV IDE.

Figure 4.8 LabVIEW front panel, which contains many types of objects placed inside decoration blocks.

Figure 4.9 LabVIEW block diagram. The input and output values and the buttons are represented on the front panel.

Figure 4.10 Simple block diagram of connecting to OpenMV through LabVIEW [40].

Figure 4.11 LabVIEW & OpenMV connection UI [40].

Figure 4.12 User interface for deploying the model to OpenMV.

Figure 4.13 OpenMV control algorithm flowchart.

Figure 4.14 Result shown in the OpenMV interface.

Figure 4.15 Algorithm flowchart of the LabVIEW program.

Figure 4.16 The 9 gestures being used, where "1 – Left" means that the class of the gesture below is 1 and LabVIEW considers it the "go left" command.

Figure 4.17 LabVIEW user interface.

Figure 4.18 When no mode is defined, no action is issued even though OpenMV captures a "left" gesture.

Figure 4.19 Change parameters.

Figure 4.20 Change mode.

Figure 4.21 Motors working in continuous mode.

Figure 4.22 Motors working in "use parameters" mode.


List of Tables

Table 2.1 Model configuration

Table 2.2 Details of the block configurations in the micro-bottleneck

Table 2.3 Details of the learning-process hyperparameters

Table 3.1 Summary of the ablation evaluation of the proposed model, comparing its accuracy and model size with some other modifications

Table 3.2 Summary of the results of the proposed model and previous models in HGR systems

Table 4.1 PMC-1HS-USB specifications

Table 4.2 MD5-HD14 specifications

Table 4.3 A8K-M566 specifications

Table 4.4 OpenMV H7 specifications

CHAPTER I INTRODUCTION

1.1 Introduction

With the development and increasing popularity of computers in daily life, bridging the communication gap between humans and machines has become a vital area of research. While conventional input methods like keyboards and mice remain prevalent, they do not represent the most natural way humans interact. Furthermore, as technologies like virtual reality, remote control, and augmented reality gain traction, traditional input devices often prove inappropriate. Considering these constraints, hand gesture recognition (HGR) emerges as a promising alternative, offering a more intuitive and natural means of interaction between humans and technology.

Hand gesture recognition, a pivotal component of human-computer interaction, has emerged as a transformative technology enabling users to communicate with computers through intuitive hand movements, eliminating the need for traditional input devices such as keyboards and mice and offering a more natural and immersive interaction experience. HGR aims to develop methods to detect and recognize human hand gestures and translate them into commands [1]. To achieve this objective, several techniques are employed to collect data or information for recognition, broadly categorized into two main approaches [2]. The first approach is sensor-based, wherein sensors attached directly to the user's hand or arm capture various data types such as shape, position, movement, or trajectory. Commonly used devices today include gloves with accelerometers, position sensors, and the like, or systems that employ electromyography (EMG) to detect and interpret muscle electrical impulses, then decode these signals into specific finger movements. More advanced approaches use vision-based techniques, employing one or more types of cameras to capture motion or gesture appearance. While this method is simple to set up, it must handle varied conditions such as diverse lighting and complex backgrounds, among others [3]. However, by leveraging computer vision techniques and deep learning models, hand gesture recognition systems can interpret and classify intricate hand poses and motions in real time, enabling seamless interaction with digital interfaces and environments. This approach has gradually become the main direction in recent years due to the development of vision-based applications and the trend toward touchless control in many fields such as gaming, electronic device control, and virtual reality applications [4]. In this study we consider only vision-based methods.

Hand gesture recognition currently has a multitude of applications across various domains, spanning industrial environments to daily life. Potential fields of application encompass gaming, offering enhanced user interaction experiences, and extend to touchless control systems, which streamline device interaction without physical contact. In the field of accessibility, HGR holds promise for facilitating communication through sign language translation, thereby helping hearing-impaired people integrate more easily. Moreover, within professional environments, HGR can be used to control robotic systems or to assist people with disabilities in communicating during medical tasks [5]. HGR also helps people with disabilities, the elderly, and children access computers faster, more conveniently, and more engagingly. In this era of computing, where devices are increasingly integrated into our daily lives, the ability to interact with technology effortlessly and intuitively has become important. Hand gesture recognition stands at the forefront of this revolution, promising to redefine the way we interact with computers, devices, and the digital world at large.

1.2 Related work

Due to decades of study in vision-based hand gesture recognition, early efforts achieved good outcomes by leveraging image characteristics in image processing. For instance, color and edge details are two of the most frequent attributes used to identify and discern specific gestures. Color detection principally targets isolating the skin tones of the hand against the environment. Edge finding facilitates extraction of the hand region as the figure of interest from the surroundings, based on discontinuities in pixel intensities. The method in [6] processed webcam images acquired through a low-cost camera in a multistep process. First, Jayashree et al. applied a gray-threshold technique together with median and Gaussian filters to prune noise and transform the original RGB images into denoised binary images. Following this preprocessing stage, a Sobel edge-detection algorithm was used to extract the region of interest. Finally, using a Euclidean-distance feature-matching methodology, the authors quantified the similarity between the centroid and area of the extracted edges in the test versus the training set. The proposed method was evaluated on a dataset consisting of the American Sign Language (ASL) alphabet, which contained 26 static hand gestures corresponding to each letter from A to Z. On this test set, the method achieved a recognition rate of 90.19%.

However, following the success of state-of-the-art deep learning models in image-related tasks, the domain of image processing has taken advantage of deep learning [7] [8] [9] [10] [11]. Rather than completely eliminating traditional vision techniques, a hybrid approach uses a Dual-Channel Convolutional Neural Network (DC-CNN), fusing the hand gesture images with hand edge images obtained by Canny edge detection during preprocessing [12]. The output is classified using a SoftMax classifier, and each of the two channel CNNs has its own weights. The proposed system's recognition rate is 98.02%. However, the performance of these methods is limited by how well the features selected by the handcrafted extractor represent the characteristics of the hand. Researchers recognized the superior representational capabilities of learned features over hand-engineered extraction and progressively transitioned toward end-to-end structured learning drawn directly from pixels. Hussain et al. [13] used two parallel CNNs, each a state-of-the-art architecture, Inception V3 and EfficientNet B0, that had achieved noticeable performance on various image-related tasks. Both models were trained on the same RGB images of recorded hand gestures. These models were evaluated on the ASL dataset, yielding accuracies of 90% and 99%, respectively. Improvements in sensor technologies bring new approaches leveraging depth image data captured by devices such as Kinect and Intel RealSense. The method in [14] uses two VGG19 networks with the same architecture but different input types. Specifically, VGG19-v1 was fed the RGB images to extract skin-color maps, while VGG19-v2 took the depth images as input to learn depth-based information. By combining the two streams of information, the authors achieved classification accuracy as high as 94.8% on the ASL dataset. More advanced deep learning models aim to combine multi-scale or multi-level features to enhance network learnability. The method in [15] uses an innovative two-stage approach. First, they applied deep learning with a lightweight encoder-decoder structure to transform the RGB images into segmented images. This lightweight structure is based on dilated residuals and uses atrous spatial pyramid pooling as a multi-scale extractor. After obtaining the desired segmentation, a double-channel CNN was fed the input RGB images and the corresponding segmented images to learn the necessary features separately. This method achieved 91.17% on the OUHANDS dataset with a model size of only 1.8 MB. To do this, the model takes advantage of depthwise separable convolution (DSC) layers in building the encoder-decoder and CNN structures. Recent methods recast the problem from classification to detection to achieve better results while still keeping a compact structure. Recent works such as [16] and [17] have improved the YOLO architecture by replacing the original backbone and neck components with lightweight modules. In particular, the approach described in [16], which uses ShuffleNetV2 as the backbone in YOLOv3, has achieved impressive results on two challenging datasets with complex backgrounds, reaching 99.5% on the senz3D dataset and 99.6% on the Microsoft Kinect dataset. Significantly, the model size was only 8.9 MB, compared with the 123.5 MB of the original YOLOv3 network. Although many studies have been carried out, they focus heavily on improving model accuracy and pay little attention to computational cost. This poses challenges for executing such models on low-cost, constrained hardware such as microcontrollers with limited memory capacity and computational speed, resulting in significant inference-time delays.
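As a toy illustration of the edge features these early pipelines rely on (a sketch in plain NumPy, not the exact implementation of [6]), a Sobel gradient-magnitude map highlights intensity discontinuities such as a hand outline:

```python
import numpy as np

# Sobel kernels for horizontal and vertical intensity gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def cross_correlate(image, kernel):
    """Slide the kernel over the image (valid mode, no padding)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def sobel_magnitude(gray):
    """Gradient magnitude: strong responses mark edges in the image."""
    gx = cross_correlate(gray, SOBEL_X)
    gy = cross_correlate(gray, SOBEL_Y)
    return np.hypot(gx, gy)

# A synthetic vertical step edge: zeros on the left, ones on the right.
gray = np.zeros((8, 8))
gray[:, 4:] = 1.0
edges = sobel_magnitude(gray)
print(edges.max())  # strongest response sits on the step columns
```

In a real pipeline this edge map would be thresholded and the resulting contour matched against training templates, as the cited works do with centroid and area features.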


This surge in adoption opens endless possibilities for various applications, such as smart manufacturing, personalized healthcare, precision agriculture, automated retail, and UAV applications, among others. The appeal of these low-cost, energy-efficient microcontrollers lies in their potential to facilitate a new frontier in technology known as tiny machine learning (TinyML) [19]. By deploying deep learning models directly on these compact devices, data analytics can be performed near the sensor, leading to a substantial expansion in the realm of Artificial Intelligence (AI) applications. Utilizing deep learning models on microcontrollers allows for localized intelligent tasks, leading to improved performance, privacy, security, and energy efficiency. The integration of deep learning on such tiny platforms presents an exciting opportunity to revolutionize the field of AI and further democratize its capabilities. Nevertheless, integrating deep learning models with microcontrollers poses significant challenges due to the constrained resources available on these devices [20]. Limited memory capacity can restrict the size of the model that can be deployed, while processing-power limitations can impact the speed and efficiency of model execution. Additionally, the limited battery life of microcontroller-based systems necessitates energy-efficient algorithms to ensure prolonged operation without rapidly draining the power source. Overcoming these challenges is crucial to fully leverage the potential of deep learning on microcontrollers and enable the deployment of intelligent applications in resource-constrained environments.

The widespread adoption of microcontrollers offers an opportunity to deploy HGR systems on these hardware platforms, thereby enhancing flexibility and expanding the application scope of such systems. Furthermore, leveraging resource-constrained devices enables improved optimization in terms of energy consumption, construction, and operational costs.

1.4 Problems and Research methods

Considering the above limitations, this research proposes a micro convolutional neural network (micro-CNN) architecture based on TinyML to identify the morphology of hand gestures. From this premise, we have developed a comprehensive HGR system tailored for controlling a non-destructive testing (NDT) system to conduct basic experiments. This marks the first step in investigating the practicality of the method in real-world environments. Our focus lies in constructing a lightweight deep learning model adept at recognizing gestures from RGB images captured by conventional cameras. Emphasizing accurate results from a single input frame, our model significantly enhances inference-time efficiency. The qualified model is integrated into an ARM Cortex-M7 processor on the OpenMV H7 platform. Additionally, we established a framework facilitating motor control through LabVIEW software. Figure 1.2 illustrates the workflow of this study. Furthermore, our approach is evaluated and compared with state-of-the-art models presented in prior research.

1.5 Article structure

The following sections of this document are presented in order. Chapter II describes our proposed method in detail, encompassing data preprocessing, augmentation, and the proposed model architecture. Chapter III provides a comprehensive evaluation of the results, including in-depth analysis and comparison against preceding methods; it also describes the benchmarking of our model's performance on a variety of microcontrollers from the STM32 family. The application of this proposal to build a practical system is presented in Chapter IV, which also describes in detail the process of building a framework integrated into the NDT system. Further experimental results and a demo are discussed in Chapter V. The conclusions are summarized and presented in the final Chapter VI.


Figure 1.2 Workflow of the entire research for hand gesture recognition (HGR). Part 1 involves the process of building a Deep Learning model. After obtaining the model and successfully deploying it on OpenMV, Part 2 describes the process of building a control system to integrate the HGR model into real applications.

CHAPTER II PROPOSED MODEL ARCHITECTURE

The main purpose of this research is to develop a compact HGR system tailored for microcontrollers with constrained resources. Considering the limitations mentioned above, our focus lies in proposing a lightweight CNN architecture that satisfies the requirements on model size, inference time, and computational cost. Additionally, we explored optimization techniques aimed at compressing model size before implementation on these microcontrollers. This outlines the major contributions of this research. The architecture is based on MobileNetV2 [21] and MobileNetV3 [22], with 2D depthwise separable convolution (DSC), bottleneck, and Squeeze-and-Excitation blocks [23]. Our proposal serves as a pixel-based feature extractor, extracting spatial features from the image. After obtaining high-level features using our architecture as the backbone, we integrate a classifier or detector as the top module depending on the intended use. The proposed method exhibits competitive results against state-of-the-art models when evaluated on two different datasets, the American Sign Language (ASL) [24] and OUHANDS [25] datasets. Figure 2.1 illustrates the block diagram of the process of building the proposed method. Before performing model training, we use data augmentation techniques to enhance training-data diversity and then normalize the entire dataset to make sure the pixel values of images are within a consistent range. We train the model using the classification module with the ASL dataset, whose substantial volume facilitates enhanced feature learning of hand gestures. Additionally, we employ transfer learning by fine-tuning the obtained pre-trained model on the OUHANDS dataset, utilizing the detector module as the top layers. This approach enables the model to benefit from prior learning and adapt more effectively to the nuances of hand gesture recognition tasks. After training the model, a quantization algorithm is applied using the TFLite framework. This algorithm reduces the floating-point TensorFlow model to 8-bit precision, making it compatible with embedded hardware that exclusively supports 8-bit computations. Final evaluation is performed on several STM32 microcontrollers to investigate inference time. Upon successful completion of the assessments, the validated model is integrated into the OpenMV platform, enabling its deployment to address real-world challenges.
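The quantization step described above can be sketched with the TFLite converter's post-training full-integer workflow. This is a generic sketch, not the thesis code: `model` and `representative_images` are placeholder names, and the exact converter settings used in the thesis may differ.

```python
import numpy as np
import tensorflow as tf

def quantize_to_int8(model, representative_images):
    """Convert a trained float Keras model to a full-int8 TFLite model.

    `representative_images` is a small sample of (preprocessed) training
    data used to calibrate the activation ranges for quantization.
    """
    def representative_dataset():
        for img in representative_images:
            yield [img[np.newaxis].astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Restrict to int8 ops so the model runs on 8-bit-only hardware.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # serialized .tflite model as bytes
```

The resulting byte buffer is what gets flashed alongside the firmware; weights and activations are stored at 8-bit precision, which is why the post-quantization sizes reported later are roughly a quarter of the float model's.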

Figure 2.1 Block diagram of the proposed method for hand gesture recognition.


2.1 Dataset preparation

2.1.1 Data augmentation

Data augmentation is a critical technique in training deep learning models, essential for addressing limited datasets and enhancing model generalization. By artificially expanding the dataset through transformations such as rotation, flipping, and adding noise, data augmentation exposes the model to a broader range of input variations, mitigating overfitting and improving robustness. This is particularly important in scenarios where the amount of original training data is limited or when the model needs to be invariant to certain transformations. Additionally, data augmentation helps models adapt to diverse environmental conditions and input variations encountered in real-world settings, enhancing their reliability and performance [26]. Integrating data augmentation into the training pipeline is essential for developing highly effective and generalized models across various domains and applications.

Image augmentation leverages fundamental principles of image processing to enhance the training dataset for deep learning models. An object may be photographed from different angles and distances, appear in various sizes, or be partially obscured. Scale invariance, where models should recognize objects regardless of size, and viewpoint invariance, which allows models to recognize objects from different angles, are the key principles behind geometric transformations. By simulating real-world variations during training, geometric transformations like rotation, scaling, and translation help the model not only recognize an object regardless of its orientation or scale but also understand the underlying structure of the object. This adaptability is crucial for accurately identifying objects across diverse visual scenarios and is especially beneficial for tasks such as object detection and classification, where perspective and scale have significant effects [27]. Color invariance plays a pivotal role in enhancing model performance by allowing the model to focus on structural and texture features rather than color, which can be highly variable due to lighting conditions or camera settings. This principle is particularly important for tasks like object recognition, where the shape and texture of an object are the more defining characteristics. Implementing color-space transformations, such as adjusting brightness, contrast, and saturation, improves model accuracy and reliability in practical applications [28].

In the field of HGR, the accurate recognition of distinct gestures heavily depends on the hand's shape and structural characteristics. Given the variability inherent in real-life scenarios, where gestures may be captured from diverse angles and positions within the frame, geometric transformations are essential to simulate these conditions within the dataset. These transformations, including translation, rotation, and random zoom, serve to augment the dataset, enabling the model to learn robust representations of hand gestures. Moreover, to reduce the model's sensitivity to variations in lighting conditions during use, a random brightness adjustment was implemented. This method introduces a controlled brightness coefficient, ranging from -0.1 to 0.1, thereby enhancing the model's resilience to environmental factors and contributing to its overall robustness and generalization capabilities. Because the purpose is to help the model learn the necessary features, the augmentation techniques are applied only to the training set and not to the test set, as Figure 2.2 describes. Examples of images after the augmentation process are shown in Figure 2.3.
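The random brightness adjustment described above can be sketched in a few lines of NumPy. This is a minimal illustration of the ±0.1 coefficient, not the thesis pipeline; in practice, frameworks such as Keras provide equivalent built-in layers (e.g. RandomBrightness, RandomRotation, RandomZoom).

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility

def random_brightness(image, max_delta=0.1):
    """Shift all pixels by a brightness offset drawn from [-max_delta, max_delta].

    The image is assumed to be normalized to [0, 1]; the result is clipped
    back into that range, mirroring the +/-0.1 coefficient described above.
    """
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(image + delta, 0.0, 1.0)

img = np.full((4, 4), 0.5)          # a flat mid-gray training sample
augmented = random_brightness(img)
print(float(augmented[0, 0]))       # somewhere in [0.4, 0.6]
```

Because the offset is drawn fresh for every training image, the model repeatedly sees the same gesture under slightly different lighting, which is exactly the invariance the text aims for.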

Figure 2.2 Data augmentation processing.


Figure 2.3 Some samples after data augmentation with Random Zoom, Random Rotation, and Random Brightness.

2.1.2 Normalization

Normalizing an image refers to the process of adjusting the pixel values to conform to a standardized scale or distribution. This typically involves scaling the pixel values to have a mean of zero and a standard deviation of one, or normalizing them to a specific range, such as [0, 1] or [-1, 1]. This process aids convergence during optimization, as it minimizes issues related to features at different scales, thereby facilitating smoother convergence and preventing oscillation. Updating the weight matrix involves adding the gradient error vector, multiplied by a learning rate, via backpropagation; these adjustments are made continuously throughout training. If an input is not normalized before being fed into model training, the learning rate can produce corrections of very different magnitudes across dimensions, potentially overcompensating in one dimension while undercompensating in another. Furthermore, normalization acts as a form of regularization by preventing models from becoming excessively sensitive to certain features, thus helping to prevent overfitting and enhancing generalization performance.

Given an image 𝐼 represented as a matrix of pixel values, where 𝐼𝑖𝑗 denotes the pixel value at row 𝑖 and column 𝑗, Min-Max scaling rescales each pixel to the range [0, 1] as in Equation 2.1:

𝐼𝑖𝑗′ = (𝐼𝑖𝑗 − min(𝐼)) / (max(𝐼) − min(𝐼))    (2.1)
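The Min-Max scaling above applies to a whole image in one vectorized step. A minimal sketch, assuming the input's pixels are not all identical (otherwise the denominator is zero):

```python
import numpy as np

def min_max_normalize(image):
    """Rescale pixel values to the range [0, 1] via Min-Max scaling."""
    i_min, i_max = image.min(), image.max()
    return (image - i_min) / (i_max - i_min)

min_max_normalize(np.array([0.0, 128.0, 255.0]))  # → [0., ~0.502, 1.]
```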


2.2 Proposed architecture

2.2.1 Theoretical basis

2.2.1.1 Machine Learning, Deep Learning & TinyML

Machine Learning (ML) represents a transformative paradigm within the field of artificial intelligence, revolutionizing the way computers learn from data and make decisions. At its core, ML encompasses a diverse set of algorithms and techniques that enable computers to learn patterns and relationships from data without being explicitly programmed. These algorithms iteratively improve their performance over time, allowing the development of predictive models and systems that make decisions and perform tasks autonomously. Deep Learning (DL), a subset of ML, has emerged as a particularly powerful approach, leveraging artificial neural networks with multiple layers of abstraction to extract intricate features and representations from raw data. Unlike traditional ML methods, DL excels at handling large-scale, high-dimensional datasets, enabling sophisticated models that learn complex patterns and relationships directly from raw input data. These raw inputs pass through a series of learnable convolutional and fully connected layers that update their weights without any handcrafted feature-extraction step, as shown in Figure 2.4.

Traditional Deep Learning models focus heavily on robustness but rarely put much emphasis on optimizing computational cost. This makes them infeasible to deploy on edge devices such as mobile phones or microcontrollers. With the current popularity of IoT devices, microcontrollers play an important role in operating many systems around the world. To address the limitations of deep learning on such hardware, TinyML has emerged as a promising field focusing on techniques that help deploy models to devices with limited resources. Lightweight models are embedded directly on these small devices, such as sensors and actuators, to perform specific tasks. Furthermore, the adoption of TinyML brings significant advantages in data transmission efficiency: rather than transmitting all raw data to the central processing unit, terminals can transmit condensed, processed information, reducing the burden on network bandwidth and


enhancing overall system scalability. Besides, by processing data locally before transmission, the privacy of raw data can be effectively protected. Figure 2.5 depicts a comparison between systems with and without TinyML. Using TinyML, data is processed locally before seeking help from edge AI or cloud AI [29].

Figure II.4 Basic Convolutional Neural Network. A typical deep learning model takes an input and transforms it into corresponding outputs; raw image data can be fed into the model without going through any traditional feature-extraction techniques [30]

Figure II.5 Comparison between systems with TinyML (a) and without TinyML (b). TinyML is embedded directly on the microprocessors on the sensor to process data before seeking help from edge AI or cloud AI [29]


2.2.1.2 Standard convolutional layer

Kernel convolution is a fundamental operation utilized not only in Convolutional Neural Networks (CNNs) but also in various other Computer Vision algorithms. It involves applying a small matrix of numbers, referred to as a learnable kernel or filter, over an image and transforming the image based on the values within the filter. These kernels, typically small in spatial dimensionality, extend across the entirety of the input depth. As data traverses a convolutional layer, each filter convolves across the spatial dimensions of the input, generating a 2D activation map. This enables the network to use kernels that trigger upon detecting specific features at distinct spatial positions within the input. These 2D activation maps, also known as feature maps, are calculated according to Equation 2.2:

𝐹(𝑚, 𝑛) = (𝑓 ∗ 𝑘)(𝑚, 𝑛) = Σⱼ Σₕ 𝑘(𝑗, ℎ) 𝑓(𝑚 − 𝑗, 𝑛 − ℎ)    (2.2)

The resulting feature map is denoted by 𝐹 ∈ ℝᵐˣⁿ and the input image by 𝑓, where 𝑗 and ℎ index the rows and columns of the kernel. The learnable kernel 𝑘 of size 𝑘 × 𝑘 passes along the height and width of the input in turn, extracts a submatrix of the same size, multiplies element by element at each position, and sums the products to form one element of the result matrix. The whole process is described in Figure 2.6.
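A naive single-channel version of Equation 2.2 can be written as two nested loops. As in most CNN frameworks, the sketch below actually computes cross-correlation (the kernel is not flipped), which is what convolutional layers implement in practice:

```python
import numpy as np

def conv2d_valid(f, k):
    """Naive single-channel 2D convolution (cross-correlation, as in CNN
    layers): slide k over f, multiply element-wise, and sum each window."""
    fh, fw = f.shape
    kh, kw = k.shape
    out = np.zeros((fh - kh + 1, fw - kw + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(f[m:m + kh, n:n + kw] * k)
    return out

out = conv2d_valid(np.arange(16.0).reshape(4, 4), np.ones((2, 2)))
# out[0, 0] = 0 + 1 + 4 + 5 = 10.0
```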

For inputs with multiple channels, such as RGB images, the kernels need to have the same number of channels as the input. This principle, known as convolution over depth, is an important property that both lets convolution work with color images and allows the use of multiple filters in the same layer. When using multiple filters for a single input, the convolution operation is performed separately for each filter, and the resulting feature maps are concatenated to form the output. For an input 𝑋 ∈ ℝʰˣʰˣᶜ, implementing convolution with kernels of size 𝑘 × 𝑘 × 𝑐 using 𝑛𝑐 filters, we obtain the result in Equation 2.3:


[ℎ, ℎ, 𝑐] ∗ [𝑘, 𝑘, 𝑐] × 𝑛𝑐 = [(ℎ + 2𝑝 − 𝑘)/𝑠 + 1, (ℎ + 2𝑝 − 𝑘)/𝑠 + 1, 𝑛𝑐]    (2.3)

where 𝑝 and 𝑠 denote the padding and stride used in this convolution operation, respectively
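The spatial term of Equation 2.3 gives the output size directly; a one-line helper (the integer division models the floor that frameworks apply):

```python
def conv_output_size(h, k, p=0, s=1):
    """Output spatial dimension: floor((h + 2p - k) / s) + 1 (Eq. 2.3)."""
    return (h + 2 * p - k) // s + 1

conv_output_size(32, 3, p=1, s=1)   # → 32 ("same" padding)
conv_output_size(224, 3, p=1, s=2)  # → 112 (stride-2 downsampling)
```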

The forward propagation process includes the calculation of an intermediate matrix 𝑍 and the application of a non-linear activation function 𝛿. Here the model learns by adjusting its parameters, weights and biases, to produce the appropriate output. This process is described in Equation 2.4:

𝑍[𝑙] = 𝑊[𝑙] ∗ 𝐴[𝑙−1] + 𝑏[𝑙],  𝐴[𝑙] = 𝛿(𝑍[𝑙])    (2.4)

where 𝐴[𝑙−1] is the activation map obtained from layer 𝑙 − 1 and used as input to layer 𝑙

Figure II.6 Kernel convolution operation. The process of multiplying the kernel with each submatrix taken from the input, moving in turn along the height and width of the input

Backpropagation, short for "backward propagation of errors," is a fundamental process used to train CNN models specifically and deep neural networks in general. This process optimizes the model's parameters by iteratively adjusting them to minimize the difference, called the error, between the predicted output and the actual target output. In the backward pass, the error signal is propagated backward


through the network, layer by layer, using the chain rule of calculus. At each layer, the algorithm computes the gradient of the loss function with respect to the parameters of that layer. This gradient indicates the direction and magnitude of the adjustment needed to minimize the error. For instance, for a kernel 𝑘 ∈ ℝᵏˣᵏ with weight matrix 𝜔 ∈ ℝᵏˣᵏ, the update process is described in Figure 2.7 and given by Equation 2.5.

𝜔𝑖 = 𝜔𝑖 − 𝛼 × 𝜕𝐿/𝜕𝜔𝑖    (2.5.1)

𝜕𝐿/𝜕𝜔𝑖 = Σⱼ (𝜕𝐿/𝜕𝑧ⱼ) × (𝜕𝑧ⱼ/𝜕𝜔𝑖)    (2.5.2)

𝜕𝐿/𝜕𝑍 = (𝜕𝐿/𝜕𝑦) × 𝛿′(𝑍)    (2.5.3)

Where 𝛼 is the learning rate, 𝑦 is the predicted result, and 𝛿′ is the derivative of the activation function 𝛿
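The update rule of Equation 2.5 can be illustrated on a toy one-parameter "layer" 𝑦 = 𝑤𝑥 with a squared-error loss; the names and values below are purely illustrative:

```python
# One gradient-descent step on y = w * x with loss L = (y - t)^2.
def sgd_step(w, x, t, alpha=0.1):
    y = w * x                   # forward pass
    dL_dw = 2.0 * (y - t) * x   # chain rule: dL/dy * dy/dw
    return w - alpha * dL_dw    # w <- w - alpha * dL/dw

w = 0.0
for _ in range(50):
    w = sgd_step(w, x=1.0, t=3.0)
# w converges toward t / x = 3.0
```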

Figure II.7 Forward propagation and backward propagation of the convolution operation. Both processes follow the chain rule

2.2.1.3 Depthwise separable convolution

First introduced in the MobileNet paper [9], depthwise separable convolution (DSC) has become an essential component of many lightweight model architectures [10]


[31]. The standard convolutional layer performs a convolution operation on each input channel and combines all filters in a single step. In contrast, DSC divides this process into two distinct layers that operate consecutively. Initially, a depthwise convolutional layer applies an individual filter to each input channel. Subsequently, a convolutional layer with a kernel size of 1 × 1, called pointwise convolution, computes combinations of the input channels to generate fresh feature maps. These processes are described in Figure 2.8.

Given an input 𝐼 of size 𝐻 × 𝑊 × 𝐶, the standard convolutional layer uses a kernel 𝐾 of size 𝐾 × 𝐾 × 𝐶 × 𝑁𝑐 to produce a feature map 𝑂 of size 𝐻 × 𝑊 × 𝑁𝑐. Assuming a stride of one and padding such that the output has the same size as the input, the output feature map can be computed as Equation 2.6:

𝑂(𝑥, 𝑦, 𝑛) = Σᵢ Σⱼ Σ𝑐 𝐾(𝑖, 𝑗, 𝑐, 𝑛) × 𝐼(𝑥 + 𝑖 − 1, 𝑦 + 𝑗 − 1, 𝑐)    (2.6)

where 𝐻 and 𝑊 are the spatial height and width of the input feature map, respectively, and 𝐶 is the number of its channels. For the output feature map, 𝑁𝑐 is its depth, which also corresponds to the number of filters used in this convolutional layer, and 𝐾 × 𝐾 are the horizontal and vertical dimensions of a square kernel. From there we can calculate the computational cost by multiplying these parameters together, as Equation 2.7 below:

𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾 × 𝑁𝑐    (2.7)

Meanwhile, with the same input 𝐼 as above, the DSC layer is divided into two separate tasks. First, the depthwise convolution layer uses 𝐶 filters with kernels 𝐾̂ ∈ ℝᴷˣᴷ, corresponding to the number of input channels. The 𝑐-th filter is applied to the 𝑐-th channel of the input to produce the 𝑐-th channel of the filtered output feature map 𝑂̂, as described by Equation 2.8:

𝑂̂(𝑥, 𝑦, 𝑐) = Σᵢ Σⱼ 𝐾̂(𝑖, 𝑗, 𝑐) × 𝐼(𝑥 + 𝑖 − 1, 𝑦 + 𝑗 − 1, 𝑐)    (2.8)


Each depthwise filter works separately on a single channel to create a new feature map. This makes the computational cost 𝑁𝑐 times smaller; it can be calculated as Equation 2.9:

𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾    (2.9)

After that, the pointwise operator is applied to the feature maps obtained by the depthwise step, combining across all filters to generate the final output features. Therefore, the computational cost of the entire DSC is given by Equation 2.10:

𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾 + 𝐶 × 𝑁𝑐 × 𝐻 × 𝑊    (2.10)

Comparing the two types of layers, we obtain the reduction in computational cost given by Equation 2.11:

(𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾 + 𝐶 × 𝑁𝑐 × 𝐻 × 𝑊) / (𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾 × 𝑁𝑐) = 1/𝑁𝑐 + 1/𝐾²    (2.11)
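Equations 2.7 through 2.11 can be checked numerically; the layer sizes below are illustrative (a 112×112×32 feature map with 3×3 kernels and 64 output filters):

```python
def standard_conv_cost(h, w, c, k, n_c):
    """Multiply-accumulate count of a standard convolution (Eq. 2.7)."""
    return h * w * c * k * k * n_c

def dsc_cost(h, w, c, k, n_c):
    """Depthwise cost (Eq. 2.9) plus pointwise cost, i.e. Eq. 2.10."""
    return h * w * c * k * k + c * n_c * h * w

std = standard_conv_cost(112, 112, 32, 3, 64)
dsc = dsc_cost(112, 112, 32, 3, 64)
ratio = dsc / std  # equals 1/n_c + 1/k^2 (Eq. 2.11), ≈ 0.127 here
```

With a 3×3 kernel the reduction is dominated by the 1/𝐾² term, which is why MobileNet-style layers cost roughly 8 to 9 times less than their standard counterparts.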

2.2.1.4 Pooling layer

The pooling layer, also known as a down-sampling layer, is an essential component of convolutional neural networks (CNNs) used in Deep Learning. It is responsible for reducing the spatial dimensions of the input data, in terms of width and height, while retaining the most important information.

Pooling layers divide the input data into small regions, called pooling windows or receptive fields, and perform an aggregation, such as taking the maximum or average value of each window. This aggregation reduces the size of the feature maps, resulting in a compressed representation of the input data.

The pooling process involves the following three steps:

1. Divide the input data into non-overlapping regions or windows.

2. Apply an aggregation function, such as max pooling or average pooling, on each window to obtain a single value.

3. Combine the values obtained from each window to create a down-sampled representation of the input data.


For a feature map of dimensions 𝑛ℎ × 𝑛𝑤 × 𝑛𝑐, the output obtained after a pooling layer has dimensions ((𝑛ℎ − 𝑓)/𝑠 + 1) × ((𝑛𝑤 − 𝑓)/𝑠 + 1) × 𝑛𝑐, where 𝑛ℎ and 𝑛𝑤 are the height and width of the feature map, 𝑛𝑐 is its number of channels, 𝑓 is the filter size, and 𝑠 is the stride used in this pooling layer.
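The three pooling steps above reduce to a few lines for a single-channel map; this sketch uses max pooling with an 𝑓 × 𝑓 window and stride 𝑠:

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Naive max pooling of a single-channel feature map: take the maximum
    of each f x f window, stepping by s pixels."""
    nh, nw = x.shape
    oh, ow = (nh - f) // s + 1, (nw - f) // s + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 2., 1., 0.],
              [5., 6., 3., 4.]])
max_pool2d(x)  # → [[4., 8.], [9., 4.]]
```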

Figure II.8 Operation of a standard convolutional layer (a) replaced by depthwise separable convolution with two separate layers: a depthwise layer (b) and a pointwise layer (c) [9]


There are several types of pooling layers: max pooling, min pooling, average pooling, and global pooling. The features in a region are summarized by the maximum, minimum, or average value of that region. Max, min, and average pooling smooth the harsh edges of a picture and are used when such edges are not important.

With global pooling, each channel in the feature map is reduced to just one value. The value depends on the type of global pooling, which can be any one of the previously explained types.

2.2.1.5 Batch Normalization

Batch normalization is a technique used to improve the performance of a deep learning network by first subtracting the batch mean from the activations and then dividing by the batch standard deviation.

During training, the activations of a layer are normalized for each mini-batch of data using Equation 2.12:

- Batch mean: 𝜇𝐵 = (1/𝑚) Σᵢ₌₁ᵐ 𝑥𝑖    (2.12.1)

- Batch variance: 𝜎𝐵² = (1/𝑚) Σᵢ₌₁ᵐ (𝑥𝑖 − 𝜇𝐵)²    (2.12.2)

- Normalized activations: 𝑥̂𝑖 = (𝑥𝑖 − 𝜇𝐵) / √(𝜎𝐵² + 𝜖)    (2.12.3)

- Scaled and shifted output: 𝑦𝑖 = 𝛾𝑥̂𝑖 + 𝛽    (2.12.4)

where 𝑚 is the mini-batch size, 𝜖 is a small constant for numerical stability, and 𝛾, 𝛽 are learnable scale and shift parameters
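The steps of Equation 2.12 map directly to NumPy; 𝛾 and 𝛽 are the learnable parameters, shown here with their usual initial values:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch (rows = samples): subtract the batch mean,
    divide by the batch standard deviation, then scale and shift."""
    mu = x.mean(axis=0)                    # batch mean
    var = x.var(axis=0)                    # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # scale and shift
```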


2.2.1.6 Activation

The H-swish activation function, also known as Hard Swish, is based on Swish but replaces the computationally expensive sigmoid with a piecewise linear analogue. H-swish takes one input tensor and produces an output tensor in which the hard version of the swish function is applied element-wise. It is defined as Equation 2.14:

hswish(𝑥𝑖) = 𝑥𝑖 × ReLU6(𝑥𝑖 + 3) / 6    (2.14)

where 𝑥𝑖 is the 𝑖-th slice in the given dimension of the input tensor.
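The hard-swish formula above is cheap to evaluate because it needs only comparisons and a multiply, which suits microcontroller targets; a NumPy sketch:

```python
import numpy as np

def relu6(x):
    """ReLU capped at 6: min(max(x, 0), 6)."""
    return np.minimum(np.maximum(x, 0.0), 6.0)

def hswish(x):
    """Hard swish: x * ReLU6(x + 3) / 6, applied element-wise."""
    return x * relu6(x + 3.0) / 6.0

hswish(np.array([-4.0, 0.0, 3.0]))  # → [0., 0., 3.]
```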

SoftMax is an activation function that transforms the raw outputs of the neural network into a vector of probabilities, essentially a probability distribution over the input classes. Equation 2.15 of the SoftMax function is given as follows:

softmax(𝑧𝑖) = 𝑒^𝑧𝑖 / Σⱼ 𝑒^𝑧ⱼ    (2.15)
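A numerically stable implementation of Equation 2.15 subtracts the maximum logit before exponentiating; this leaves the result unchanged but avoids overflow for large logits:

```python
import numpy as np

def softmax(z):
    """Turn raw logits into a probability distribution (Eq. 2.15),
    shifted by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```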

2.2.1.7 Squeeze and Excitation

The Squeeze-and-Excitation (SE) block is an architectural block that allows a network to dynamically undertake channel-wise feature recalibration, hence increasing its representational power [23]. The detailed process of the SE block is shown in Figure 2.9.

Squeeze-and-Excitation Networks (SENets) introduce a building block for CNNs that improves channel interdependence at almost no computational cost. To achieve adaptive channel weighting, the SE block works in three phases: a squeeze phase, an excitation phase, and a scale-and-combine phase.
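The three phases can be sketched in NumPy. The weight matrices `w1` and `w2` are hypothetical stand-ins for the two small fully connected layers of [23], with an assumed reduction ratio of 4:

```python
import numpy as np

def se_block(feature_map, w1, w2):
    """Sketch of a Squeeze-and-Excitation block on an H x W x C map.
    w1: (C, C//r) and w2: (C//r, C) are hypothetical learned weights."""
    # Squeeze: global average pooling per channel -> vector of C values
    z = feature_map.mean(axis=(0, 1))
    # Excitation: bottleneck FC layers, ReLU then sigmoid gate in (0, 1)
    s = np.maximum(z @ w1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))
    # Scale and combine: reweight each input channel by its gate
    return feature_map * s

fm = np.ones((4, 4, 8))
w1 = np.full((8, 2), 0.1)  # hypothetical weights, reduction ratio r = 4
w2 = np.full((2, 8), 0.1)
out = se_block(fm, w1, w2)  # same shape as fm, channels rescaled
```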

