DEVELOPING A GESTURE RECOGNITION SYSTEM USING CAMERAS BASED ON LIGHTWEIGHT MACHINE LEARNING TECHNOLOGY (TINYML) OPERATING ON A MICROCONTROLLER
Student: Trần Mạch Tuấn Kiệt
Student ID: 19010214
INTRODUCTION
Introduction
With the development and increasing popularity of computers in daily life, bridging the communication gap between humans and machines has become a vital area of research. While conventional input methods such as keyboards and mice remain prevalent, they are not the most natural way for people to interact. Furthermore, as technologies such as virtual reality, augmented reality, and remote control gain traction, traditional input devices often prove unsuitable. Considering these constraints, hand gesture recognition (HGR) emerges as a promising alternative, offering a more intuitive and natural means of interaction between humans and technology.
Hand gesture recognition, a pivotal component of human-computer interaction, has emerged as a transformative technology that enables users to communicate with computers through intuitive hand movements, eliminating the need for traditional input devices such as keyboards and mice and offering a more natural and immersive interaction experience. HGR aims to develop methods that detect and recognize human hand gestures and translate them into commands [1]. To achieve this objective, several techniques are employed to collect data for recognition, broadly categorized into two main approaches [2]. The first approach is sensor-based, wherein sensors attached directly to the user's hand or arm capture various data types such as shape, position, movement, or trajectory.
Commonly used devices today are gloves equipped with accelerometers, position sensors, and similar components, or systems that employ electromyography (EMG) to detect muscle electrical impulses and decode these signals into specific finger movements.
More advanced approaches are vision-based, using one or more types of cameras to capture hand motion or gesture appearance. While this method is simple to set up, it must handle varying conditions such as changing lighting and complex backgrounds [3]. However, by leveraging computer vision techniques and deep learning models, hand gesture recognition systems can interpret and classify intricate hand poses and motions in real time, enabling seamless interaction with digital interfaces and environments. This approach has gradually become the dominant direction in recent years due to the development of vision-based applications and the trend toward touchless control in fields such as gaming, electronic device control, and virtual reality [4]. In this study, we consider only vision-based methods.
Hand gesture recognition currently has a multitude of applications across various domains, from industrial environments to daily life. Potential fields of application encompass gaming, offering enhanced user interaction experiences, as well as touchless control systems, which streamline device interaction without physical contact. In the field of accessibility, HGR holds promise for facilitating communication through sign language translation, thereby helping hearing-impaired people integrate more easily. Moreover, within professional environments, HGR can be used to control robotic systems or to assist communication with people with disabilities during medical tasks [5]. HGR also helps people with disabilities, the elderly, and children access computers more quickly, conveniently, and enjoyably. In this era of computing, where devices are increasingly integrated into our daily lives, the ability to interact with technology effortlessly and intuitively has become essential. Hand gesture recognition stands at the forefront of this revolution, promising to redefine the way we interact with computers, devices, and the digital world at large.
Related work
Due to decades of study in vision-based hand gesture recognition, early efforts achieved good outcomes by leveraging image characteristics in image processing. For instance, color and edge details are two of the most frequently used attributes for identifying and discerning specific gestures. Color detection principally targets isolating the skin tones of the hand against the environment, while edge detection facilitates extraction of the hand region as the figure of interest from the surroundings based on discontinuities in pixel intensities. The method in [6] processed webcam images acquired through a low-cost camera in a multistep process. First, Jayashree et al. applied a gray-threshold technique together with median and Gaussian filters to remove noise and transform the original RGB images into denoised binary images. Following this preprocessing stage, a Sobel edge detection algorithm was used to extract the region of interest. Finally, using a Euclidean-distance feature matching methodology, the authors quantified the similarity between the centroid and area of the extracted edges in the test versus training sets.
The proposed method was evaluated on a dataset consisting of the American Sign Language (ASL) alphabet, which contained 26 static hand gestures corresponding to the letters A to Z. On this test set, the method achieved a recognition rate of 90.19%. Following the success of state-of-the-art deep learning models in image-related tasks, the image processing domain has benefited greatly from deep learning [7] [8] [9] [10] [11]. Rather than completely eliminating traditional vision techniques, a hybrid approach using a Dual-Channel Convolutional Neural Network (DC-CNN) fuses the hand gesture images and hand edge images obtained after preprocessing with Canny edge detection [12]. The output is classified using a SoftMax classifier, and each of the two CNN channels has its own weights. The proposed system's recognition rate is 98.02%. However, the performance of these methods is limited by how well the handcrafted features represent the characteristics of the hand. Researchers recognized the superior representational power of learned features over hand-engineered extraction and progressively transitioned toward end-to-end learning directly from pixels. Hussain et al. [13] used two parallel CNNs, each a state-of-the-art architecture - Inception V3 and EfficientNet B0 - that had achieved notable performance on various image-related tasks. Both models were trained on the same RGB images of recorded hand gestures and were evaluated on the ASL dataset, yielding accuracies of 90% and 99%, respectively. Improvements in sensor technology have brought new approaches that leverage depth image data captured by devices such as the Kinect and Intel RealSense. The method in [14] uses two VGG19 networks with the same architecture but different input types. Specifically, VGG19-v1 was fed RGB images to extract skin-color maps, while VGG19-v2 took depth images as input to learn depth-based information. By combining the two streams of information, the authors achieved classification accuracy as high as 94.8% on the ASL dataset. More advanced deep learning models aim to combine multi-scale or multi-level features to enhance network learnability. The method in [15] uses an innovative two-stage approach. First, a deep learning model with a lightweight encoder-decoder structure transforms the RGB images into segmented images; this structure is based on dilated residual blocks and uses atrous spatial pyramid pooling as a multi-scale extractor. After obtaining the desired segmentation, a double-channel CNN is fed the input RGB images and the corresponding segmented images to learn the necessary features separately. This method achieved 91.17% on the OUHANDS dataset with a model size of only 1.8 MB. To do this, the model takes advantage of depthwise separable convolution (DSC) layers in building the encoder-decoder and CNN structures. Recent methods reframe the problem from classification to detection to achieve better results while still maintaining a compact structure. Recent works such as [16] and [17] have improved the YOLO architecture by replacing the original backbone and neck components with lightweight modules. In particular, the approach described in [16], which uses ShuffleNetV2 as the backbone in YOLOv3, achieved impressive results on two challenging datasets with complex backgrounds, reaching 99.5% on the senz3D dataset and 99.6% on the Microsoft Kinect dataset. Significantly, the model size was only 8.9 MB, compared to 123.5 MB for the original YOLOv3 network.
Although many studies have been carried out, they focus heavily on improving model accuracy and pay little attention to computational cost. This poses challenges for executing such models on low-cost, resource-constrained hardware such as microcontrollers with limited memory capacity and computational speed, resulting in significant inference time delays.
Motivation
The proliferation of IoT devices using always-on microcontrollers is experiencing exponential growth, with a staggering 250 billion devices reported [18].
This surge in adoption opens endless possibilities for applications such as smart manufacturing, personalized healthcare, precision agriculture, automated retail, and UAV applications, among others. The appeal of these low-cost, energy-efficient microcontrollers lies in their potential to facilitate a new frontier in technology known as tiny machine learning (TinyML) [19]. By deploying deep learning models directly on these compact devices, data analytics can be performed near the sensor, leading to a substantial expansion in the realm of Artificial Intelligence (AI) applications. Utilizing deep learning models on microcontrollers allows intelligent tasks to be performed locally, leading to improved performance, privacy, security, and energy efficiency. The integration of deep learning on such tiny platforms presents an exciting opportunity to revolutionize the field of AI and further democratize its capabilities. Nevertheless, integrating deep learning models with microcontrollers poses significant challenges due to the constrained resources available on these devices [20]. Limited memory capacity restricts the size of the model that can be deployed, while processing power limitations affect the speed and efficiency of model execution. Additionally, the limited battery life of microcontrollers necessitates energy-efficient algorithms to ensure prolonged operation without rapidly draining the power source. Overcoming these challenges is crucial to fully leverage the potential of deep learning on microcontrollers and enable the deployment of intelligent applications in resource-constrained environments.
The widespread adoption of microcontrollers offers an opportunity to deploy HGR systems on these hardware platforms, thereby enhancing flexibility and expanding the application scope of such systems. Furthermore, leveraging resource-constrained devices enables improved optimization in terms of energy consumption, construction, and operational costs.
Problems and Research methods
Considering the above limitations, this research proposes a micro convolutional neural network (micro-CNN) architecture based on TinyML to identify the morphology of hand gestures. From this premise, we have developed a comprehensive HGR system tailored for controlling a non-destructive testing (NDT) system to conduct basic experiments. This marks the first step in investigating the practicality of the method in real-world environments. Our focus lies in constructing a lightweight deep learning model adept at recognizing gestures from RGB images captured by conventional cameras. By emphasizing accurate results from a single input frame, our model significantly enhances inference time efficiency. The qualified model is integrated into an ARM Cortex-M7 processor used on the OpenMV H7 platform. Additionally, we established a framework facilitating motor control through LabVIEW software.
Figure 1.1 illustrates the workflow of this study. Furthermore, our approach is evaluated and compared with state-of-the-art models presented in prior research.
Article structure
The following sections of this document are presented in order: Chapter II describes our proposed method in detail, encompassing the processes of data preprocessing, augmentation, and the proposed model architecture. Chapter III provides a comprehensive evaluation of the results, including in-depth analysis and comparison against preceding methods; it also describes the benchmarking of our model's performance on a variety of microcontrollers from the STM32 family. The application of this proposal to build a practical system is presented in Chapter IV, which also describes in detail the process of building a framework integrated into the NDT system. Further experimental results and a demo are discussed in Chapter V. The conclusions are summarized and presented in the final Chapter VI.
Figure I.1 Workflow of the entire research for hand gesture recognition (HGR)
Part 1 involves the process of building a deep learning model. After obtaining the model and successfully deploying it on OpenMV, Part 2 describes the process of building a control system to integrate the HGR model into real applications.
PROPOSED MODEL ARCHITECTURE
Dataset preparation
Data augmentation is a critical technique in training deep learning models, essential for addressing limited datasets and enhancing model generalization. By artificially expanding the dataset through transformations such as rotation, flipping, and adding noise, data augmentation exposes the model to a broader range of input variations, mitigating overfitting and improving robustness. This is particularly important when the amount of original training data is limited or when the model needs to be invariant to certain transformations. Additionally, data augmentation helps models adapt to the diverse environmental conditions and input variations encountered in real-world settings, enhancing their reliability and performance [26]. Integrating data augmentation into the training pipeline is therefore essential for developing highly effective and generalized models across various domains and applications.
Image augmentation leverages fundamental principles of image processing to enhance the training dataset for deep learning models. An object may be photographed from different angles and distances, appear in various sizes, or be partially obscured. Scale invariance, where models should recognize objects regardless of size, and viewpoint invariance, which allows models to recognize objects from different angles, are the key principles underlying geometric transformations.
By simulating real-world variations during training, geometric transformations such as rotation, scaling, and translation help the model not only recognize an object regardless of its orientation or scale but also understand its underlying structure. This adaptability is crucial for accurately identifying objects across diverse visual scenarios and is especially beneficial for tasks such as object detection and classification, where perspective and scale have a significant effect [27]. Color invariance plays a pivotal role in enhancing model performance by allowing the model to focus on structural and texture features rather than color, which can be highly variable due to lighting conditions or camera settings. This principle is particularly important for tasks like object recognition, where the shape and texture of an object are its more defining characteristics. Implementing color space transformations, such as adjusting brightness, contrast, and saturation, improves model accuracy and reliability in practical applications [28].
In the field of HGR, the accurate recognition of distinct gestures heavily depends on the hand's shape and structural characteristics. Given the variability inherent in real-life scenarios, where gestures may be captured from diverse angles and positions within the frame, geometric transformations are essential to simulate these conditions within the dataset. These transformations, including translation, rotation, and random zoom, serve to augment the dataset, enabling the model to learn robust representations of hand gestures. Moreover, to reduce the model's sensitivity to variations in lighting conditions during usage, a random brightness adjustment was implemented, applying a controlled brightness coefficient ranging from -0.1 to 0.1, thereby enhancing the model's resilience to environmental factors and contributing to its overall robustness and generalization. Because the purpose is to help the model learn the necessary features, the augmentation techniques are applied only to the training set and not to the test set, as described in Figure 2.2. Examples of images after the augmentation process are shown in Figure 2.3, and a sketch of such a pipeline is given below.
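The following is a minimal sketch of such an augmentation pipeline using tf.keras preprocessing layers; apart from the ±0.1 brightness coefficient stated above, the layer choices and ranges are illustrative assumptions rather than the exact configuration used in this work.

```python
import tensorflow as tf

# Sketch of the training-time augmentation pipeline; ranges other than the
# ±0.1 brightness factor are illustrative assumptions.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomTranslation(0.1, 0.1),                # shift the hand inside the frame
    tf.keras.layers.RandomRotation(0.05),                       # small random rotations
    tf.keras.layers.RandomZoom(0.2),                            # random zoom in/out
    tf.keras.layers.RandomBrightness(0.1, value_range=(0, 1)),  # brightness coefficient in [-0.1, 0.1]
])

def augment_train(image, label):
    # Applied only to the training split; validation/test images are left untouched.
    return augment(image, training=True), label

# Usage (assumed tf.data pipeline of (image, label) pairs scaled to [0, 1]):
# train_ds = train_ds.map(augment_train, num_parallel_calls=tf.data.AUTOTUNE)
```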
Figure II.2 Data augmentation processing
Figure II.3 Some samples after data augmentation with Random Zoom, Random
Normalizing an image refers to the process of adjusting the pixel values to conform to a standardized scale or distribution. This typically involves scaling the pixel values to have a mean of zero and a standard deviation of one, or normalizing them to a specific range such as [0, 1] or [-1, 1]. This process aids convergence during optimization, as it minimizes issues related to features being on different scales, thereby facilitating smoother convergence and preventing oscillation. The weight matrix is updated via backpropagation by adding the gradient error vector multiplied by a learning rate, and these adjustments are made continuously throughout the training process. If an input is not normalized before being fed into model training, the learning rate can produce corrections of very different magnitudes across dimensions, potentially overcompensating in one dimension while undercompensating in another. Furthermore, normalization acts as a form of regularization by preventing the model from becoming overly sensitive to certain features, thus helping to prevent overfitting and enhancing generalization performance.
Given an image I represented as a matrix of pixel values, where I_ij denotes the pixel value at row i and column j, Min-Max scaling rescales the pixel values to the range [0, 1] as shown in Equation 2.1 below:
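In its standard form, Min-Max scaling computes:

$$ I'_{ij} = \frac{I_{ij} - \min(I)}{\max(I) - \min(I)} \tag{2.1} $$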
Proposed architecture
2.2.1 Theoretical basis
2.2.1.1 Machine Learning, Deep Learning & TinyML
Machine Learning (ML) represents a transformative paradigm within the field of artificial intelligence, revolutionizing the way computers learn from data and make decisions. At its core, ML encompasses a diverse set of algorithms and techniques that enable computers to learn patterns and relationships from data without being explicitly programmed. These algorithms are capable of iteratively improving their performance over time, allowing for the development of predictive models and systems that can make decisions and perform tasks autonomously.
Deep Learning (DL), a subset of ML, has emerged as a particularly powerful approach, leveraging artificial neural networks with multiple layers of abstraction to extract intricate features and representations from raw data. Unlike traditional ML methods, DL excels at handling large-scale, high-dimensional datasets, enabling the development of sophisticated models capable of learning complex patterns and relationships directly from raw input data. These raw inputs are passed through a series of learnable convolutional and fully connected layers that update their weights, without requiring any handcrafted feature extraction, as shown in Figure 2.4.
Traditional deep learning models focus heavily on robustness but rarely place much emphasis on optimizing computational cost. This makes them infeasible to deploy on edge devices such as mobile phones or microcontrollers. With the current popularity of IoT devices, microcontrollers play an important role in operating many systems around the world. To address the limitations of deep learning, TinyML has emerged as a promising field focusing on techniques that help deploy models on devices with limited hardware. Lightweight models are embedded directly on these small devices, such as sensors and actuators, to perform specific tasks. Furthermore, the adoption of TinyML techniques brings significant advantages in data transmission efficiency: rather than transmitting all raw data to a central processing unit, terminals can transmit condensed, processed information, reducing the burden on network bandwidth and enhancing overall system scalability. Besides, by processing data locally before transmission, the privacy of the raw data can be effectively protected. Figure 2.5 depicts a comparison between systems with and without TinyML. Using TinyML, data is processed locally before seeking help from edge AI or cloud AI [29].
Figure II.4 Basic Convolutional Neural Network. A typical deep learning model can take an input and transform it into corresponding outputs. Raw image data can be fed into the model without going through any traditional data extraction techniques [30].
Figure II.5 Comparison between systems with TinyML (a) and without TinyML (b). TinyML is embedded directly on the microprocessors on the sensor to process data before seeking help from edge AI or cloud AI [29].
Kernel convolution is a fundamental operation utilized not only in Convolutional Neural Networks (CNNs) but also in various other computer vision algorithms. It involves applying a small matrix of numbers, referred to as a learnable kernel or filter, over an image and transforming it based on the values within the filter. These kernels, typically small in spatial dimensionality, extend across the entire depth of the input. As data traverses a convolutional layer, each filter convolves across the spatial dimensions of the input, generating a 2D activation map. This process enables the network to use kernels that respond when specific features are detected at distinct spatial positions within the input. These 2D activation maps, also known as feature maps, can be calculated according to the following Equation 2.2:
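A standard form of this computation, written with the notation defined below, is:

$$ F_{m,n} = \sum_{i=1}^{k}\sum_{j=1}^{k} k_{i,j}\, f_{m+i-1,\; n+j-1} \tag{2.2} $$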
The resulting feature map is denoted by F ∈ ℝ^(m×n), and the input image is denoted by f ∈ ℝ^(j×h), where j and h represent the number of rows and columns, respectively. The learnable kernel k of size k × k passes over the length and width of the input in turn, extracts a submatrix of the same size from the input, multiplies the elements position-wise, and sums them to form one element of the result matrix. The whole process is described in Figure 2.6.
For inputs with multiple channels, such as RGB images, the kernels must have the same number of channels as the input. This principle, known as convolution over depth, is an important property that both allows convolution to work with color images and permits the use of multiple filters in the same layer. When multiple filters are used on a single input, the convolution operation is performed separately for each filter, and the resulting feature maps are then concatenated to form the output. For an input X ∈ ℝ^(h×h×c), applying convolution with kernels of size k × k × c using n_c filters, we obtain the result given in Equation 2.3:
$$ \dim(O) = \left[\frac{h + 2p - k}{s} + 1,\;\; \frac{h + 2p - k}{s} + 1,\;\; n_c\right] \tag{2.3} $$
where p and s describe the padding and stride used in this convolution operation, respectively.
The forward propagation process includes the calculation of an intermediate matrix Z and the application of a non-linear activation function δ. Here the model learns by adjusting its parameters, including weights and biases, to produce the appropriate output. This process is described in Equation 2.4:
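A standard form of this two-step computation for layer l is:

$$ Z^{[l]} = W^{[l]} * A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = \delta\!\left(Z^{[l]}\right) \tag{2.4} $$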
where A^[l−1] is the activation map obtained from layer l − 1 and used as input to layer l.
Figure II.6 Kernel convolution operation. The process of multiplying the kernel with each submatrix taken from the input, in turn, along the length and width of the input.
Backpropagation, short for "backward propagation of errors," is a fundamental process used in training CNN models specifically and deep neural networks in general. This process optimizes the model's parameters by iteratively adjusting them to minimize the difference, called the error, between the predicted output and the actual target output. In the backward pass, the error signal is propagated backward through the network, layer by layer, using the chain rule of calculus. At each layer, the algorithm computes the gradient of the loss function with respect to the parameters of that layer. This gradient indicates the direction and magnitude of the adjustment needed to minimize the error. For instance, for a kernel k ∈ ℝ^(k×k) with weight matrix ω ∈ ℝ^(k×k), the update process is illustrated in Figure 2.7 and follows Equation 2.5:
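One common form of this update, written with the chain rule and the symbols defined below, is:

$$ \omega \leftarrow \omega - \alpha\,\frac{\partial L}{\partial \omega}, \qquad \frac{\partial L}{\partial \omega} = \frac{\partial L}{\partial y}\;\delta'(Z)\;\frac{\partial Z}{\partial \omega} \tag{2.5} $$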
where α is the learning rate, y is the predicted result, and δ′ is the derivative of the activation function δ.
Figure II.7 Forward propagation and backward propagation of the convolution operation. Both processes follow the chain rule.
First introduced in the MobileNet paper [9], depthwise separable convolution has become an essential component of many lightweight model architectures [10]
[31]. The standard convolutional layer performs a convolution operation on each input channel and combines all filters in a single step. In contrast, DSC divides this process into two distinct layers that operate consecutively. First, a depthwise convolutional layer applies an individual filter to each input channel. Subsequently, a convolutional layer with a kernel size of 1×1, called pointwise convolution, computes a combination of the input channels to generate new feature maps. The process is illustrated in Figure 2.8.
Given an input I of size H × W × C, the standard convolutional layer uses a kernel K of size K × K × C × N_c to produce a feature map O of size H × W × N_c. Assuming a stride of one and padding that keeps the output the same size as the input, the output feature map can be computed as in Equation 2.6:
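A standard expression for this operation is:

$$ O_{x,y,n} = \sum_{i=1}^{K}\sum_{j=1}^{K}\sum_{c=1}^{C} K_{i,j,c,n}\; I_{x+i-1,\; y+j-1,\; c} \tag{2.6} $$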
where H and W are the spatial height and width of the input feature map, respectively, and C is the number of input channels. For the output feature map, N_c is the output depth, which also corresponds to the number of filters used in this convolutional layer, and K × K are the horizontal and vertical dimensions of the square kernel. From these we can calculate the computational cost by multiplying these parameters together, as in Equation 2.7 below:
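Multiplying these parameters gives the cost of the standard convolution:

$$ \mathrm{Cost}_{\mathrm{std}} = K \cdot K \cdot C \cdot N_c \cdot H \cdot W \tag{2.7} $$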
Meanwhile, with the same input I as above, the DSC layer divides the work into two separate tasks. First, the depthwise convolution layer uses C filters with kernel K̂ ∈ ℝ^(K×K×1), applying one filter to each input channel; a pointwise 1 × 1 convolution then combines the resulting channels.
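Following the depthwise separable convolution cost analysis of MobileNet [9], a sketch of the combined cost of the two steps and its ratio to the standard convolution is:

$$ \mathrm{Cost}_{\mathrm{DSC}} = K \cdot K \cdot C \cdot H \cdot W + C \cdot N_c \cdot H \cdot W, \qquad \frac{\mathrm{Cost}_{\mathrm{DSC}}}{\mathrm{Cost}_{\mathrm{std}}} = \frac{1}{N_c} + \frac{1}{K^2} $$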
Training process
As mentioned above, the training process is divided into two parts. Because each hand gesture dataset usually does not have many samples, we use the ASL set for pre-training, as this dataset has nearly 100,000 samples. The obtained model was benchmarked on STM32 devices before transfer learning was performed with the FOMO technique on the OUHANDS and self-collected datasets.
For multi-class classification tasks, we utilize the categorical cross-entropy loss function, which quantifies the dissimilarity between the predicted probabilities and the true categorical labels. The categorical cross-entropy loss is represented as Equation 2.23:
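In its standard form, with C classes, y the one-hot true label, and ŷ the predicted probability vector, the loss is:

$$ L_{CE} = -\sum_{i=1}^{C} y_i \log \hat{y}_i \tag{2.23} $$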
The lower the loss, the more accurate the model. By minimizing the loss, the model learns to assign higher probabilities to the correct class, improving accuracy. To optimize it, we employ the Adam optimizer, an adaptive gradient descent algorithm based on gradient magnitude.
The Adam optimizer combines two gradient descent methodologies: Momentum and Root Mean Square Propagation (RMSP). This method is efficient when working with large problems involving many data points or parameters, and it requires relatively little memory. The core of Adam's algorithm lies in its computation of adaptive learning rates for each parameter.
Momentum accelerates gradient descent by taking into consideration the "exponentially weighted average" of the gradients. Using averages makes the algorithm converge towards the minimum at a faster pace. Momentum can be calculated as in Equation 2.24:
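A standard form of the momentum update, using the symbols defined below, is:

$$ m_t = \beta\, m_{t-1} + (1-\beta)\,\frac{\partial L}{\partial w_t}, \qquad w_{t+1} = w_t - \alpha\, m_t \tag{2.24} $$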
where m_t is the aggregate of gradients at time t (current), m_(t−1) is the aggregate of gradients at time t − 1 (previous), w_t is the weights at time t, w_(t+1) is the weights at time t + 1, α is the learning rate, ∂L is the derivative of the loss function, ∂w_t is the derivative of the weights at time t, and β is the moving average parameter.
Root Mean Square Propagation (RMSP) is an adaptive learning algorithm that attempts to improve on AdaGrad. It uses an exponential moving average rather than the cumulative sum of squared gradients that AdaGrad uses, as described in Equation 2.25:
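A standard form of the RMSP accumulator and update, using the symbols defined below, is:

$$ v_t = \beta\, v_{t-1} + (1-\beta)\left(\frac{\partial L}{\partial w_t}\right)^{2}, \qquad w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t} + \varepsilon}\,\frac{\partial L}{\partial w_t} \tag{2.25} $$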
where v_t is the exponentially weighted sum of the squares of past gradients and ε is a small positive constant. The Adam optimizer builds on the advantages of the two earlier techniques to produce a more optimal gradient descent. Combining the above two equations, we obtain the mathematical formula of this optimizer as Equation 2.26:
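Combining the two moment estimates (bias-correction terms omitted for brevity), a standard form of the Adam update is:

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\frac{\partial L}{\partial w_t}, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\left(\frac{\partial L}{\partial w_t}\right)^{2}, \quad w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t} + \varepsilon}\, m_t \tag{2.26} $$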
where β1 and β2 are the decay rates of the gradient averages in the two methods above.
The rate of gradient descent is controlled so that there is minimal oscillation when it reaches the global minimum, while the steps taken (step size) remain large enough to pass local minima along the way. Combining the features of the above methods therefore allows the optimizer to reach the global minimum efficiently.
The initial learning rate is set to 0.01, and we apply a learning rate decay strategy, reducing the rate by a factor of 0.2 when the validation loss does not decrease for 13 consecutive epochs, as listed in Table 2.3. During the training process, we conducted experiments with two different expansion factors on the bottleneck layers in two directions: the traditional residual block and the inverted residual block. The model is trained on an NVIDIA RTX A5000 high-performance computer (HPC) for 100 epochs with a batch size of 64 samples. A sketch of this configuration is given below.
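The following is a minimal sketch of this training configuration in tf.keras; the stand-in model and data names are illustrative assumptions, while the optimizer, learning rate schedule, epochs, and batch size follow the values stated above.

```python
import tensorflow as tf

# Stand-in model: the actual micro-CNN with SE blocks is described in Section 2.2;
# this placeholder only illustrates the training configuration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(96, 96, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(26, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),  # initial learning rate 0.01
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Reduce the learning rate by a factor of 0.2 when the validation loss
# has not decreased for 13 consecutive epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.2, patience=13)

# x_train, y_train, x_val, y_val stand for the prepared arrays (hypothetical names):
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=64, callbacks=[reduce_lr])
```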
Table II.3 Details of learning process hyperparameters
EXPERIMENTAL RESULTS & BENCHMARK
Experimental setup & dataset
Hand gesture recognition has been a prominent research area for quite some time, leading to the availability of numerous open-source datasets. Leveraging these resources, we use existing datasets for training and evaluating our proposed model.
However, recent datasets collected under various real-life scenarios, encompassing diverse environments from indoor to outdoor settings with multiple participants, present certain challenges. While these datasets provide a more realistic simulation, they are limited in size, making it difficult to train a model from scratch to achieve optimal performance. To address this limitation, we turned to a larger dataset containing a substantial number of samples, although one that typically features only the hand in the image. Therefore, we decided to pre-train our proposed model on this dataset. Subsequently, we employ transfer learning to fine-tune this pre-trained model using a smaller dataset. After careful consideration, we selected the American Sign Language (ASL) dataset [24] for pre-training and the OUHANDS dataset [25] for transfer learning.
The ASL dataset includes 87,000 images of American Sign Language alphabet signs, each 200×200 pixels, across 26 classes covering the letters A-Z. Each image captures a hand in different experimental environments and under different lighting conditions. However, this dataset contains no people, only hands. Some examples from this dataset are illustrated in Figure 3.1. The OUHANDS dataset is more diverse and better simulates real-life scenarios: the data was recorded in many different environments by many volunteers, and people appear in the images as distracting elements. However, this dataset has only 3,000 images covering 10 gestures: A, B, C, D, E, F, H, I, J, and K. Figure 3.2 illustrates some samples from the OUHANDS dataset.
Figure III.1 Some samples from the ASL dataset
Figure III.2 Some samples from the OUHANDS dataset
In addition, we also collected a dataset in our own environment with classes corresponding to OUHANDS. This dataset is mixed with the OUHANDS set to increase the number of samples and help the model generalize better. Our dataset includes 300 images for 10 classes; each class contains 30 images, corresponding to the number of experimental repetitions of a participant in the OUHANDS dataset. Figure 3.3 describes our experimental setup. Each recording is a 30-second video in which the hand maintains the same gesture while moving to different angles relative to the camera to capture many different perspectives. All datasets are resized to 96 × 96 × 3, as this is the input size used when training the model.
Evaluating metrics
Accuracy is a metric that measures the performance of the model across all samples. It is the ratio of the number of correct predictions to the total number of predictions. This metric can be formulated as Equation 3.1:
$$ \mathrm{Accuracy} = \frac{\mathrm{true\_pred}}{\mathrm{total\_pred}} \tag{3.1} $$
where true_pred is the number of correct predictions and total_pred is the total number of predictions.
The confusion matrix is a tabular metric used to evaluate the performance of a classification algorithm. Each row of the table corresponds to a predicted class, and each column represents a true class. Each cell contains the number of samples predicted with the label of that row. A good confusion matrix has high values on the diagonal and low values elsewhere.
Using the confusion matrix gives detailed insight into the model's behavior for each class. By quantifying the difference between true and predicted class labels, this evaluation metric sheds light on any non-uniform predictions or class imbalances that may afflict the model's decision-making.
Experimental results & Ablation studies
After training the model with the two expansion factors for the SE Conv Block, 0.25 and 3.0, we obtained the results shown in Figure 3.4 and Figure 3.5. Figure 3.4 illustrates the training progress of the model with an expansion factor (t) of 3.0, which reaches the desired threshold after approximately 50 epochs. Remarkably, the model exhibits no signs of overfitting, as evidenced by the convergence of the loss curves for both the training and validation sets. In Figure 3.5, a comparison between the models employing t values of 0.25 and 3.0 reveals high accuracies of 96.7% and 99.6%, respectively. Although the model using t = 0.25 is 3% lower in accuracy, it is also 1.4 times smaller than the model using t = 3.0.
However, when employing transfer learning on the OUHANDS dataset using the FOMO technique, significant differences appear, as depicted in Figure 3.6. The confusion matrix highlights pronounced disparities between the two models. While the t = 3.0 model achieves high accuracies exceeding 80% across all classes, the t = 0.25 model exhibits instability and variability among classes. Notably, the Non class achieves a score exceeding 99%, contrasting sharply with class C, which registers only 43%.
This observation underscores a drawback of lightweight deep learning models based on the TinyML technique: because they must be optimized for specific datasets, these models often sacrifice generalizability. A comparison between the OUHANDS and ASL datasets further elucidates this point. While both datasets relate to HGR, OUHANDS presents distinct challenges, particularly in background differentiation, that are absent in the ASL dataset. Consequently, the model with an expansion factor of 0.25, despite being optimized on ASL, still struggles to perform effectively on OUHANDS.
Figure III.4 Learning curve when training the model on the ASL dataset with expansion factor t = 3.0. The learning curve shows the decrease in the loss function and the increase in accuracy. The model achieves its best results after more than 50 epochs and is nearly 100% accurate on both the training and validation sets.
Figure III.5 Accuracy and number of parameters when using two different expansion factors, t = 0.25 and t = 3.0
Table III.1 Summary of the ablation evaluation of the proposed model, comparing its accuracy and model size with several modifications
This section aims to showcase the effectiveness of the proposed architecture, characterized by a fusion of SE blocks with a lightweight classifier (LC), through an ablation study. To achieve this, we compare various CNN structures, each constructed by adjusting specific parts of the proposed model. The first modification removes the SE block from both the SE Conv and SE Residual blocks while keeping identical hyperparameters; this allows us to assess the impact of the SE blocks on model performance. The second modification replaces the lightweight classifier with a Flatten layer followed by fully connected layers; this alteration enables us to evaluate the model's effectiveness in terms of computational cost. Looking at Table 3.1, removing the SE block reduces the model size, but it also reduces the model's performance on both the OUHANDS and ASL datasets: the model achieves only 85.5% when evaluated on OUHANDS and approximately 95% on ASL. If we eliminate the lightweight classifier and replace it with a classifier block consisting of Flatten and Dense layers, the number of parameters increases to 99.7K while the model's performance drops to 96.12% on ASL. This demonstrates that our proposed model is superior when using the proposed modules.
Figure III.6 Confusion matrix of the proposed method with a) t = 0.25 and b) t = 3.0 when applying transfer learning on the OUHANDS dataset
Benchmark on microcontrollers
To prepare the models for deployment on microcontrollers, we utilize post-training integer quantization to convert the existing floating-point model to 8-bit precision. This is done using the TFLite framework's tools, allowing us to transform the model without manual retraining while still preserving its accuracy (a sketch of this conversion is given below). The quantization leads to a significant reduction in model size, up to 4 times smaller, improving both size and inference speed on the CPU. Looking at Figure 3.7, it can be seen that our model is much smaller than previous models. At only 155 KiB after optimization, the proposed model is 6 times smaller than the CrossFeat model at 975.5 KiB, the smallest of the previous models, while its accuracy, as mentioned above, remains extremely competitive. The other models in the table are all larger than 1,000 KiB, which can be a major barrier to deploying them on microcontrollers. The benchmark is summarized in Table 3.2. Our model also has the fastest average inference time, about 269 ms, twice as fast as ExtriDeNet [32] (about 500 ms) and roughly 10 times faster than the other models (CrossFeat [33] at 3,281 ms and Lightweight CNN [34] at 2,025 ms).
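The following is a minimal sketch of this post-training full-integer quantization using the TFLite converter; the stand-in model, the random calibration data, and the output file name are illustrative assumptions, while the converter settings follow the standard TFLite workflow.

```python
import numpy as np
import tensorflow as tf

# Stand-in for the trained Keras model (the real micro-CNN is described in Chapter II).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(96, 96, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(26, activation="softmax"),
])

def representative_dataset():
    # In practice this should yield ~100 real preprocessed training images;
    # random data is used here only as a placeholder.
    for _ in range(100):
        yield [np.random.rand(1, 96, 96, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer (8-bit) weights and activations.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("hgr_int8.tflite", "wb") as f:   # hypothetical output file name
    f.write(tflite_model)
```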
Figure III.7 Comparison of model size before and after applying the quantization technique for each of the following models: the proposed model with micro-bottleneck, Lightweight CNN [34], CrossFeat [33], and ExtriDeNet [32]
Comparison with state-of-the-art approaches
The proposed model delivers superior performance, achieving over 99% accuracy on the ASL dataset. Some previous methods have produced similar results on this dataset. The method in [35] uses convolutional neural networks with spatial pyramid pooling (SPP); this structure enables the network to accept input images of arbitrary sizes and aspect ratios, removing the need to resize or crop images to a fixed input size. In addition, SPP typically utilizes multiple levels of pooling, with each level representing a different grid size, enabling the network to capture both fine-grained details and global context from the input image. This model reaches an accuracy of up to 99.99%; however, it has up to 6.3 million parameters and is completely incapable of being embedded in microcontrollers. Among lighter models, the method in [34], built from successive light bottleneck blocks, achieves 98.72% accuracy; however, at 848K parameters, it is still 20 times larger than the proposed model. CrossFeat [33] uses convolutions with different kernel sizes to extract multi-scale features and retains low-level features so that spatial information can be passed to later layers; thanks to this, it achieves an accuracy of up to 99.5% with 975K parameters. Because it must run several kernel sizes in parallel to extract features, this model is not optimized for RAM-limited hardware.
For the OUHANDS dataset, ExtriDeNet [32] also uses a multi-scale feature extractor, but its evaluation on OUHANDS reaches only 65.1% even though it has up to 1.3 million parameters. A more optimized approach produces particularly good results of up to 98.75% using two architectural elements: a multi-scale structure and lightweight attention to enhance the power of the model [36]. This model is lighter than ExtriDeNet, with only 666.7K parameters, but still 15 times larger than the proposed model. Therefore, our proposed model offers a viable trade-off between accuracy and model size, making it suitable for deployment on microcontrollers with various hardware constraints. The comparison is summarized in Table 3.2.
Table III.2 Summary of the results of the proposed model and previous models in terms of Flash usage, RAM usage, and average inference time
Lightweight CNN [34]: 1018 KiB in Flash, 271.97 KiB in RAM, average inference time 2025 ms
CrossFeat [33]: 956.38 KiB in Flash, 510.69 KiB in RAM, average inference time 3281.5 ms
ExtriDeNet [32]: 1.3 MiB in Flash, 269 KiB in RAM, average inference time 500.4 ms
Proposed model: 140 KiB in Flash, 290.13 KiB in RAM, average inference time 269 ms
CHAPTER IV BUILDING A CONTROL SYSTEM USING HAND GESTURES
System diagram
Figure IV.1 System design block diagram
Figure 4.1 shows the block diagram of the system components. The OpenMV H7 takes on the task of recognizing gestures whenever a hand appears in its field of view. Subsequently, OpenMV processes the captured frame, executes the model, and transmits the classified gesture result to the LabVIEW software installed on the computer. Based on the recognized gesture type, LabVIEW performs the appropriate logical operations. In the event of a gesture signaling motor control, the computer sends a command to the motion controller via the RS232 connection, and the motion controller in turn drives the driver to supply pulses to the stepper motor. Figure 4.2 depicts the actual control system.
Figure IV.2 The control system in practice
Hardware
4.2.1 Motion Control PMC-1HS-USB
Figure IV.3 Motion Control PMC-1HS-USB
Table IV.1 PMC-1HS-USB Specifications
Input/Output Contact Parallel I/F: Input 13, Output 4
X-axis: Input 8, Output 5 (general output 1)
Drive Speed 1 pps to 4 Mpps
Figure IV.4 Driver MD5-HD14
Table IV.2 MD5-HD14 Specifications
Operation method Bipolar constant current pentagon drive
Applied Motor 5-phase stepper motor
Resolution FULL STEP (1-division), HALF STEP (2-division), Micro STEP (4, 5, 8, 10, 16, 20, 25, 40, 50, 80, 100, 125, 200, 250-division)
Basic Step Angle [FULL/HALF] 0.72º/0.36º
Max holding torque 8.3kgf.cm
Rotor Moment of Inertia 280 g·cm²
Supported Image Formats Grayscale / RGB565 / JPEG
Maximum Resolution Grayscale: 640x480; RGB565: 320x240; Grayscale JPEG: 640x480; RGB565 JPEG: 640x480
Software
OpenMV is a powerful and affordable machine vision platform that combines a camera module with the flexibility of programming in Python. It allows users to easily implement machine vision algorithms, track colors, detect faces, and control I/O pins in the real world [37].
The OpenMV is programmed in Python, a high-level language, which makes it easy to learn and suitable for a wide range of users. With the aim of becoming the "Arduino of Machine Vision" [37], OpenMV includes many built-in machine vision modules for tasks like color tracking and face detection. Moreover, the OpenMV Cam is designed to be compact, cost-effective, and expandable with add-on modules, allowing users to enhance its capabilities and make it suitable for various applications and projects.
The OpenMV has a user-friendly IDE with a powerful text editor, frame buffer viewer, examples, and file management, as shown in Figure 4.7.
LabVIEW is a graphical programming environment that provides unique productivity accelerators for test system development, such as an intuitive approach to programming, connectivity to any instrument, and fully integrated user interfaces [38]
In recent years, due to its ease of coding, LabVIEW has become one of the most popular data collection systems. Moreover, LabVIEW can be used to acquire data and control instruments, develop with graphical programming, monitor and interact with test systems, gain insights from data, communicate using industry protocols, or even add code from other programming languages [38]. In fact, LabVIEW has a native user interface for monitoring and control, can connect to any instrument regardless of vendor, and works with popular languages such as Python, C, and .NET.
In LabVIEW, each program is referred to as a Virtual Instrument (VI). A VI has two main parts: the front panel, which serves as the user interface, and the block diagram, which contains the graphical code. As a graphical programming language, instead of writing lines of code, the user interface can be built by simply dragging and dropping objects [39]. The objects can be buttons, potentiometers, knobs, charts and graphs from oscilloscopes, LEDs to represent on or off states, and other input mechanisms or output displays. Figure 4.8 shows an example:
Figure IV.8 LabVIEW front panel, which contains many types of objects placed inside decoration blocks
The behavior of each object in the user interface is determined in the block diagram. This can be done easily by using the thousands of engineering analysis functions in LabVIEW through a user-friendly interface.
Figure IV.9 LabVIEW block diagram. The input values, output values, and buttons are represented on the front panel.
Communication between LabVIEW and OpenMV can be handled easily by applying an add-on module [40].
Figure IV.10 Simple block diagram of connecting to OpenMV through LabVIEW
Figure IV.11 LabVIEW & OpenMV connection UI [40]
The image and the data received from the OpenMV are continuously updated because they are inside a while loop, with the stop condition controlled by the user. The UI looks (and behaves) similar to the OpenMV IDE, the only difference being that there are three areas instead of one: Simple, REPL, and Python script. The Python script area works like the OpenMV IDE, while the Simple area helps write the Python code: make a few choices and then hit connect and run. The REPL option allows the user to run code live (line by line) in the Python REPL, a handy way to debug [40].
Controller programming
4.4.1 Deploy model into OpenMV Cam H7
Deploying the model on OpenMV involves several steps: translating the model into binary code, building the firmware, and embedding it into OpenMV. Fortunately, the Edge Impulse tool supports this entire process, facilitating automatic deployment of models to various microcontrollers. Figure 4.12 shows the interface used to deploy the model.
Figure IV.12 User interface for deploying the model to OpenMV
From Figure 4.13, we can see that the camera sensor first continuously captures images. Next, the model classifies them to determine whether a gesture is present and, if so, identifies the specific gesture. After that, the class of the gesture is transmitted to LabVIEW. The process continues until a stop command is received from LabVIEW. The interface of this process is illustrated in Figure 4.14, and a minimal sketch of such a loop is given below.
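The following is a minimal sketch of this loop in OpenMV MicroPython, assuming the firmware's tf module and the trained.tflite / labels.txt files produced by the Edge Impulse export; the confidence threshold and the exact serial messages expected by LabVIEW are assumptions.

```python
# Minimal OpenMV loop sketch: capture, classify, report the class over USB serial.
import sensor, time, tf

sensor.reset()
sensor.set_pixformat(sensor.RGB565)
sensor.set_framesize(sensor.QVGA)   # frames are scaled to the model input by tf.classify
sensor.skip_frames(time=2000)

# Hypothetical file names produced by the Edge Impulse OpenMV export.
net = tf.load("trained.tflite", load_to_fb=True)
labels = [line.rstrip() for line in open("labels.txt")]

clock = time.clock()
while True:
    clock.tick()
    img = sensor.snapshot()
    # Classify the frame and take the highest-scoring class.
    for obj in tf.classify(net, img, min_scale=1.0, scale_mul=0.8,
                           x_overlap=0.5, y_overlap=0.5):
        scores = obj.output()
        best = scores.index(max(scores))
        if scores[best] > 0.7:      # assumed confidence threshold
            # LabVIEW reads this line over the USB virtual COM port.
            print(labels[best])
```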
Figure IV.13 OpenMV control algorithm flowchart
Figure IV.14 The result shown in the OpenMV interface
Figure IV.15 Algorithm flowchart of the LabVIEW program
After receiving the class of the gesture being made in front of the camera sensor, LabVIEW checks which type of command corresponds to the result. There are 9 gestures used in LabVIEW (shown below), divided into 3 main command groups: change setting, change mode, and action. If a change-setting or change-mode command is detected, the corresponding settings are changed by making the gestures of that command. If an action command is detected, the command is transmitted to the VISA port.
Here, the command is converted to a pulse signal, which is then transmitted to the motor. Otherwise, the system checks whether the user has chosen to exit; if so, it stops receiving results from OpenMV and stops the system, otherwise it continues receiving the class of the gesture.
According to Figure 4.16, gesture classes 1 to 5 are used for action commands, while gestures 6 and 7 are used to determine whether the setting or the mode is to be modified. Gestures 8 and 9 are used to change the value of the setting or mode up and down.
Figure IV.16 The 9 gestures being used, where "1 - Left" means that the class of the gesture below is 1 and LabVIEW interprets it as the command to go left
LabVIEW's user interface is shown in Figure 4.17. The first time this VI runs, a mode must be chosen; otherwise, none of the action commands will be transmitted. When the check-change-mode or check-change-setting button turns yellow, the mode or setting is ready to be modified. The buttons next to the values in the parameter settings indicate, by turning yellow, which value is selected for modification, while in the mode setting the modes themselves are buttons. Gestures 8 and 9 are used to choose a parameter or mode. To change the value of the parameter whose button has turned yellow, first repeat gesture 6; if that button is blinking, use gesture 8 or 9 to increase or decrease the value.
There are a total of 3 modes: continuous, default, and use-parameters. In continuous mode, the motors remain stationary whenever no action command is detected. In default mode, the motors follow the action command with a moving distance of 5 mm and a speed of 6 mm/s, and can be interrupted by the "stop" command. In the last mode, the motors follow the action command with the moving distance and speed taken from the parameter settings section (placed at the bottom left of the UI), and can likewise be interrupted by the "stop" command.
Experimental result
The process consists of two main activities: setup and control
During the setup phase, users configure parameters such as moving distance, speed, and the horizontal and vertical scanning area of the sensor. Additionally, they select the operating mode. Mode 1 enables continuous movement, where the motor activates upon recognition of any control gesture and ceases only when no control gestures are detected. In Mode 2, the sensor head moves according to a default setting, facilitating fine-tuning operations; each control gesture in this mode prompts a movement of 0.1 mm. The third mode allows movement based on user-defined parameters, requiring the specification of distance and speed. Once configured, each control gesture triggers the sensor head to move according to the preset settings.
The second phase involves control gestures corresponding to the selected mode. With the parameters set, users issue control gestures to manage the system effortlessly.
Overall, command transmission and reception work smoothly with low latency. Gestures should be made neither too close to nor too far from the camera. One remaining challenge is bright lighting conditions, such as light from ceiling lamps, which can cause the model to predict the wrong gesture class.
Figure IV.17 LabVIEW’s user interface
Figure IV.18 When no mode is defined, no action is received even though OpenMV captures a "left" gesture
Figure IV.21 Motors work in continuous mode
Figure IV.22 Motors work in use parameters mode