Research Report: Vietnamese Sign Language Recognition Using Machine Learning and Deep Learning


DOCUMENT INFORMATION

Basic information

Title: Vietnamese Sign Language Recognition Using Machine Learning and Deep Learning
Author: Ngo Minh Ngoc
Advisor: Pham Thi Viet Huong
Institution: Vietnam National University, Hanoi
Major: Business Data Analytics
Document type: Student Research Report
Year of publication: 2024
City: Hanoi
Format: 43 pages, 2.23 MB

Structure

  • 1. Project Name
  • 2. Project Code
  • 3. Member List
  • 4. Advisor(s)
  • 5. Abstract
  • 6. Keywords
  • CHAPTER 1. LITERATURE REVIEW
    • 1. Concerning the Rationale of the Study
    • 2. Research Questions
    • 3. Motivation and Objective
    • 4. Research Methods
  • CHAPTER 2. DATA & METHODOLOGY
    • 1. Sign Language Recognition Basics
      • 1.1. Understanding Sign Language Recognition
      • 1.2. Sign Language Recognition Process
    • 2. Overview of Sign Language Recognition Models
    • 3. Model Development and Optimization Process
      • 3.1. Data Collection
      • 3.2. Preprocessing Techniques
      • 3.3. Deep Learning Algorithm
    • 4. Implementation of a Sign Language Recognition System
  • CHAPTER 3. UTILIZING DEEP LEARNING IN SIGN LANGUAGE RECOGNITION SYSTEM
    • 1. Deep Learning Basics
      • 1.1. Importance and Applications of Deep Learning
      • 1.2. Understanding Neural Networks
      • 1.3. Convolutional Neural Network (CNN)
    • 2. Effective Recognition of Sign: Machine Learning Approaches
      • 2.1. Utilizing MediaPipe for Sign Recognition
      • 2.2. Advantages and Limitations of MediaPipe
  • CHAPTER 4. RESEARCH METHODS
    • 1. Implement Hand Detection and Sign Recognition Algorithm
      • 1.1. Hand Detection
      • 1.2. Convolutional Neural Network
    • 2. Optimize Parameters to Improve Gesture Recognition Accuracy
  • CHAPTER 5. RESULTS & DISCUSSIONS
    • 1. Detailed Results
    • 2. Analysis
    • 3. Methodological Review
    • 4. Challenges and Limitations
    • 5. Future Work
  • CHAPTER 6. CONCLUSION & RECOMMENDATIONS
    • 1. Summary of Findings
    • 2. Concluding Statement
    • 3. Practical Applications
    • 4. Recommendations

Content

VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL

STUDENT RESEARCH REPORT
VIETNAMESE SIGN LANGUAGE RECOGNITION USING MACHINE LEARNING AND DEEP LEARNING
CN.NC.SV.23_49

Team Leader: N

Project Name

- English: Vietnamese Sign Language Recognition Using Machine Learning and Deep Learning

- Vietnamese: Nhận dạng ngôn ngữ ký hiệu tiếng Việt bằng học máy và học sâu

Project Code

CN.NC.SV.23_49

Member List

Lê Phương Thảo Nguyên BDA2021A 21070173

Cao Đỗ Gia Khanh BDA2022B 22070771

Nguyễn Trần Hà Oanh BDA2021A 21070073

Advisor(s)

Phạm Thị Việt Hương, PhD

Abstract

Sign language is the only way for deaf and mute people to communicate with each other and with everyone in the community, but not every hearing person can understand sign language. Therefore, in this report we propose a method that combines humans and computers to create a system that recognizes Vietnamese sign language, specifically the Vietnamese alphabet, Vietnamese numbers, and signs, to serve as a bridge between deaf and mute people and hearing people, helping them better integrate into the community.

Based on the success of CNNs in image processing, we propose to use a CNN combined with VGG to increase efficiency in recognizing symbols, and LSTM, a special form of RNN that remembers information from earlier parts of a data sequence, to help accurately predict the following parts. We use images for static symbols and videos for dynamic symbols as input data. The experimental results demonstrate that the proposed VSLR system recognizes Vietnamese sign language (VSL) with high accuracy.

Keywords

SUMMARY OF THE STUDENT RESEARCH REPORT

LITERATURE REVIEW

Concerning the Rationale of the Study

The rationale behind the study on Vietnamese Sign Language (VSL) lies in the imperative to address the communication barriers faced by the deaf community in Vietnam. Despite the rich linguistic and cultural significance of VSL, there remains a pronounced gap in accessibility and inclusivity for deaf individuals in various aspects of daily life, including education, employment, and social interaction. By undertaking research focused on VSL recognition, the study aims to leverage advanced machine learning and deep learning techniques to develop robust and accurate recognition systems that facilitate seamless communication between deaf individuals and the broader society. Moreover, the study is motivated by the pressing need to preserve and promote VSL as a vital component of the cultural heritage of Vietnam's deaf community. By advancing the state of the art in VSL recognition technology, the research seeks to foster greater awareness, understanding, and acceptance of sign language as a legitimate means of communication, thereby contributing to a more inclusive and equitable society for all individuals, regardless of their hearing abilities. Ultimately, the rationale of the study lies in its commitment to empowering deaf individuals in Vietnam to fully participate and engage in all facets of society, while also celebrating and preserving the unique linguistic and cultural heritage of VSL.

Research questions

- What are the most effective feature representations for encoding VSL gestures in machine learning models?

Leveraging machine learning algorithms to interpret sign language gestures requires exploring effective representation methods. Feature extraction techniques like handshape descriptors, motion trajectories, and spatial-temporal representations become crucial in this endeavor. The goal is to identify the most suitable representation that optimizes recognition accuracy.

- How can deep learning architectures be adapted or designed specifically for VSL recognition?

Deep learning has shown remarkable success in various pattern recognition tasks, but its application to sign language recognition poses unique challenges due to the sequential and dynamic nature of sign language gestures. This question aims to explore novel deep learning architectures, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), or their combination (e.g., convolutional LSTM networks), that are tailored to capture the temporal dependencies and spatial configurations inherent in VSL.

- What is the impact of data augmentation techniques on the robustness and generalization of VSL recognition models?

Data augmentation methods, such as temporal warping, spatial transformation, or synthetic data generation, can help increase the diversity and quantity of training data, thereby enhancing the resilience of recognition models to variations in signer style, lighting conditions, or camera perspectives. This question investigates the efficacy of different augmentation strategies in improving the performance and scalability of VSL recognition systems.

- How can cross-modal learning approaches be leveraged to enhance VSL recognition from multiple modalities (e.g., video, depth, or skeletal data)?

Cross-modal learning techniques aim to exploit complementary information from different modalities to improve recognition accuracy and robustness. This question explores the potential benefits of integrating multiple data sources, such as RGB video streams, depth maps, or skeletal joint coordinates, within a unified deep learning framework for VSL recognition.

- What are the real-world usability and accessibility implications of VSL recognition systems developed using machine learning and deep learning?

Beyond technical performance metrics, this question examines the practical considerations and user-centered design aspects of deploying VSL recognition technology in real-world settings. It involves evaluating factors such as system latency, user interface design, ease of interaction, and user satisfaction through user studies and usability testing with members of the deaf community in Vietnam.

Motivation and Objective

Enhancing Accessibility: The motivation behind this research stems from the overarching goal of enhancing accessibility and inclusivity for the deaf community in Vietnam. By developing robust VSL recognition systems, the aim is to empower individuals with hearing impairments to effectively communicate and interact with the broader society, thereby breaking down communication barriers and fostering a more inclusive environment.

With the rapid advancements in machine learning and deep learning, there is a growing opportunity to leverage these technologies for the development of innovative assistive technologies. This research is motivated by the desire to harness the power of these cutting-edge methodologies to create accurate, efficient, and scalable VSL recognition systems. These systems have the potential to greatly benefit deaf individuals by improving their quality of life and enabling them to communicate more effectively.

Cultural Preservation: Vietnamese Sign Language is not only a means of communication but also a vital component of the cultural identity of the deaf community in Vietnam. Preserving and promoting the use of VSL is essential for maintaining cultural heritage and ensuring the linguistic rights of deaf individuals. Thus, this research is motivated by the cultural significance of VSL and the importance of supporting its recognition and use within the community.

Developing Accurate Recognition Models: The primary objective of this research is to develop accurate and reliable VSL recognition models using machine learning and deep learning techniques. These models should be capable of accurately interpreting sign language gestures from video or sensor data with high precision and robustness across different users, environments, and signing variations.

Exploring Novel Architectures: Another objective is to explore novel deep learning architectures and methodologies specifically tailored for VSL recognition. This includes investigating architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), attention mechanisms, and their combinations to effectively capture the temporal and spatial dynamics of sign language gestures.

Addressing Real-World Challenges: The research aims to address real-world challenges associated with VSL recognition, such as variability in signing styles, occlusions, background clutter, and environmental factors. By incorporating data augmentation techniques, cross-modal learning approaches, and domain adaptation methods, the objective is to improve the robustness and generalization capabilities of VSL recognition systems.

Evaluating Usability and Accessibility: Beyond technical performance metrics, the research also seeks to evaluate the usability and accessibility implications of VSL recognition systems in real-world scenarios. This involves conducting user studies and usability testing with members of the deaf community to assess factors such as system latency, user interface design, ease of interaction, and user satisfaction, ensuring that the developed systems meet the needs and preferences of end-users.

Research Methods

Data Collection and Preprocessing: The study begins with the collection of a comprehensive dataset comprising video recordings or images of VSL gestures performed by native signers. This dataset is annotated with labels corresponding to different sign language gestures. Preprocessing techniques are applied to clean and standardize the data, including resizing images, normalizing pixel values, and possibly augmenting the dataset to increase its diversity and robustness.

Model Selection and Architecture Design: The study selects deep learning models suitable for VSL recognition, such as Convolutional Neural Networks (CNNs) and VGG16. These models are chosen for their ability to capture spatial hierarchies of features in images, which is essential for recognizing complex hand gestures and movements in sign language. The architecture of the chosen models is tailored to accommodate the unique characteristics of VSL data, such as sequential temporal patterns and spatial configurations of gestures.

To enhance the performance of the CNN and VGG16 models in recognizing VSL gestures, training and evaluation are crucial. Training entails optimizing model parameters by minimizing a loss function through backpropagation and gradient descent. Subsequently, the trained models are evaluated on a separate testing dataset to determine their accuracy, precision, and recall, among other pertinent performance metrics.

Deployment and Application: Once trained and validated, the optimized CNN and VGG16 models are deployed in real-world applications for VSL recognition. This may involve integrating the models into user-friendly interfaces or communication devices, allowing deaf individuals to interact with the system through gestures captured by cameras or sensors. Usability testing and user feedback are crucial for refining the deployed system and ensuring its effectiveness and accessibility.

DATA & METHODOLOGY

Sign Language Recognition Basics

Understanding sign language recognition is a pivotal aspect of research within the realm of Vietnamese Sign Language (VSL). This linguistic modality not only serves as a means of communication for the deaf community in Vietnam but also embodies a rich cultural and social identity. In research reports focused on VSL, there is a profound emphasis on deciphering the intricate gestures, facial expressions, and body movements that constitute this vibrant language. From exploring the nuances of handshapes to analyzing the grammatical structure, researchers aim to develop robust recognition systems that bridge the gap between sign language users and the broader society. Through meticulous observation, data collection, and technological advancements, these studies contribute to the enhancement of accessibility and inclusion for the deaf community in Vietnam, fostering a more equitable society where communication barriers are minimized.

The sign language recognition process unfolds as a multifaceted endeavor aimed at comprehensively understanding and effectively interpreting the linguistic expressions of the deaf community in Vietnam. At its core, this process involves a meticulous examination of the intricate handshapes, movements, facial expressions, and body language that constitute VSL's rich vocabulary and grammar. Researchers delve into the linguistic structure of VSL, dissecting its phonological, morphological, and syntactic components to discern patterns and rules governing its usage. Through extensive data collection, often involving video recordings of native signers, researchers amass a corpus of annotated sign language data, which serves as the foundation for training and testing sign language recognition algorithms.

Advanced machine learning techniques, including deep learning and computer vision, are employed to develop robust recognition systems capable of accurately interpreting sign language gestures in real time. These systems undergo rigorous evaluation to assess their accuracy, efficiency, and usability in practical scenarios, such as communication devices, educational tools, or accessibility applications. Moreover, researchers collaborate closely with members of the deaf community to ensure that recognition systems are culturally sensitive, inclusive, and aligned with the diverse linguistic variations and cultural nuances present within VSL. By advancing the state of the art in sign language recognition technology, research efforts in VSL contribute to fostering greater accessibility, autonomy, and social integration for deaf individuals in Vietnam, empowering them to fully participate in all facets of society.

Overview of Sign Language Recognition Models

In the realm of sign language recognition, convolutional neural networks (CNNs) and the VGG16 architecture stand as prominent methodologies for accurate and efficient gesture classification. CNNs, inspired by the organization of the visual cortex, excel at capturing spatial hierarchies of features within images. Specifically, in sign language recognition, CNNs are adept at extracting intricate patterns from video frames or image sequences, enabling robust classification of hand gestures and movements. Meanwhile, VGG16, a deep CNN architecture characterized by its simplicity and effectiveness, comprises multiple convolutional layers followed by max-pooling layers, culminating in fully connected layers for classification. Its hierarchical structure facilitates the extraction of complex features from input data, making it particularly suitable for tasks with high-dimensional inputs like sign language recognition. Both CNNs and VGG16 have been instrumental in advancing the state of the art in sign language recognition, offering powerful tools for developing accurate and scalable recognition systems that cater to the needs of the deaf community.

Model Development and Optimization Process

3.1.1 A comparison between ASL and VSL

In this research, we build our own Vietnamese Sign Language (VSL) dataset. To do that, we consider the differences between the two sign languages.

The ASL alphabet includes 26 characters: "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z". 24 of them can be regarded as static hand poses, while J and Z are considered dynamic hand motions [1].

Meanwhile, the VSL alphabet is quite different from any other sign language fingerspelling, as it has a number of additional accent marks and removes some common characters of the Latin alphabet. The standard of Vietnamese Sign Language was introduced by the Vietnamese Ministry of Education and Training [2].

As shown in Figure 1 and Figure 2, according to the Vietnamese standard there are 22 letters of VSL that are similar to ASL. 11 of them have the same hand pose, including "A, B, C, G, L, O, P, Q, U, V, Y"; the other 11 include "D, E, H, K, M, N, …".

Apart from these 22 similar letters, VSL removes 4 letters of ASL ("F, J, W, Z", which are included in Fig. 2 for reference) and adds a unique letter Đ (D with stroke, "dyet") along with 3 additional accent marks: " ^ " (the circumflex), " ˘ " (the breve), and " ' " (the horn). Vietnamese also has 5 tonal symbols, shown in the last two rows of Fig. 1: " ` " (grave), " ´ " (acute), " ̉ " (hook), " ˜ " (tilde) and " ̣ " (underdot); these marks are combined with vowels only.

In this phase, we collect images of gestures by recording with the participation of our group members and 5 volunteers, both male and female, aged 19 to 21. All participants were trained in the Vietnamese Alphabet Sign Language standard, and each participant executed all 25 static symbols of the Vietnamese alphabet and the signs of the numbers from 0 to 9, inclusive.

Smartphones of various brands (iPhone, Samsung, Xiaomi, Oppo, etc.) are employed to capture videos of the subjects. Each video exceeds 20 minutes and is saved in MP4 or MOV format, featuring an RGB color space. During recording, all videos adhere to a uniform frame rate of 30 FPS and a progressive HD signal format of 720p.

To vary environmental conditions, we use two types of backgrounds (green and white). Besides, we set up the cameras at different angles in order to vary the views as shown in Figure 3 below.

Figure 3. Recording process with different angles

Figure 4. Data captured with white and green backgrounds under varying brightness, contrast and distance

We use the data collected from 8 people for training; the rest, along with some public data found on Kaggle or GitHub, are used for testing.

The video trimming phase in this script involves two main steps: converting a video into frames and processing these frames to detect hand landmarks.

- The script starts by reading a video file from a specified path.

- It then iterates through each frame of the video, extracting them one by one.

- Each frame is saved as an image in a specified output folder.

- This process effectively converts the video into a sequence of images (frames) that can be further processed.

Overall, this video trimming phase prepares the data for further analysis or application by converting the video into individual frames and extracting relevant hand landmarks from each frame. These processed frames with hand landmarks can then be used for tasks such as gesture recognition, hand tracking, or gesture-based interaction in applications.
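As a rough illustration of this video-to-frames step, the sketch below reads a recording with OpenCV and saves each frame as an image. The function name and file paths are hypothetical; the report's actual script may differ.

```python
import os
import cv2  # OpenCV

def video_to_frames(video_path: str, output_folder: str) -> int:
    """Split a recorded gesture video into individual frame images."""
    os.makedirs(output_folder, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    frame_index = 0
    while True:
        success, frame = capture.read()
        if not success:          # no more frames to read
            break
        # Save each frame as a numbered JPEG in the output folder
        cv2.imwrite(os.path.join(output_folder, f"frame_{frame_index:06d}.jpg"), frame)
        frame_index += 1
    capture.release()
    return frame_index

# Example usage (hypothetical paths):
# n_frames = video_to_frames("data/videos/letter_A.mp4", "data/frames/letter_A")
```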

3.2.2 Processing Frames to Detect Hand Landmarks

In related works, thresholding or binary maps are normally used to segment regions of interest (ROI) or to segment only the hands in images [19]. Beyond these standard pre-processing methods, we introduce another method to train and learn how to classify images.

In our proposed method, we use a pre-processing step that detects 21 joints on the hand (hand landmark points) and reconstructs the hand shape across those joints. The libraries used are OpenCV [3] and MediaPipe [4]. For motion tracking and high-precision hand and finger recognition, MediaPipe is a good option because it is based on machine learning (ML) models trained on large datasets by Google. With ML, this library can locate 21 hand landmark points with 3D coordinates from only a single image. Furthermore, MediaPipe's lightweight analytical approach makes these powerful algorithms fast to run, so it is simple and quick to use in real time, even on phones.

The MediaPipe ML method uses a palm detection model. This model is intended to improve real-time usability on mobile devices and only requires one shot (a one-shot detector technique). To recognize the hand and locate its key points, MediaPipe employs two models, a reduced model and a full model, which are applicable to different hand sizes and support magnifications of up to approximately 20x. Furthermore, the model is given additional information about the arm and the human body, which increases contrast and makes hand recognition more accurate. To locate the hand, the model first learns to find the palm area and then expands to find other locations on the hand. This approach was trained on more than 30k images collected worldwide, each marked with 21 hand landmark points in a 3D coordinate system and then projected back to 2D coordinates, ignoring image depth.

Auto brightness and contrast balance method (input: an array of images)

Before applying the MediaPipe library to find a hand with 21 points, we apply an algorithm to balance brightness and contrast.

We used the OpenCV library to scan through each frame of the video, then applied the MediaPipe library to find hands in the images and cropped the ROI into a square box containing the hand. We resized the new square box to 224*224 and reapplied the MediaPipe library to get the marked coordinates, reconstructing an image consisting of only the 21 hand landmark points and the connections between them (representing joints and bones) on a smooth white background. Applying MediaPipe in this way creates maximum contrast, helps highlight the shape of the hand, and reduces noise coming from the typical image background. Finally, to reduce the size of the data, we normalize the image pixels' color range from (0, 255) to (0, 1).

Pre-processing algorithm of the proposed method:

1. Input: images of various sizes split from videos.
2. Apply the auto brightness and contrast balance method.
3. Apply MediaPipe to find the hand.
4. Crop the ROI with a square bounding box over the hand.
5. Use MediaPipe to find the hand joints on the new resized image.
6. Draw the joints on an image using the coordinates from step 5, and use it as the bone diagram image.
7. Normalize the bone diagram image to (0, 1).

Figure 6. A hand image of VSL after step 6
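To make these steps concrete, here is a minimal sketch of the preprocessing pipeline, assuming OpenCV and MediaPipe Hands are used as described above. The brightness/contrast balance is approximated with simple histogram equalization (the report does not give its exact formula), and the ROI padding factor is an arbitrary choice.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

def frame_to_bone_diagram(frame_bgr, out_size=224):
    """Sketch of steps 2-7: balance brightness/contrast, locate the hand with
    MediaPipe, crop a square ROI, redraw only the 21 landmarks and their
    connections on a white background, and normalize to (0, 1)."""
    # Step 2: approximate auto brightness/contrast balance via histogram equalization
    yuv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV)
    yuv[:, :, 0] = cv2.equalizeHist(yuv[:, :, 0])
    balanced = cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR)

    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(balanced, cv2.COLOR_BGR2RGB))
        if not result.multi_hand_landmarks:
            return None                                   # no hand in this frame
        landmarks = result.multi_hand_landmarks[0]

        # Step 4: square ROI around the hand (padding factor is an assumption)
        h, w = balanced.shape[:2]
        xs = [lm.x * w for lm in landmarks.landmark]
        ys = [lm.y * h for lm in landmarks.landmark]
        side = int(max(max(xs) - min(xs), max(ys) - min(ys)) * 1.2)
        cx, cy = int(sum(xs) / len(xs)), int(sum(ys) / len(ys))
        x0, y0 = max(cx - side // 2, 0), max(cy - side // 2, 0)
        roi = cv2.resize(balanced[y0:y0 + side, x0:x0 + side], (out_size, out_size))

        # Steps 5-6: re-detect on the resized ROI and draw the skeleton on white
        redetect = hands.process(cv2.cvtColor(roi, cv2.COLOR_BGR2RGB))
        canvas = np.full((out_size, out_size, 3), 255, dtype=np.uint8)
        if redetect.multi_hand_landmarks:
            mp_drawing.draw_landmarks(canvas, redetect.multi_hand_landmarks[0],
                                      mp_hands.HAND_CONNECTIONS)
        # Step 7: normalize pixel values to (0, 1)
        return canvas.astype(np.float32) / 255.0
```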

Convolutional neural networks (CNNs) are one of the most widely used deep learning methods for analyzing visual imagery. CNNs require less pre-processing than other image classification algorithms, because the network learns the filters that are normally hand-engineered in other systems. The use of a CNN reduces the images into a form that is easier to process while preserving the features that are essential for making accurate predictions. There are four types of operations in a CNN: convolution, pooling, flattening, and fully connected layers [5].

The convolution layer usually captures low-level features such as color, edges, and gradient orientation. In CNNs, a matrix known as a kernel is passed over the input matrix to generate a feature map, which is then used in the subsequent layer. The convolution operation is performed by sliding the kernel over the input matrix; at each position, element-wise multiplication is carried out and the results are summed into the final feature map. For example, consider a 2D kernel filter denoted as K and a 2D input image denoted as I. In this case, the convolution operation is calculated as shown in Equation (1):

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)    (1)

Non-linear activation function (ReLU): the node after the convolutional layer is referred to as the activation function. The rectified linear unit (ReLU) can be considered a piecewise linear function, which outputs the input if it is positive and zero otherwise. The expression of the ReLU function is R(z) = max(0, z).

The pooling layer decreases the spatial dimension of the convolved feature. This operation reduces the computational time required to process the data through dimensionality reduction. Furthermore, it has the advantage of retaining dominant features that are positionally and rotationally invariant during the model training process. After the input image has been processed, the higher-level features may be used for classification. Therefore, the feature maps are flattened into a 1-D vector. In a CNN, the flattened output is supplied to a fully connected layer. After training, using SoftMax classification, the model can provide prediction probabilities for the objects in the image [6].

The CNN model in our study is the VGG16 model, a convolutional neural network (CNN) architecture proposed by the Visual Geometry Group (VGG) at the University of Oxford. It is characterized by its depth, consisting of 16 layers, including 13 convolutional layers and 3 fully connected layers [7]. The model's architecture features a stack of convolutional layers followed by max-pooling layers, with progressively increasing depth. Figure 7 and Figure 8 below illustrate the proposed structure of VGG16.

Figure 7. VGG16 structure (source: https://neurohive.io/en/)

Figure 8. VGG16 architecture map (source: https://neurohive.io/en/)
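For reference, a minimal sketch of loading the VGG16 backbone with Keras (assuming a TensorFlow/Keras setup, which the report does not state explicitly) makes the 13 convolutional + 3 fully connected layer structure described above visible in the printed summary.

```python
from tensorflow.keras.applications import VGG16

# Load VGG16 with its original ImageNet weights and the 3 fully connected
# layers included, so the summary lists all 16 weight layers.
model = VGG16(weights="imagenet", include_top=True)
model.summary()  # 13 conv layers in 5 blocks, followed by fc1, fc2 and the predictions layer
```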

Implementation of a Sign Language Recognition System

In the implementation phase of a Sign Language Recognition (SLR) system within a research report focused on Vietnamese Sign Language (VSL), meticulous attention is paid to every aspect of the system's design, development, and deployment. The process typically begins with data collection, where a diverse corpus of VSL gestures is compiled, annotated, and preprocessed to serve as the training and evaluation dataset for the recognition models. Next, state-of-the-art machine learning and deep learning algorithms, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), are employed to develop the core recognition model. These models are trained using the collected data, with rigorous validation and tuning procedures to optimize their performance metrics, including accuracy, precision, and recall. Additionally, techniques such as data augmentation, transfer learning, and ensemble methods may be employed to enhance the robustness and generalization capabilities of the system. Once the recognition model is trained and validated, it is integrated into a user-friendly application or interface, allowing deaf individuals to interact with the system through gestures captured by a camera or sensor device. Throughout the implementation process, considerations of accessibility, usability, and cultural sensitivity are paramount, with continuous feedback and collaboration with the deaf community to ensure that the system effectively meets their needs and preferences. Finally, thorough evaluation and testing are conducted to assess the system's performance in real-world scenarios, with a focus on accuracy, efficiency, user satisfaction, and practical usability. The implementation of a robust and effective SLR system in the context of VSL research represents a significant step towards enhancing accessibility, communication, and social inclusion for the deaf community in Vietnam.

UTILIZING DEEP LEARNING IN SIGN LANGUAGE RECOGNITION SYSTEM

Deep Learning Basics

1.1 Importance and Applications of Deep Learning

Deep learning's transformative impact on sign language recognition has revolutionized Vietnamese Sign Language (VSL) research. This subset of machine learning excels at extracting complex patterns from sequential and high-dimensional data, making it ideal for recognizing VSL gestures. Deep learning architectures like CNNs and RNNs play a crucial role in capturing the temporal dynamics and spatial configurations inherent in VSL gestures, enabling accurate recognition and interpretation. Leveraging deep learning techniques allows researchers to develop robust models that accurately capture subtle nuances and variations in sign language expressions. These advancements pave the way for innovative assistive technologies, educational tools, and communication aids that enhance accessibility, communication, and social inclusion for the deaf community in Vietnam.

Neural networks, particularly deep learning models, are instrumental in accurately interpreting the complex visual patterns inherent in sign language gestures These networks are designed to mimic the structure and function of the human brain, consisting of interconnected layers of artificial neurons that process and transform input data to produce meaningful outputs.

In the context of VSL recognition, convolutional neural networks (CNNs) are particularly effective due to their ability to capture spatial hierarchies of features in images or video frames. By leveraging convolutional layers, CNNs can automatically extract relevant features from VSL data, such as handshapes, movements, and facial expressions, which are crucial for accurate gesture classification. Additionally, recurrent neural networks (RNNs) are well suited to capturing temporal dependencies in sequential data, making them suitable for modeling the sequential nature of sign language gestures over time.

The research report delves into the architecture, training, and optimization of neural networks for VSL recognition. It explores various neural network architectures, including CNNs, RNNs, and their combinations (e.g., convolutional LSTM networks), to determine the most effective approach for capturing the intricate dynamics of sign language gestures. Furthermore, the report investigates techniques for training neural networks, such as backpropagation, gradient descent optimization, and regularization, to improve the models' accuracy and robustness.

Through comprehensive experimentation and evaluation, the research report seeks to elucidate the capabilities and limitations of neural networks in VSL recognition. It evaluates the performance of different neural network architectures on benchmark datasets of VSL gestures, analyzing metrics such as accuracy, precision, recall, and computational efficiency.

Effective recognition of Sign: Machine learning approaches

CNNs are a class of deep learning models specifically designed for processing visual data, making them particularly well suited to tasks such as image classification, object detection, and gesture recognition. Within the context of VSL recognition, CNNs excel at capturing spatial hierarchies of features within video frames or image sequences, enabling them to effectively discern the intricate patterns and configurations of hand gestures, facial expressions, and body movements characteristic of sign language. The architecture of a CNN typically comprises multiple layers, including convolutional layers, pooling layers, and fully connected layers. Convolutional layers apply filters to input data, extracting local features and spatial patterns through convolution operations. Pooling layers then downsample the feature maps, reducing their spatial dimensions while preserving essential information. Finally, fully connected layers aggregate the extracted features and perform classification based on learned representations. By leveraging the hierarchical structure of CNNs and their ability to automatically learn discriminative features from data, researchers can develop robust VSL recognition systems capable of accurately interpreting sign language gestures in real time. Thus, a comprehensive understanding of CNNs is essential for researchers aiming to advance the state of the art in VSL recognition technology and contribute to enhancing accessibility and inclusivity for the deaf community in Vietnam.

2. Effective recognition of Sign: Deep learning approaches

2.1 Utilizing MediaPipe for Sign Recognition

MediaPipe is a framework that can be used to create machine learning pipelines for processing time-series data, including audio and video. It offers cross-platform integration and runs on Android, iOS, desktop/server, and embedded devices such as the Raspberry Pi and Jetson Nano. MediaPipe Holistic tracks important key points by identifying a person's hands, face landmarks, and posture from video frames. By integrating these essential details, it generates an in-depth representation of the person inside the video frame, facilitating precise analysis of their movements. MediaPipe uses key points, or landmarks, to represent poses, hand signs, hand movements and body language.

After annotating a dataset, you can train a sign recognition model with your preferred machine learning framework; MediaPipe supports a number of frameworks, including TensorFlow. Depending on the requirements of the sign recognition task, you can employ either recurrent neural networks (RNNs) or convolutional neural networks (CNNs). Once the sign recognition model has been thoroughly trained, it can be integrated into a MediaPipe pipeline. MediaPipe features an adaptable framework for assembling various modules into custom pipelines. In general, you would build a module that receives video frames as input, uses your trained model to analyze them, and outputs the sign gesture that has been identified.

2.2 Advantages and Limitations of MediaPipe

MediaPipe excels in computer vision and multimedia processing, with a modular architecture allowing for customizable pipelines and rapid prototyping. Its low latency is achieved through optimized algorithms and hardware acceleration, ensuring real-time performance across diverse devices. Cross-platform compatibility enables seamless app development across multiple operating systems. The open-source nature of MediaPipe fosters an active community contributing to its continuous improvement. Additionally, MediaPipe integrates seamlessly with machine learning frameworks like TensorFlow Lite, leveraging trained models for object detection, tracking, segmentation, and pose estimation.

However, MediaPipe has several limitations to consider. It may take developers an extended period to become familiar with the design and APIs of the framework in order to use it effectively in their projects. Furthermore, although MediaPipe offers examples and documentation, these are not as comprehensive as those of other frameworks, so community support and experimentation are needed for particular use cases. Additionally, MediaPipe is limited to operations connected to vision and multimedia processing; therefore, you might need to integrate MediaPipe with other frameworks or libraries if you wish to use it for tasks involving natural language processing or other non-vision-related tasks. Last but not least, even though MediaPipe is designed for real-time performance, a MediaPipe pipeline's actual performance depends on the hardware of the target device; more powerful hardware may be needed for complicated models or resource-intensive operations to attain real-time performance.

RESEARCH METHODS

Optimize Parameters To Improve Gesture Recognition Accuracy

Data augmentation techniques like rotation, scaling, shifting, flipping, and noise addition enhance the diversity of training data. By learning invariant features, the model's generalization capabilities are improved, enabling it to perform well on unseen data.

Before data augmentation, the images in the dataset are in their original form, representing the raw input data. These images may have limited variability in terms of pose, orientation, lighting conditions, and other factors. As a result, a model trained solely on these original images may not generalize well to new, unseen data with different characteristics.

After data augmentation, the images exhibit a wider range of variations due to the applied transformations. For example, some images may be rotated, flipped horizontally or vertically, shifted horizontally or vertically, zoomed in or out, or have changes in brightness and contrast. By introducing these variations, data augmentation helps the model learn to recognize and classify objects under different conditions, making it more robust and improving its generalization performance.
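As an illustration of how such transformations could be configured, the sketch below uses Keras's ImageDataGenerator; the specific parameter values and directory path are placeholders, not the exact settings used in the report.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder augmentation settings covering the transformations mentioned above:
# rotation, shifting, zooming (scaling), flipping, and brightness changes.
augmenter = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    brightness_range=(0.8, 1.2),
    rescale=1.0 / 255,          # normalize pixel values to (0, 1)
)

# Yields augmented batches on the fly during training
# ("data/frames/train" is a hypothetical directory of labeled gesture images).
train_generator = augmenter.flow_from_directory(
    "data/frames/train", target_size=(224, 224), batch_size=32, class_mode="categorical"
)
```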

Although data augmentation expands the effective size of the training dataset, it does not actually increase the number of unique samples. Instead, each augmented image is considered a variation of its original sample. As a result, while the total number of images remains the same, the number of unique samples used for training (X_train) is determined by the size of the original dataset plus the additional augmented samples generated during training.

In this study, the inputs are 12,533 images in total, but the original dataset contained around 2,502 unique samples after data augmentation, which were then split into training and testing sets. Therefore, the support values in the classification report reflect the number of unique samples used for training and testing after data augmentation, rather than the total number of images in the dataset.

We utilize pre-trained models such as VGG16 as feature extractors and fine-tune them on our specific dataset. This leverages the knowledge learned from large-scale datasets and can lead to improved performance, especially when training data is limited.
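A minimal transfer-learning sketch along these lines, assuming a Keras workflow and a hypothetical number of gesture classes, is shown below; the exact head architecture and hyperparameters in the report may differ.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

NUM_CLASSES = 35  # hypothetical: alphabet letters plus digit signs

# Use VGG16 as a frozen feature extractor and train a small classification head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained convolutional blocks

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Categorical cross-entropy matches the one-hot-encoded labels used in this study.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_generator, validation_data=val_generator, epochs=20)
```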

Research Process

a) Establishment of Testing Dataset

In the context of real-time hand gesture recognition, the establishment of a testing dataset involves collecting a diverse set of hand gesture images or video clips that represent the various gestures to be recognized. These images or clips serve as the data for evaluating the performance of the gesture recognition system. It is crucial to ensure that the testing dataset encompasses a wide range of hand gestures to assess the system's ability to accurately recognize different gestures in real-world scenarios.

b) Scaled Data Collection

Scaled data collection ensures that the hand gesture data gathered from varying sources accurately represents real-world variability. It involves collecting data under different conditions, such as varying lighting and hand orientations, to create a comprehensive dataset. This process ensures the dataset's representativeness and allows it to model complex hand gestures effectively.

c) Assessing Variable Impact

Assessing variable impact involves analyzing the contribution of different factors or variables to the performance of the gesture recognition system. This may include evaluating the impact of factors such as lighting conditions, hand orientation, background clutter, and occlusions on the accuracy and robustness of the system. By systematically assessing these variables, developers can identify potential challenges and optimize the system to improve its performance under diverse conditions. The model architecture, including the VGG16 base model and additional layers, determines how the input data (hand gesture images) is transformed into predictions. By experimenting with different architectures and hyperparameters, one can assess the impact of various factors on the model's performance.

d) Real-Time Hand Gesture Recognition

While the provided code focuses on training a model for hand gesture recognition, achieving real-time recognition would involve additional steps such as integrating the trained model into a real-time video processing pipeline. Real-time hand gesture recognition refers to the ability of the system to detect and interpret hand gestures in real time, typically using a camera or sensor input. This involves processing incoming data streams and applying machine learning algorithms or computer vision techniques to classify and recognize hand gestures instantaneously. Achieving real-time performance requires efficient algorithms and optimized processing pipelines to minimize latency and ensure responsive interaction with the system.
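A rough sketch of such a real-time pipeline, assuming the trained Keras model and the frame_to_bone_diagram preprocessing helper sketched earlier (both hypothetical names, as is the model file path), could look like this:

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("vsl_vgg16.h5")                     # hypothetical path to the trained model
class_names = ["A", "B", "C"]                          # placeholder subset of gesture labels

camera = cv2.VideoCapture(0)                           # default webcam
while True:
    ok, frame = camera.read()
    if not ok:
        break
    diagram = frame_to_bone_diagram(frame)             # preprocessing sketched earlier
    if diagram is not None:
        probs = model.predict(diagram[np.newaxis, ...], verbose=0)[0]
        label = class_names[int(np.argmax(probs))]
        cv2.putText(frame, label, (10, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
    cv2.imshow("VSL recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):              # press q to quit
        break
camera.release()
cv2.destroyAllWindows()
```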

RESULTS & DISCUSSIONS

Detailed Results

We have implemented our model with a variety of dataset sizes, and the model shows an approximate accuracy rate of 95% when the actor/performer signs precisely.

Classification Report:

              precision    recall  f1-score   support
Y                  0.98      0.99      0.99       110
accuracy                               0.95      2502
macro avg          0.96      0.95      0.95      2502
weighted avg       0.96      0.95      0.95      2502
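For reference, a report of this form is typically produced with scikit-learn's classification_report; the tiny synthetic example below (made-up labels, not the study's data) shows how precision, recall, f1-score, and support are generated from true and predicted class indices.

```python
from sklearn.metrics import classification_report

# Tiny synthetic example: true and predicted class indices for three gesture classes.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2]
print(classification_report(y_true, y_pred, target_names=["A", "B", "Y"]))
```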

While background color and light shade minimally impact model detection, a cluttered background with numerous nearby objects can disrupt the model's Region of Interest (ROI) recognition and hand tracking capabilities.

Analysis

Several factors can influence the performance of gesture recognition systems, impacting their ability to accurately identify and interpret hand gestures. One significant factor is lighting conditions, as variations in lighting can affect the visibility and contrast of hand gestures, leading to difficulties in feature extraction and classification. Poor lighting may result in shadows, glare, or uneven illumination, making it challenging for the model to distinguish between different gestures.

Another crucial factor is hand orientation and positioning. The model may be trained on specific hand orientations, but in real-world scenarios, users may present gestures from various angles or perspectives. This variability can introduce ambiguity in the captured hand features, making it harder for the model to recognize gestures accurately.

Background clutter and occlusions are additional factors that can impact performance. In cluttered environments or when hands are partially obscured by objects or other body parts, the model may struggle to isolate the hand region and extract relevant features for classification. Occlusions can also obscure critical parts of the hand, such as fingers or joints, reducing the discriminative power of the input data.

Gesture recognition systems face challenges due to the intricate and diverse nature of human gestures. Gestures can be highly similar or have subtle variations, demanding that models discern fine patterns and distinctions. Furthermore, the swiftness and fluidity of gestures impact recognition accuracy, particularly in fast-paced or dynamic environments, requiring models to adapt to the varying speeds and continuous flow of gestures.

To build robust and generalizable gesture recognition models, it is imperative to address factors including data augmentation, feature normalization and extraction, and model regularization. These techniques strengthen the system's performance in real-world conditions. Furthermore, continuous evaluation and refinement using real-world feedback are crucial to enhance accuracy and practicality. By addressing these challenges, gesture recognition models can be optimized for effective deployment in real-world applications.

Methodological Review

The methodology of utilizing both MediaPipe and VGG16 in sign language recognition involves a multi-stage process that leverages the strengths of both frameworks.

Firstly, MediaPipe is employed for hand landmark detection, accurately localizing key points on the hand such as knuckles and fingertips. This provides a robust representation of hand gestures in real time, overcoming challenges such as partial occlusion and varying hand orientations.

Subsequently, the landmarks extracted by MediaPipe serve as input to the VGG16 model, which acts as a feature extractor and classifier. VGG16 is pre-trained on a large dataset, allowing it to capture high-level features from the hand landmarks and learn discriminative patterns associated with different sign language gestures.

In this methodology, MediaPipe enhances the robustness of hand gesture recognition by providing precise hand landmark localization, while VGG16 utilizes these landmarks to perform fine-grained gesture classification. This combined approach offers a comprehensive solution for sign language recognition, capable of handling diverse hand gestures with high accuracy and efficiency.

By integrating both frameworks, this methodology capitalizes on the strengths of each component, resulting in a powerful and versatile system for real-time sign language recognition.

The loss function used in this study is categorical cross-entropy, as the target variable is represented as a one-hot encoded matrix.

One-hot encoding converts categorical labels into a binary matrix where each row corresponds to a sample (data point) and each column corresponds to a unique class. In this binary matrix:

- Each row has a single 1 (hot) and all other elements are 0s (cold).

- The position of the 1 indicates the class that the sample belongs to.

For example, consider our classification task with three classes: 'A', 'B', and 'C'. A one-hot encoded representation of these classes might look like this:

  'A' → [1, 0, 0]
  'B' → [0, 1, 0]
  'C' → [0, 0, 1]

So, if a sample is an ‘A’, its corresponding one-hot encoded vector would be [1, 0, 0] If it is a ‘B’, the vector would be [0, 1, 0], and so on.

Using one-hot encoding helps in training the model effectively, as it allows the model to learn the categorical nature of the labels and make predictions accordingly.

The categorical cross-entropy loss function then computes the dissimilarity between the true class distribution (the one-hot encoded target labels) and the predicted class distribution (the output probabilities from the model) during training. This helps the model adjust its parameters (weights) to minimize this dissimilarity, thereby improving its ability to make accurate predictions.
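A small sketch of how the one-hot labels and this loss fit together, using Keras utilities and the toy 'A'/'B'/'C' example above (the probability values are made up for illustration):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.losses import CategoricalCrossentropy

labels = ["A", "B", "C", "A"]                     # toy label sequence
class_to_index = {"A": 0, "B": 1, "C": 2}
y_true = to_categorical([class_to_index[c] for c in labels], num_classes=3)
# y_true[0] == [1., 0., 0.]  -> the one-hot vector for 'A'

# Example predicted probabilities from a softmax output
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.6, 0.3, 0.1]])

# Categorical cross-entropy: -sum(y_true * log(y_pred)), averaged over samples
loss = CategoricalCrossentropy()(y_true, y_pred).numpy()
print(loss)
```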

Challenges and Limitations

During the model training process, our work faced several limitations and challenges, both in terms of technology and data availability. Our system is trained with still images as input, so factors such as the distance from the camera to the operator, viewing angle, image quality, and frame rate can affect the accuracy of the system. These factors cause difficulties and challenges for our recognition algorithm, making model running more complicated and the hand gesture recognition system more complex.

However, factors such as image file size, camera type, or whether the user's hands are wearing jewelry have no impact, or only a negligible impact, on the identification system. The presence of distractions in the background and of reflections that may obscure the hand added further complications to the recognition process.

The current study was conducted with a relatively small participant pool, which limited the statistical power and the applicability of the findings. Moreover, the research primarily utilized the VGG16 model; VGG16 is a deep network with 16 layers, resulting in a large model size. This can lead to high memory usage, which may be a concern when deploying the model on devices with limited resources.

These challenges and limitations hinder the performance of the Vietnamese sign language recognition system, but they also create a premise for continued research and development in this field. As technology continues to advance, these systems will play an increasingly important role in bridging communication gaps and empowering the Vietnamese deaf community.

Future Work

As technology continues to advance, students engaging in this field can delve into the development of sophisticated algorithms and machine learning models tailored to recognize and interpret the intricacies of VSL gestures. By leveraging emerging technologies such as computer vision and deep learning, students can strive to create robust systems capable of real-time recognition, thereby facilitating communication accessibility for the Vietnamese Deaf community. Moreover, interdisciplinary collaboration between students specializing in linguistics, computer science, and assistive technology can enrich the research landscape, fostering a holistic approach towards understanding and advancing VSL recognition. As students embark on this journey, they not only contribute to the academic discourse but also pave the way for inclusive technologies that empower individuals with hearing impairments to participate fully in society.

The above concerns future work on practical applications; here are some potential directions:

- Dataset expansion and diversity: Future research could focus on expanding the dataset by including more diverse signers across different backgrounds and conditions. This will help improve model generality and performance, making model training more robust.

- Model refinement and customization: Although our study primarily used the VGG16 model, there may be opportunities to refine and customize the system using other models such as SVM, LSTM, or Dlib in the future, to keep pace with the ever-evolving VSL. Exploring different parameters and model architectures can lead to better results.

- Integration with other biometric methods: Integrating VSL detection with other biometric methods such as facial expression recognition, voice recognition, and body gesture recognition, rather than being limited to hand gestures, could improve the overall accuracy of the system.

- Ethical and user-centered development: Ethics must come first in every endeavor, and AI systems are no exception; ethical considerations are essential. Because the research focuses on a recognition system that takes data from users, it must respect privacy and strict security, and ensure inclusivity and fairness among different user groups.

- Application integration: We can explore integrating VSL recognition into other applications such as healthcare, education, and community services, among many other areas, to expand the system's applications in daily life.

We believe that with the strong development of AI, deep learning, and computer vision, these future directions will be successfully implemented, bringing enormous benefits to the lives of the deaf community in particular, and beyond that to other applications aimed at a broader audience.

CONCLUSION & RECOMMENDATIONS

Summary of Findings

Concluding Statement

From what we experienced implementing this research, we can see a promising solution for enhancing accessibility and communication for individuals with hearing or speaking impairments. By leveraging advanced computer vision and deep learning techniques, this approach enables real-time recognition and translation of sign language gestures, facilitating meaningful interactions between individuals who are deaf or hard of hearing and the broader community. However, further refinement and optimization, along with considerations for data collection, preprocessing, and model tuning, are essential to ensure robust performance and widespread adoption of such technology in practical applications. Overall, this innovative combination of technologies holds immense potential to bridge communication barriers and promote inclusivity in diverse settings.

Practical Applications

Sign language recognition, or gesture recognition in general, can be integrated into a mobile app to facilitate two-way translation of sign language. Users can input sign language gestures through their device's camera, which are then translated into text or spoken language in real time. Similarly, the app can interpret spoken or typed text into sign language gestures displayed on the screen. This bi-directional translation enables seamless communication between individuals who are deaf or hard of hearing and those who do not understand sign language, fostering inclusivity and accessibility in various settings such as education, healthcare, and everyday communication.

Recommendations

To develop inclusive VSL recognition systems, students must collaborate with Deaf community stakeholders to align research with their needs. Tailoring data collection to the diverse gestures and expressions in VSL enhances system robustness and inclusivity. Exploring innovative methodologies, such as combining traditional and deep learning techniques, optimizes accuracy and efficiency. Emphasizing transparency and ethical considerations mitigates bias and promotes fairness in algorithm development. Disseminating findings through academic publications and community initiatives fosters knowledge exchange and societal impact in assistive technologies for the Vietnamese Deaf community.

APPENDIX A DETAILED DESCRIPTION OF VIETNAMESE SIGN LANGUAGE DETECTION

The hand gesture recognition algorithm involves several steps. First, video frames are captured and preprocessed using machine learning to enhance their resolution and identify the hand region. Next, a machine learning algorithm extracts 21 hand features, creating unique identifiers for gestures. Finally, these features are labeled, stored in a database, and used to train and test a model. The VGG16 CNN is commonly used for this task, but its accuracy can be impacted by factors such as distance, angle, image quality, and frame rate. Hence, data processing is crucial for optimal system performance.

APPENDIX B OPERATION OF THE VSL DETECTION SYSTEM

Our system operates according to the following steps:

1. Video and image processing: we use a pre-processing method that detects 21 joints on the hand (hand landmark points) and reconstructs the hand shape across the joints.

2. Gesture identification: each class has a unique label (the class name), and gestures are classified based on the hand reconstruction model, the matches, and the 21 hand points (landmarks).

3. Identification: the system recognizes hand gestures based on images of hands from the original video processed into still images. A machine learning algorithm is then used to extract the hand skeleton with 21 hand points (landmarks). The extracted features are labeled with the recorded labels.

4. Training the model: for motion tracking and high-precision hand and finger recognition, we use the MediaPipe and OpenCV libraries. After data processing, we train the model based on VGG16, then use the test set to check the accuracy of the system.

5. Classification: if the system finds a match, it displays the corresponding hand gesture name based on the labeling. If no match is found, the system may send an alert or a manual verification reminder.

The system offers reporting functionality, leveraging VSL recognition technology. These reports analyze communication data, providing insights into the progress of deaf-mute sign language learners. By monitoring their communication abilities, this system empowers the deaf-mute community, fostering improved communication skills and seamless community integration.

This system provides a safe and effective method of tracking attendance. It can help deaf-mute people integrate into life like hearing people, increasing their work opportunities. However, the accuracy and reliability of the system depend on the quality of the recognition methods and require signers to understand how to perform gestures in accordance with the system's format.

We implemented our model with a variety of dataset sizes, and the model shows an approximate accuracy rate of 95% when the actor/performer signs precisely. Otherwise, the model is very likely to mix up signs, especially those that bear a great resemblance to each other, such as the letter G and the letter R.

The light shade and the background color are recorded to have only a slight effect on the model's detection ability; however, a background with too many objects close behind the hand would somewhat disturb the ROI or the hand tracking task of the model. Factors such as image file size, camera type, or whether the user's hands are wearing jewelry have no impact, or only a negligible impact, on the identification system. The presence of distractions in the background and of reflections that may obscure the hand added further complications to the recognition process. These challenges and limitations hinder the performance of the Vietnamese sign language recognition system, but they also create a premise for continued research and development in this field. As technology continues to advance, these systems will play an increasingly important role in bridging communication gaps and empowering the Vietnamese deaf community.


References

[1] Costello, Elaine, "American Sign Language Dictionary", Random House, ISBN 978-0375426162, 2008.
[2] Vietnamese Ministry of Education and Training, "Promulgate regulations on national standards on sign language for people with disabilities," 2020.
[3] N. Mahamkali and V. Ayyasamy, "OpenCV for Computer Vision Applications," Proceedings of the National Conference on Big Data and Cloud Computing (NCBDC'15), March 20, 2015.
[4] C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M. G. Yong, J. Lee, W.-T. Chang, W. Hua, M. Georg and M. Grundmann, "MediaPipe: A Framework for Building Perception Pipelines," June 2019.
[5] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, May 2015.
[6] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
