HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
FINAL REPORT
APPLIED MACHINE LEARNING
YOLOv5 - DETECTING AND RECOGNIZING
HAND SIGN LANGUAGE
Ho Chi Minh City, …/…
ABSTRACT
In implementing the Applied Machine Learning project "YOLOv5 - Detecting and recognizing hand sign language", we obtained both theoretical and practical results. The project covers how to analyze a system and apply an algorithmic model to it. To do so, we needed to collect a large dataset of images.
CONTENT
ABSTRACT
LIST OF ABBREVIATIONS
LIST OF FIGURES
CHAPTER 1: BACKGROUND KNOWLEDGE
1.1 INTRODUCTION
1.2 OVERVIEW
1.3 YOLO ARCHITECTURE
1.4 YOLO's OUTPUT
1.5 YOLOv5 ARCHITECTURE
1.6 HAND SYMBOL DETECTION
CHAPTER 2: DESIGN AND IMPLEMENTATION OF THE "YOLOV5 - DETECTING AND RECOGNIZING HAND SIGN LANGUAGE" APPLICATION
2.1 DATA SET
2.1.1 Images
2.1.2 Labels
2.2 SETTING UP THE YOLOv5 ENVIRONMENT
2.3 THE TRAINING PROCESS
CHAPTER 3: RESULTS AND DISCUSSION
3.1 RESULTS: DETECT HAND SIGN LANGUAGE
3.2 DISCUSSION
3.3 CONCLUSION AND RECOMMENDATION
APPENDIX
REFERENCE
LIST OF ABBREVIATIONS
YOLO You Only Look Once
WHO World Health Organization
CNN Convolutional Neural Network
CV Computer Vision
AI Artificial Intelligence
PANet Path Aggregation Network
CSPNet Cross Stage Partial Network
FPN Feature Pyramid Network
FLOPS Floating-point operations per second
TP True Positive
FP False Positive
FN False Negative
TN True Negative
mAP Mean Average Precision
mAR Mean Average Recall
LIST OF FIGURES
Figure 1: YOLO network architecture diagram
Figure 2: The network architecture of YOLOv5
Figure 3: Dataset
Figure 4: Labeling an image
Figure 5: Normalized frame-ratio parameters
Figure 6: Cloning the repository and setting up all dependencies for YOLOv5 from Google Colab
Figure 7: Installing the YOLOv5 libraries
Figure 8: Linking datasets from Google Drive
Figure 9: Data collection
Figure 10: Mapping the dataset path and adjusting the number of classes
Figure 11: Starting the training process
Figure 12: Training process
Figure 13: Results of the training process
Figure 14: The results after the training process
Figure 15: The results after the detection process
Figure 16: Confusion Matrix
Figure 17: F1 Score (F1_curve)
Figure 18: Precision (P_curve)
Figure 19: Precision-Recall (PR_curve)
Figure 20: Recall (R_curve)
Figure 21: Results
CHAPTER 1: BACKGROUND KNOWLEDGE
1.1 INTRODUCTION
One in every six people in the world has a hearing problem, and the number is rising rapidly. According to Ms. Suchitra Prasansuk, President of the World Association of Audiologists, World Health Organization (WHO) statistics show that there were approximately 250 million people worldwide with deafness and hearing loss in 2010, and this number increased to approximately 360 million in 2015. Our country currently has 1 to 2.5 million speech- and hearing-impaired people, roughly the population of a province. These figures demonstrate a steady increase in the number of people suffering from hearing loss. Verbal communication in the deaf community is severely limited due to impaired hearing. To replace it, sign language, which uses representations of the hands and body, was created.
Artificial intelligence (AI) is becoming increasingly popular and is affecting many aspects of daily life. Computer vision (CV) is a branch of artificial intelligence that includes digital image acquisition, processing, analysis, and recognition. Deep learning is a discipline that studies algorithms and computer programs so that computers may learn and make predictions in the same manner that humans do. It is used in a variety of applications in science, engineering, and other fields, including object detection and classification. A good example is the CNN (Convolutional Neural Network), which learns to distinguish patterns in images by successively stacking layers on top of each other. CNN is now regarded as the standard model in many full-image classification applications, leveraging machine learning for computer vision tasks.
More and more algorithms and models have been introduced for recognition problems, including the YOLOv5 model, which can be applied specifically to hand-sign recognition. We therefore chose the topic "YOLOv5 - Detecting and recognizing hand sign language" for this Applied Machine Learning final report.
1.2 OVERVIEW
YOLO (You Only Look Once) is a CNN-based model used to detect and identify objects. Its convolutional layers extract features from an image and output the coordinates and labels assigned to each bounding box.
YOLO is considered the fastest algorithm among object recognition models, though not necessarily the most accurate.
The main purpose of YOLO is to predict labels for the objects being classified and to determine the objects' coordinates. YOLO can therefore detect many objects with different labels in a very short time.
YOLO has released five versions so far: v1, v2, v3, v4, and v5. Each generation has upgraded classification, optimized real-time label recognition, and extended prediction limits for bounding boxes.
1.3 YOLO ARCHITECTURE
In the YOLO architecture, the base network is a convolutional network that performs feature extraction. On top of it, extra layers detect objects on the base network's feature map.
The base network of YOLO is composed primarily of convolutional layers and fully connected layers. YOLO architectures are also quite diverse and can be customized to accommodate a wide range of input shapes.
Figure 1: YOLO network architecture diagram
The base network component of the Darknet architecture performs feature extraction. It produces a 7x7x1024 feature map, which is used as input for the extra layers that predict the label and the bounding box coordinates of the object.
1.4 YOLO's OUTPUT
For each box, the network predicts a vector (t_x, t_y, t_w, t_h) that defines the bounding box, where t_x, t_y are the coordinates of the box center and t_w, t_h are its width and height. A further vector (c_1, c_2, ..., c_n) holds the class scores.
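As a concrete illustration (our own sketch, not taken from the report), the following converts one normalized center-format prediction into pixel corner coordinates:

    # Convert a center-format box (t_x, t_y, t_w, t_h), normalized to [0, 1],
    # into pixel corner coordinates for an image of size img_w x img_h.
    def to_corners(tx, ty, tw, th, img_w, img_h):
        x1 = (tx - tw / 2) * img_w   # left edge
        y1 = (ty - th / 2) * img_h   # top edge
        x2 = (tx + tw / 2) * img_w   # right edge
        y2 = (ty + th / 2) * img_h   # bottom edge
        return x1, y1, x2, y2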
1.5 YOLOv5 ARCHITECTURE
In the hand symbol detection task, detection speed and accuracy are critical, and a compact model size influences inference efficiency on resource-constrained edge devices. First, YOLOv5 uses CSPDarknet, which incorporates a cross stage partial network (CSPNet), as its backbone for feature extraction.
Second, to improve information flow, YOLOv5 uses a path aggregation network (PANet) as its neck. PANet adopts a new feature pyramid network (FPN) structure with an improved bottom-up path, which strengthens the propagation of low-level features. At the same time, adaptive feature pooling, which connects the feature grid to all feature levels, ensures that useful information in each level propagates directly to the next subnetwork. PANet improves the use of accurate localization signals in the lower layers, which clearly improves the location accuracy of detected objects.
Third, the YOLO layer, the head of YOLOv5, generates feature maps of different sizes to achieve multi-scale prediction, allowing the model to handle small, medium, and large objects.
Figure 2: The network architecture of YOLOv5
The network consists of three parts: (1) backbone: CSPDarknet, (2) neck: PANet, and (3) head: the YOLO layer. The data is first fed into CSPDarknet, which extracts features, and then into PANet, which fuses them. Finally, the YOLO layer outputs the detection results (class, score, location, size).
YOLOv5 comes in four variants: YOLOv5-small, YOLOv5-medium, YOLOv5-large, and YOLOv5-extraLarge. In this project, we use YOLOv5-small for training.
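For reference, the pretrained small variant can be loaded through PyTorch Hub (see reference 3); a minimal sketch, where 'image.jpg' is a placeholder path:

    import torch

    # Load YOLOv5-small with pretrained weights via PyTorch Hub.
    model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
    results = model('image.jpg')  # run inference on a placeholder image path
    results.print()               # print detected classes and confidences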
1.6 HAND SYMBOL DETECTION
We will use a camera and OpenCV to detect hand symbols in real time. A video can be treated as a sequence of still images known as frames, and hand symbol detection is performed on every frame. To detect hand symbols, we utilize a YOLOv5 model pre-trained for the task.
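A minimal sketch of such a frame-by-frame loop, assuming a trained weights file named best.pt (the file name is our assumption, not from the report):

    import cv2
    import torch

    # Load custom-trained YOLOv5 weights (best.pt is a placeholder name).
    model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')

    cap = cv2.VideoCapture(0)  # open the default webcam
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # YOLOv5 expects RGB input
        results = model(rgb)                          # detect hand symbols in this frame
        annotated = cv2.cvtColor(results.render()[0], cv2.COLOR_RGB2BGR)
        cv2.imshow('hand sign detection', annotated)
        if cv2.waitKey(1) & 0xFF == ord('q'):         # press q to quit
            break
    cap.release()
    cv2.destroyAllWindows()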
YOLOv5 is a real-time object detection algorithm: it is built to run quickly while still returning reasonable accuracy, and it is designed to distinguish objects in a video or image.
To begin, detecting hand symbols requires a large number of images. We label the frames in each image, then pass them to the model, which trains on them and returns results.
The hand symbol variable, which contains the height and width of the rectangle as well as the top-left corner coordinates enclosing the hand, can be used to generate a hand frame.
The preprocessing method is the same as the one used to train the model described in the second section. The next step is to draw a rectangle around the detected hand and label it based on the predictions.
YOLOv5 and its variants are not uniformly accurate, however. The model performs admirably on standard-sized objects but has difficulty detecting small ones, and when dealing with objects whose appearance changes rapidly, accuracy suffers significantly.
CHAPTER 2: DESIGN AND IMPLEMENTATION OF THE "YOLOV5 - DETECTING AND RECOGNIZING HAND SIGN LANGUAGE" APPLICATION
2.1 DATA SET
2.1.1 Images
Figure 3: Dataset
2.1.2 Labels
Figure 4: Labeling an image
YOLOv5 takes image input data in Darknet format: each image containing the objects we label is paired with a .txt file. The .txt file has the following format:
- Each row describes one object.
- Each row has the format: class x_center y_center width height
- Box coordinates are normalized to [0, 1] as x, y, w, h.
- Class indices start at 0.
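For illustration, a sketch of our own (with hypothetical pixel values) that converts a pixel-space box into one such label row:

    # Build one Darknet/YOLO label row from a box given in pixel coordinates.
    def yolo_label_row(cls, x1, y1, x2, y2, img_w, img_h):
        x_center = (x1 + x2) / 2 / img_w
        y_center = (y1 + y2) / 2 / img_h
        width = (x2 - x1) / img_w
        height = (y2 - y1) / img_h
        return f"{cls} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

    # Hypothetical example: class 0 in a 640x480 image.
    print(yolo_label_row(0, 100, 120, 300, 420, 640, 480))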
Figure 5: Normalized frame-ratio parameters
2.2 SETTING UP THE YOLOv5 ENVIRONMENT
We use the Google Colab platform to run the hand symbol detection training; once set up, YOLOv5 can begin training.
We begin by cloning the YOLOv5 repository and installing the dependencies required to run YOLOv5.
Cloning from within Google Colab:
Figure 6: Cloning the repository and setting up all dependencies for YOLOv5 from Google Colab
Next, we install the YOLOv5 libraries and other supporting packages:
Figure 7: Installing the YOLOv5 libraries
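Although the exact cells are only shown as images above, the standard Colab commands for this setup are likely similar to:

    !git clone https://github.com/ultralytics/yolov5   # clone the repository
    %cd yolov5
    %pip install -r requirements.txt                   # install all dependencies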
2.3 THE TRAINING PROCESS
First, we link the Images and Labels datasets from Drive and extract the dataset.
Figure 8: Linking datasets from Google Drive
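A sketch of linking a Drive-hosted dataset in Colab; the archive name and paths are placeholders, not the report's actual ones:

    from google.colab import drive
    drive.mount('/content/drive')  # authorize access to Google Drive

    # Extract the dataset archive (hypothetical name and destination).
    !unzip -q /content/drive/MyDrive/hand_signs.zip -d /content/datasets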
Figure 9: Data collection
Now, we declare the dataset paths and class labels in the coco128.yaml file.
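Figure 10 presumably shows a dataset file of roughly this shape; the paths, class count, and most class names below are placeholders (only FIVE, I, and ONE appear in the report's discussion):

    # coco128.yaml, adapted for the hand-sign dataset (values are assumptions)
    train: /content/datasets/hand_signs/images/train
    val: /content/datasets/hand_signs/images/val
    nc: 5                                   # number of classes (assumed)
    names: ['FIVE', 'I', 'ONE', 'TWO', 'A'] # example labels; only the first three appear in the report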
Figure 10: Mapping the dataset path and adjusting the number of classes
We train for 50 epochs:
Figure 11: Starting the training process
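The launch command behind Figure 11 is presumably of this standard form; only the 50 epochs are stated in the report, and the other flag values are assumptions:

    # Train YOLOv5-small for 50 epochs on the hand-sign dataset.
    !python train.py --img 640 --batch 16 --epochs 50 --data coco128.yaml --weights yolov5s.pt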
Figure 12: Training process
After training, we get the results:
Figure 13: Results of the training process
CHAPTER 3: RESULTS AND DISCUSSION
3.1 RESULTS: DETECT HAND SIGN LANGUAGE
Below are some images of the results after the training process:
Figure 14: The results after the training process
Here are some images of the results after the detection process:
Figure 15: The results after the detection process
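Detection like this is typically run with YOLOv5's detect.py; a sketch with assumed paths (runs/train/exp/weights/best.pt is YOLOv5's default output location):

    # Run inference with the trained weights on a folder of test images.
    !python detect.py --weights runs/train/exp/weights/best.pt --source data/test_images --conf-thres 0.4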
3.2 DISCUSSION
After training, we discuss the results below.
Diagram: Confusion Matrix
The confusion matrix gives us a better view of which data points are classified correctly and which are not.
The model detects well when the confusion matrix is diagonal, meaning the correspondence between the TRUE and PREDICTED sets is 100%.
Here, some "FIVE" points are matched to the background as false positives, that is, the training label is TRUE but the model recognizes FALSE. The "I" and "ONE" classes show the same behavior, but at a lower rate.
A good model produces a confusion matrix with large values on the main diagonal; when rendered in color, the darker the diagonal, the better.
Figure 16: Confusion Matrix
Diagram: F1 Score (F1_curve)
The F1 score summarizes classifier accuracy as the harmonic mean of precision and recall: F1 = 2 x Precision x Recall / (Precision + Recall).
Diagram: Recall (R_curve)
Recall measures how few truly positive points are missed. A high recall means a high True Positive Rate (TPR), i.e., the rate of missing truly positive points is low.
Recall = TP / (TP + FN)
Figure 20: Recall (R_curve)
Note:
+ True Positive (TP): positive points classified as positive.
+ True Negative (TN): negative points classified as negative.
+ False Positive (FP): negative points classified as positive.
+ False Negative (FN): positive points classified as negative.
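These four counts combine into the metrics discussed above; a minimal sketch:

    # Precision, recall, and F1 from raw TP / FP / FN counts.
    def metrics(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0  # correct share of predicted positives
        recall = tp / (tp + fn) if tp + fn else 0.0     # share of actual positives found
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1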
Overall results:
Figure 21: Results
3.3 CONCLUSION AND RECOMMENDATION
a) Conclusion:
The model we built detects basic hand sign language.
However, the number of images in the dataset is still small, so the accuracy is not high.
If multiple hand symbols appear at the same time, or occur in a complex environment, the model ignores some of the symbols in the image. As a result, the output is less accurate than when a single hand symbol is detected in the image.
b) Recommendation:
In the future, we hope to develop a detection model for many hand signs, from basic to complex.
In addition, we want to develop a device that makes it easier for people with hearing impairments to communicate in all situations.
Trang 221 "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head forObject Detection on Drone-captured Scenarios", Xingkui Zhu, Shuchang Lyu, XuWang,QiZhao,https://arxiv.org/pdf/2108.11539.pdf
2 "YOLOv5",https://github.com/ultralytics/yolov5
3 "YOLOv5",https://pytorch.org/hub/ultralytics_yolov5/
4 "Thủngữ–ngôn ngữkýhiệutay",
https://pro.edu.vn/thu-ngu-ngon-ngu-ky-hieu-tay/