VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
RESEARCH REPORT
BUILDING A STUDENT MONITORING SYSTEM IN THE
CLASSROOM BASED ON COMPUTER VISION
Advisors: PhD Kim Dinh Thai
PhD Ha Manh Hung
Team leader: Pham Anh Phuong
ID: 22070154 Class: AIT2022B
Code: CN.NC.SV.23_04
April 15, 2024
TEAM LEADER INFORMATION
- Program: Applied Information Security
- Address: Thach That, Ha Noi
- Phone no./Email: 0328265381 / anhphuong08032004@gmail.com
II Academic Results (from the first year to now)
Academic year | Overall score | Academic rating
1st semester, 1st year | 3.25 | Distinction
2nd semester, 1st year | 3.6 | High Distinction
1st semester, 2nd year | 3.44 | Distinction
III Other achievements:
1. "Technology Talent" Scholarship
2. "Light up Talent" Lazada Scholarship
3. Study encouragement scholarship in the 2nd semester of the first year
Advisor
(Sign and write full name)
Kim Dinh Thai
(Sign and write full name)
Pham Anh Phuong
Contents

Student profile
7.1 Data Cleaning and Standardization
7.2 MTCNN Model
7.3 Using FaceNet in the student attendance checking system
7.3.1 Triplet Selection
7.4 Attendance list update
7.4.1 Database Integration
List of Figures

1 Common topology of a convolutional neural network [9]
3 Some eigenfaces from AT&T Laboratories Cambridge [14]
4 3D model of a human face [14]
21 Model structure: this network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding [36]
22 Anchor, Positive, Negative [37]
28 Test image dataset
2 ACKNOWLEDGMENTS
We would like to express our deep gratitude to PhD Kim Dinh Thai and PhD Ha Manh Hung for their invaluable guidance and support throughout our research process. Their dedication to detail, expertise, and care were crucial in keeping us on track and completing our scientific research successfully.
Their guidance and support helped us navigate the complex research process with ease. They provided us with important insights and ideas in shaping the direction of our research. They were always available to answer our questions, give feedback, and offer constructive comments, helping us improve our work.
Without the contributions of PhD Kim Dinh Thai, PhD Ha Manh Hung, and other advisors, we would not have been able to complete this study. Their dedication, expertise, and commitment to excellence have been instrumental in helping us achieve our goals.
We are truly grateful for the important contributions of PhD Kim Dinh Thai and PhD Ha Manh Hung, and look forward to working with them on future projects. Their guidance and support have been invaluable to us, and we feel honored to have had the opportunity to work with people as talented and dedicated as they are. We hope to continue to cooperate with them in the future and learn from their vast experience and knowledge.
Student, Pham Anh Phuong
3.2 Team members:
Doan Thi Phuong Thao | BEL2022C | 22070018 | dthao18102004@gmail.com
Nguyen Khac Ton | AAI2022A | 22070277 | khacton2004@gmail.com
Nguyen Khac Truong | AAI2022 | 22070156 | Gnourt2004@gmail.com
Nguyen Ngoc Trung | AIT2022B | 22070167 | 22070167@vnu.edu.vn
3.3 Advisor(s):
Kim Dinh Thai, Faculty of Applied Sciences, PhD
Ha Manh Hung, Faculty of Applied Sciences, PhD
3.4 Abstract:
This study presents a novel system that utilizes computer vision techniques to automate attendance taking and student monitoring in the classroom. The proposed system leverages the FaceNet model for face recognition, combined with the MTCNN algorithm to accurately detect and locate student faces. Additionally, YOLOv8 object detection is employed to analyze student behavior, specifically focusing on assessing their concentration levels. The system's performance was evaluated using a comprehensive test set, and the results demonstrated a high level of accuracy, with a recorded average accuracy of 90% for face recognition using FaceNet. Moreover, the utilization of the MTCNN algorithm significantly contributed to precise face localization, ensuring reliable attendance tracking. In terms of behavior analysis, the YOLOv8 object detection model achieved an average accuracy rate of 85-90% in assessing students' concentration levels. This capability enables teachers to efficiently monitor students' engagement during class, facilitating timely interventions and enhancing the overall quality of instruction. By automating the attendance process, the proposed system alleviates the burden on teachers, saving valuable instructional time. Furthermore, the real-time monitoring of students' concentration levels allows educators to identify and address potential issues promptly, fostering a more engaging and productive learning environment.
The findings of this study highlight the potential of computer vision techniques in revolutionizing traditional classroom management practices. The integration of FaceNet, MTCNN, and YOLOv8 object detection enables accurate attendance tracking and behavior analysis, empowering educators with valuable insights into students' participation and focus levels. The system's high accuracy rates underscore its efficacy and practicality in real-world classroom settings. In conclusion, this system presents a robust and efficient solution for automating attendance taking and monitoring students' concentration levels in the classroom. The combination of FaceNet, MTCNN, and YOLOv8 object detection contributes to accurate face recognition, precise face localization, and behavior analysis. The system's performance demonstrates its potential to enhance teaching effectiveness, save instructional time, and create a conducive learning environment.
3.4.1 Keywords:
computer vision, FaceNet, MTCNN, object detection, YOLOv8
3.5 Rationale of the study
Student disengagement has become a significant challenge in modern educational settings. Factors such as lack of focus, tardiness, and absenteeism not only affect individual student performance but also disrupt the overall progress of the class. To address this issue, our research aims to develop a platform that supports educators and institutions in accurately monitoring and recording student attendance. By leveraging computer vision applications such as face recognition and activity tracking, our goal is to create a reliable and convenient system. This system will enable educators and organizations to capture essential information about student presence and activities during the learning process, while ensuring automatic and efficient compliance with classroom regulations. By accessing the platform, users can access reports and statistical data on student attendance, activities, and academic progress, empowering them to implement appropriate measures to support and enhance teaching quality.
In addition, monitoring student engagement through real-time object detection using YOLOv8 is an essential aspect of our research. In dense classrooms, where teachers may struggle to monitor individual students, it is crucial to closely track each student's level of focus and participation. Studies have consistently shown that engaged and active learners tend to outperform their peers. Therefore, teachers must closely monitor each student in the classroom and adapt their approaches to meet individual needs in order to capture and sustain their attention during instruction. However, this task becomes increasingly challenging in large class settings, where instructors may not always have full awareness of student attentiveness and engagement, significantly impacting their academic progress. Real-time visualization of each student's level of interest during lectures is necessary for instructors to adjust their teaching methods and enhance student engagement effectively. The development of a comprehensive monitoring system holds significant potential in the teaching profession and can improve student learning outcomes. By combining technology with innovative teaching methods, educators can gain valuable insights into student engagement levels and adjust their teaching strategies in a timely manner. This proactive approach not only fosters a dynamic and interactive learning environment but also ensures that students actively participate and are motivated to achieve success in their studies.
In summary, prioritizing student engagement and leveraging modern tools and techniques to enhance teaching practices can significantly contribute to the overall effectiveness of the educational experience and ultimately lead to improved academic outcomes for students.
3.6 Research questions
The use of computer vision is applied to student monitoring with the purpose of finding and detecting student behavior and attitudes during class, bringing about a new environment in which the quality of education can be improved. Knowing students' mistakes during class can help them pay attention to lectures and respect teachers more. In addition, the system can also record students' attendance daily, thereby saving time for teachers and improving learning performance. To attain these goals, the following research questions are addressed below.
• What does this study contribute to the quality of school education (attendance checking, student monitoring)?
• What does it contribute to students' awareness? Does reminding students make them more diligent?
3.7 Research Objectives
The research objective of the project to build a student monitoring system in the classroom based on computer vision is to develop a sophisticated and efficient system that can accurately track and monitor student behavior, engagement, and attendance during classroom sessions. This system aims to utilize computer vision technology to automatically detect and recognize students' faces, gestures, and movements, enabling real-time monitoring and analysis of their activities. The primary goal is to create a non-intrusive and privacy-aware system that can provide valuable insights to educators about student participation, attention levels, and overall classroom dynamics.
Additionally, the research seeks to explore the potential of integrating machine learning algorithms and artificial intelligence (AI) to interpret and analyze the captured visual data, leading to the development of predictive models for identifying patterns and trends in student behavior. The ultimate objective is to enhance the overall learning environment by providing educators with actionable information that can support personalized teaching strategies, early intervention for at-risk students, and the optimization of classroom management techniques.
Furthermore, the research aims to address ethical considerations and privacy concerns by implementing robust data protection measures and ensuring transparency in the use of student monitoring technology. Overall, the research objective is to contribute to the advancement of educational technology by leveraging computer vision for the development of a comprehensive student monitoring system that promotes a more engaging and supportive learning experience in the classroom.
3.8 Research Methodology
We have divided the project BUILDING A STUDENT MONITORING SYSTEM IN THE CLASSROOM BASED ON COMPUTER VISION into two smaller projects: the project "Building an automatic student face recognition and attendance system" and the project "Building an automatic monitoring system for recognizing student behaviors in the classroom". This project requires a combination of knowledge in image processing, machine learning, and the Python programming language to build an accurate and efficient face recognition system.
1. Project: Building an automatic student face recognition and attendance system
• This project aims to develop a system that automatically recognizes students' faces and records their attendance.
• The training process involves four stages: face detection using MTCNN, normalization and preprocessing of detected faces, feature extraction using the FaceNet model, and storing the extracted features in a database.
• Images of team members, totaling over 2000, are used for training, covering various facial expressions, angles, accessories, lighting conditions, and resolutions.
• MTCNN is employed for face detection, consisting of three stages: Proposal Network (P-Net), Refine Network (R-Net), and Output Network (O-Net).
• The FaceNet model, developed by Google, is utilized for feature extraction, producing a 128-dimensional embedding vector and trained with the triplet loss function.
• The combination of MTCNN for face detection and FaceNet for face recognition yields high accuracy in verifying students' identities.
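Once MTCNN has cropped a face and FaceNet has produced its embedding, verifying a student's identity reduces to a nearest-neighbor search over the stored embeddings. The following is a minimal sketch in plain Python; the names, the toy 3-dimensional vectors, and the distance threshold are all illustrative made-up values (real FaceNet embeddings are 128-dimensional):

```python
import math

# Hypothetical enrollment database mapping student names to embedding
# vectors. In the real system these would come from FaceNet.
DATABASE = {
    "Pham Anh Phuong": [0.9, 0.1, 0.0],
    "Nguyen Ngoc Trung": [0.0, 0.8, 0.6],
}

def euclidean(a, b):
    """L2 distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(embedding, threshold=0.7):
    """Return the closest enrolled student, or None if nobody is close enough."""
    name, dist = min(
        ((n, euclidean(embedding, e)) for n, e in DATABASE.items()),
        key=lambda pair: pair[1],
    )
    return name if dist <= threshold else None

print(identify([0.85, 0.15, 0.05]))  # close to the first enrolled embedding
print(identify([0.0, 0.0, -1.0]))    # far from every enrolled embedding
```

In the real pipeline the threshold would be tuned on a validation set: a value that is too loose misidentifies students, while one that is too strict marks present students as unknown.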
2. Project: Building an automatic monitoring system for recognizing student behaviors in the classroom
• This project focuses on building a system using YOLOv8 object detection for monitoring student behaviors in the classroom.
• The system provides benefits such as time savings compared to manual attendance, detailed attendance records, and detection of rule violations such as mobile phone usage or lack of focus.
• It helps teachers identify students with positive learning attitudes and those who may need adjustments in teaching methods or content to optimize learning performance.
3.9 Structure
This summary report covers the following main chapters:
• Chapter 1 introduces the foundations of the topic, including convolutional neural networks, behavior recognition, and face recognition. It presents the CNN, a deep neural network commonly used in computer vision, and explains the basic structure and operation of CNNs and how they are applied in object recognition and image classification.
• Chapter 2 focuses on student behavior identification. This chapter describes the problem of behavior identification in the classroom and the benefits of behavior monitoring. It presents methods and technologies to identify behavior through the use of computer vision and other types of algorithmic analysis. This chapter also provides details about the process of building an automated monitoring system, including data collection, data preprocessing, model training, and performance evaluation.
• Chapter 3 focuses on the automatic attendance system. This chapter addresses the problem of taking student attendance in the classroom and the benefits of building an automatic attendance system. It introduces methods and techniques for facial recognition and student identification. This chapter describes in detail the process of building an automated system, including data collection, preprocessing, model training, and performance evaluation.
• Finally, Chapter 4 presents a summary, limitations, and future work. This chapter synthesizes the results and components of building an automatic student monitoring system in the classroom. It also addresses the limitations of the system, along with the difficulties encountered during the research process. Finally, it proposes directions for future development and research, aimed at improving performance, expanding applications to other learning environments, and practical testing.
Through this structure, the report provides a comprehensive view of building an automatic student monitoring system in the classroom, from student behavior identification to the automatic attendance system. It also addresses limitations and future directions for further development and research.
4 LITERATURE REVIEW
Facial recognition presents significant challenges in image analysis and computer vision (Oloyede et al., 2020). The adoption of student monitoring systems in classrooms using computer vision has garnered notable attention for its potential in enhancing education and classroom management. Various studies have explored different aspects of such systems, from technical implementation to effectiveness in improving student engagement and behavior.
D. Bhavana and collaborators proposed an automatic attendance system employing the Local Binary Pattern algorithm for face and voice recognition of students in the classroom, achieving an accuracy of up to 85% [1]. This highlights the potential of using facial and voice recognition for automatic attendance.
Nguyen Thi Uyen Nhi and others have also shown that using MTCNN and FaceNet can achieve high accuracy in verifying the identity of students in the exam room, with accuracies of 88.4% and 92.1% on the FACE_STUDUE and Yale datasets, respectively [2]. This underscores the strength of CNN models combined with FaceNet's triplet loss function in facial recognition.
Partha Chakraborty and colleagues proposed a similar system using the Principal Component Analysis (PCA) algorithm, achieving an average identification rate of 80.22% [3]. Although this accuracy is lower compared to studies using MTCNN and FaceNet, it still demonstrates the potential of traditional methods in automatic attendance.
Lastly, Jose et al.'s study demonstrated that using FaceNet achieved a high accuracy of 97% in multi-camera facial recognition.
Based on the presented research, it can be observed that utilizing facial recognition algorithms and models such as MTCNN and FaceNet is becoming a common trend in developing automatic attendance systems for students. The studies have demonstrated that the combination of CNN models and facial recognition algorithms like FaceNet can effectively identify faces with high accuracy.
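FaceNet's triplet loss, referenced in the studies above, pulls an anchor face closer to a positive example (same person) than to a negative example (different person) by at least a margin. A minimal numeric sketch follows; it is an illustration of the loss formula only, not code from any of the cited systems, and the 2-dimensional embeddings and margin are made-up values:

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss for one (anchor, positive, negative)
    triple of embedding vectors: the positive must be closer to the anchor
    than the negative by at least `margin` (in squared L2 distance)."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)

# A well-separated triple incurs no loss...
print(triplet_loss([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]))
# ...while a negative that sits as close to the anchor as the positive
# is penalized by the full margin.
print(triplet_loss([0.0, 1.0], [0.1, 0.9], [0.1, 1.1]))
```

During training this loss is minimized over many such triples, which is what shapes the embedding space so that simple distance thresholds work at recognition time.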
The feasibility and urgency of facial recognition in real-life situations have been proven, motivating our research to develop facial recognition methods for taking attendance of students in class based on images, taking advantage of the effectiveness of the MTCNN and FaceNet models. Additionally, we also try to improve the accuracy of facial recognition in this context.
Behavior detection technology has made it possible to analyze student behavior in classroom videos; it can provide information on the classroom status and learning performance of students, making it an essential tool for teachers, administrators, students, and parents in schools [4].
By incorporating the BiFormer attention module and Wise-IoU into the YOLOv7 framework, the authors were able to enhance detection precision significantly. This enhancement resulted in an mAP@0.5 of 79%, surpassing the previous outcomes by 1.8%. The experimental findings demonstrate that their model surpasses the original YOLOv7 in terms of precision, mAP@0.5, and mAP@0.5:0.95.
Effective classroom instruction requires monitoring student participation and interaction during class and identifying cues to stimulate students' attention. The ability of teachers to analyze and evaluate students' classroom behavior is becoming a crucial criterion for quality teaching. Artificial intelligence (AI)-based behavior recognition techniques can help evaluate students' attention and engagement during classroom sessions [5]. The research paper presents the precision-recall curve of the model trained with YOLOv5s. The YOLOv5 models obtained an mAP@0.5 of 0.762, with the class "eating food" achieving the highest mAP@0.5 of 0.921. However, the class "reading book" exhibited relatively lower results, with an mAP of 0.689.
Automated learning analytics is becoming an essential topic in the educational area, which needs effective systems to monitor the learning process and provide feedback to the teacher. Recent advances in visual sensors and computer vision methods enable automated monitoring of the behavior and affective states of learners at different levels, from university to preschool [6]. The paper's analysis of the confusion matrix for student ID identification reveals a strong and accurate outcome, indicated by the prominently colored diagonal. Additionally, when utilizing their summarization algorithm, the authors achieved an F1-score of 82.81%, surpassing the score of 72% obtained without using the algorithm. Moreover, by manually labeling the unknown set of sequences generated by the summarization algorithm, a labeling technique referred to as "semi-assist", they were able to achieve an impressive F1-score of 99.23%.
Students’ action behavior performance is an important part of classroom teaching
evalua-tion To detect the action behavior of students in classroom teaching videos, and based on the
detection results, the action behavior sequence of individual students in the teaching time of
knowledge points is obtained and analyzed (7.
In this paper, a novel approach is presented for recognizing students’ action behaviors using
time-series images captured in a classroom setting The results demonstrate that the enhanced
AIA network proposed in this study exhibits stable convergence during the training process and
achieves an impressive accuracy of 92%
5 BACKGROUND
5.1 CNN Network
5.1.1 Overview
• CNN stands for Convolutional Neural Network, a specialized type of neural network commonly used for image and video processing tasks.
• Convolutional Neural Networks (CNNs) are similar to artificial neural networks (ANNs) in that they perform automatic optimization of their neurons during the learning process. Each neuron still receives an input and applies an operation (such as a dot product followed by a nonlinear function), the basis of a standard ANN [8].
5.1.2 CNN architecture
• Overall architecture
The Convolutional Neural Network architecture consists of three main parts:
Figure 1: Common topology of a convolutional neural network [9]
— Input Layer: This is the first part of the network where the input data is fed into
the CNN For images, the input layer usually has dimensions corresponding to the
width, height, and number of channels of the image
— Convolutional Layers: These layers perform convolution operations using filters
(kernels) to extract features from the input data Each convolutional layer generates
multiple feature maps that capture different patterns or features
— Fully Connected Layers: These layers are responsible for making predictions based
on the features extracted by the previous layers They connect every neuron from
the previous layer to every neuron in the current layer Fully connected layers are
commonly used in the final layers of the network for classification or regression
tasks [10].
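The convolution and pooling operations described above can be sketched in plain Python. This is a deliberately tiny, illustrative implementation; real CNNs learn the kernel values during training and run on optimized libraries:

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in most CNN
    libraries) of a 2-D image with a 2-D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over size-by-size windows (edge rows and
    columns that do not fill a whole window are dropped)."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

# A hand-written vertical-edge kernel applied to an image containing
# a sharp vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [[-1, 1], [-1, 1]]
fmap = conv2d(image, edge_kernel)  # strong response along the edge column
print(fmap)
print(max_pool(fmap))
```

Stacking such convolution and pooling stages, with learned rather than hand-written kernels, is exactly what the convolutional layers of Figure 1 do before the fully connected layers make the final prediction.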
Learning behavior classification covers behaviors such as paying attention to lectures, reading books, writing essays, or doing exercises. This can help teachers and schools assess the level of interaction and engagement of students during the learning process.
Cheating behavior can be detected during tests (e.g., using a phone). This enhances fairness and accuracy in assessing students.
CNNs can also analyze students' classroom participation behaviors (e.g., raising a hand). This provides information about the level of initiative and engagement of students during the learning process.
5.2 Behavior Recognition
Behavior recognition, also known as activity recognition, is the process of automatically identifying and understanding human or object behaviors from visual data, such as images or videos. It involves developing algorithms and systems that can analyze and classify different types of behaviors based on patterns and features extracted from the data.
Behavior recognition has gained significant attention in computer vision and artificial intelligence research due to its wide range of applications. By accurately recognizing behaviors, such systems can understand and interpret human actions, activities, interactions, or anomalies. The learned features then drive the final decision-making process, where the behavior is classified based on the learned patterns and discriminative information present in the features [11].
• Surveillance and Security: behavior recognition can identify suspicious or abnormal behaviors in real time. It can assist in crowd monitoring and crowd behavior analysis, helping ensure public safety at events or in crowded areas.
• Human-Computer Interaction (HCI): behavior recognition contributes to enhanced interaction between humans and computers. It enables gesture recognition, allowing users to control devices or interfaces through hand movements or body gestures. Behavior recognition can also be used in affective computing, where it helps detect and interpret human emotions or expressions, enabling more personalized and responsive user experiences.
• Healthcare and Assistive Technologies: In healthcare, behavior recognition plays a significant role in monitoring and assisting individuals. It can be used for fall detection, where it identifies sudden changes in body posture or movement that may indicate a fall, triggering alerts or assistance.
• Driver Monitoring Systems: Behavior recognition finds applications in driver monitoring systems, particularly in the context of driver safety and attention. By analyzing driver behavior, such as eye movements, head pose, or facial expressions, behavior recognition can detect signs of drowsiness, distraction, or inattentiveness. This allows for timely alerts or interventions to prevent accidents and improve road safety.
• Sports Analysis: Behavior recognition is utilized in sports analysis to track and analyze the movements and actions of athletes. It helps in identifying specific actions or gestures relevant to the sport, such as recognizing different types of shots in basketball or detecting specific poses in gymnastics. Behavior recognition can provide valuable insights for coaches, trainers, and sports analysts, aiding in performance evaluation, injury prevention, and strategic planning [12].
5.3 Face Recognition
5.3.1 Face Recognition problem
The field of facial recognition has undergone significant development since its inception in the early 1960s. Initially regarded as a simple computer application, today this technology has been widely integrated into consumer electronics devices, such as smartphones and robots.
Facial recognition is an automated computer application capable of identifying a person from a digital image or a frame of a video; it is part of the field of biometrics, which involves measuring and analyzing human physiological characteristics. A common method used to accomplish this is by comparing facial features from an image with facial data previously stored in a database.
While the accuracy of facial recognition systems may be lower compared to other technologies such as iris or fingerprint recognition, its major advantage lies in its non-contact process [14]. This makes it a valuable tool in many applications, from video surveillance to personnel management and even passenger screening.
Techniques for face recognition:
• Traditional [14]:
Traditional facial recognition algorithms typically work by applying feature extraction techniques to user facial images and then comparing the results with stored facial data. Here are two common approaches:
— Feature analysis: This method focuses on determining the position, size, and shape of important facial components such as the eyes, nose, and mouth. Subsequently, algorithms search for similar points within the stored data.
— Normalization and data compression: This method involves normalizing and compressing facial data to store only the most important information for recognition purposes.
• Human identification at a distance (HID) [14]:
Low-resolution facial images are often enhanced using face hallucination techniques [15], which are applied before the images are sent to the face recognition system.
Figure 3: Some eigenfaces from AT&T Laboratories Cambridge [14]
These techniques utilize example-based machine learning to replace pixels or use nearest-neighbor distribution indices.
For the face hallucination algorithm to work effectively, it needs to be trained on both masked and unmasked facial images. To fill in the occluded regions after removing the mask, this algorithm needs to accurately map the entire facial state. This can be challenging due to the facial expressions captured at the moment in low-resolution images.
• 3-dimensional recognition [14]:
Three-dimensional facial recognition technology utilizes 3D sensors to gather information about the facial structure, including features such as the contours of the eyes, nose, and chin. It is not affected by lighting variations and can identify the face from multiple angles. This technology has been developed thanks to advancements in complex sensors that project light onto the face.
Figure 4: 3D model of a human face [14]
• Thermal cameras [14]:
In this model, the camera disregards accessories such as hats, glasses, masks, etc., and only detects the shape of the head. Especially in low-light conditions and at night, without the need for a flash, the camera maintains a discreet position. However, current thermal facial recognition systems still face difficulties in reliably detecting faces in outdoor environments.
Therefore, in 2018, researchers at the U.S. Army Research Laboratory (ARL) developed a technique to align thermal images captured by thermal cameras with databases of images from conventional cameras. The method utilizes a non-linear regression model to map a thermal image onto a visible facial image and an optimization process to project the result back into the image space.
Figure 5: A pseudocolor image of two people taken in long-wavelength infrared (body-temperature thermal) light [14]
5.3.2 Applications:
• ID verification: Facial recognition technology is increasingly being utilized in ID verification services. It has become a popular form of biometric authentication across various computing platforms and devices [16].
• Face ID: Apple introduced Face ID on the iPhone X, replacing Touch ID with a facial recognition-based biometric authentication system [17]. Face ID uses infrared technology to project over 30,000 dots onto the user's face, analyzing the pattern to authenticate against the device owner's registered face in a secure enclave [17]. The system adapts to changes in appearance and works with accessories like hats, scarves, glasses, and various sunglasses [18]. Additionally, it functions in low-light conditions, using a dedicated infrared flash to capture facial points accurately [19].
• Healthcare: Facial recognition algorithms have been employed to potentially diagnose certain illnesses by analyzing distinct features present in various facial regions such as the nose, cheeks, and other areas [20]. Leveraging well-established datasets, machine learning techniques have been applied to detect genetic irregularities solely based on facial measurements [21]. Additionally, Facial Recognition Technology (FRT) has been utilized for patient authentication before surgical procedures.
6 STUDENT BEHAVIOR RECOGNITION
6.1 YOLOv8
YOLO, which stands for You Only Look Once, is an object detection model popular for its high speed and accuracy. First introduced in 2016 by Joseph Redmon and his colleagues, YOLO has been continuously improved through many versions, with YOLOv8 being the latest version today [23]. The unique feature of YOLO compared to previous object detection algorithms is that it uses an end-to-end neural network to predict both the bounding boxes and the class probabilities in a single pass. This difference helps YOLO achieve higher performance than previous methods, which often repurpose a classification model to perform the detection function [24]. YOLOv8 is an advanced version developed by Ultralytics (the developer of YOLOv5) and released in January 2023. The clear effectiveness of YOLOv8 is the motivation for the team to deploy and use this model [24]. Compared to previous models, YOLOv8 introduces changes such as anchor-free detection and mosaic augmentation. The head of YOLOv8 has been replaced with an anchor-free design instead of the anchor-based head of YOLOv5, as shown in Figure 6. This technique localizes an object by referring to the object's center and then predicting the distances from the center to the edges of the bounding box. It improves processing speed by minimizing the number of bounding-box predictions, which in turn simplifies Non-Maximum Suppression (NMS), a complex post-processing step typically performed after prediction. As for mosaic augmentation, it is a simple method: it concatenates four different images into a new image, which forms the input to the model. Although simple, it is effective in improving object detection performance, especially in cases with limited or low-quality training data.
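The two changes above can be illustrated in a few lines. This is a simplified sketch, not YOLOv8's actual implementation: decode_ltrb shows the anchor-free idea of turning a predicted center and four edge distances into a box, and mosaic tiles four equally sized images (real mosaic augmentation also randomly scales and crops the tiles and remaps their bounding-box labels):

```python
import numpy as np

def decode_ltrb(cx, cy, l, t, r, b):
    """Anchor-free decoding: turn a grid-cell center (cx, cy) plus predicted
    distances to the left/top/right/bottom edges into an (x1, y1, x2, y2) box."""
    return (cx - l, cy - t, cx + r, cy + b)

def mosaic(images):
    """Mosaic augmentation (simplified): tile four equally sized H x W x C
    images into one 2H x 2W x C image."""
    top = np.concatenate([images[0], images[1]], axis=1)     # top-left | top-right
    bottom = np.concatenate([images[2], images[3]], axis=1)  # bottom-left | bottom-right
    return np.concatenate([top, bottom], axis=0)             # stack the two rows

print(decode_ltrb(50, 40, 10, 5, 20, 15))  # (40, 35, 70, 55)

# Four dummy 2x2 RGB "images", each filled with its own index value.
tiles = [np.full((2, 2, 3), i, dtype=np.uint8) for i in range(4)]
print(mosaic(tiles).shape)  # (4, 4, 3)
```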
YOLOv8 delivers remarkable improvements in adaptability to a variety of datasets and scenarios. Diverse versions such as YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x meet diverse usage needs, suitable for many levels of computing resources [22]. Seamless integration with TensorFlow and PyTorch makes YOLOv8 easy to apply to existing computer vision workflows, simplifying deployment for developers and researchers. Compared with previous versions of YOLO and other advanced object detection models, YOLOv8 asserts its competitive position with impressive performance in both accuracy and speed [25]. In short, YOLOv8 is an ideal choice for object detection applications that require high efficiency and flexibility.
YOLOv8 is known for its speed and efficiency in object detection tasks. It processes images quickly while maintaining high accuracy, making it suitable for real-time applications. It also often achieves state-of-the-art performance on benchmark datasets, especially in terms of mean Average Precision (mAP). Its architecture is designed to balance speed and accuracy effectively. YOLOv8 offers flexibility in model configuration and hyperparameters, allowing users to customize the model according to their specific requirements and constraints. Pre-trained versions of YOLOv8 on large datasets are widely available and can be fine-tuned or used directly for specific object detection tasks, saving time and computational resources. YOLOv8 provides an end-to-end solution for object detection, including both training and inference stages, simplifying the development process for users.

Figure 6: The detailed network model architecture of YOLOv8
6.2 Datasets
Classroom teaching has always played a fundamental role in education, as it provides a direct and interactive learning environment for students. However, understanding and analyzing student behavior in the classroom is crucial for evaluating the effectiveness of education. Human behavior is inherently diverse and complex, making it a challenging task to accurately assess and interpret student actions.
In order to address this challenge, we conducted a comprehensive study using image data collected directly from actual classroom recordings. We utilized the SCB-dataset [26] and various online sources to gather a diverse range of classroom images. These images were then processed and labeled using Roboflow, a powerful annotation tool that enabled us to accurately identify and label specific student behaviors.
The labeled images were divided into three distinct sets: Training, Validation, and Testing. We allocated 75% of the data for training purposes, 15% for validation to fine-tune our models, and the remaining 10% for testing the performance of the developed system. This division ensured that our models were trained on a diverse range of data and evaluated on unseen instances, enhancing their generalization capabilities.

Our labeling process focused on five main classes of student behavior, as illustrated in Table 1. These classes encompassed a wide range of actions and activities commonly observed in a classroom setting. By accurately labeling and categorizing these behaviors, we aimed to develop a robust system capable of effectively recognizing and analyzing student actions in real time.
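The 75/15/10 split described above can be sketched as a simple shuffled partition. This is a minimal illustration of the splitting idea, not necessarily the exact procedure used in the report:

```python
import random

def split_dataset(items, train=0.75, val=0.15, seed=42):
    """Shuffle and split a list of labeled samples into train/val/test.

    Ratios follow the report's 75/15/10 split; everything left after
    the train and validation slices goes to the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 75 15 10
```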
Class              Train   Val   Test
focused             1715   331    246
raising_hand         301   130     82
distracted          1248   173    100
sleep                392    99     67
using_phone          952   293    203
Number of images    1835   366    241

Table 1: The number of instances and images of each class in the Train, Validation, and Test datasets
6.3 Evaluation Metrics
When constructing a classification model, it is crucial to assess the proportion of correctly predicted instances relative to the total number of instances. This ratio is known as accuracy. Accuracy serves as a metric to gauge the predictive effectiveness of a model on a dataset. A higher accuracy signifies a more precise model performance [28].
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
IoU (Intersection over Union) is a crucial performance metric that plays a significant role in assessing the accuracy of annotation, segmentation, and object detection algorithms. It provides a quantitative measure of the overlap between the predicted bounding box or segmented region and the ground-truth bounding box or annotated region derived from a dataset. By calculating the ratio of the intersection area to the union area of these regions, IoU offers valuable insights into the precision and reliability of these algorithms. It serves as a useful tool for evaluating and comparing different models, enabling researchers and practitioners to make informed decisions about the effectiveness of their approaches in various computer vision tasks. It can be explained as representing the percentage of overlap between two bounding boxes: the ground truth box (G) and the detection box (D) [22]. It is calculated using the following equation:
IoU = Area(G ∩ D) / Area(G ∪ D)    (2)

Precision: Precision addresses the question of how many predicted instances are truly positive. Essentially, it measures the accuracy of positive predictions, particularly focusing on the positive group (in this case, the "BAD" records). A higher precision indicates better performance in classifying the positive group, which in turn reflects the model's ability to accurately identify instances belonging to the "BAD" category [29].

Precision = TP / Total Predicted Positive    (3)
Recall: Recall assesses the ability to correctly identify positive instances among all samples that actually belong to the positive group. It quantifies the proportion of true positives that the model correctly identifies [29]. The formula for recall is as follows:

Recall = TP / Total Actual Positive    (4)
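The metrics above can be checked directly with a small sketch. Boxes are given as (x1, y1, x2, y2), and the confusion counts at the end are hypothetical values for illustration:

```python
def iou(g, d):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(g[0], d[0]), max(g[1], d[1])  # intersection corners
    ix2, iy2 = min(g[2], d[2]), min(g[3], d[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(g) + area(d) - inter
    return inter / union if union else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)  # TP / total predicted positive

def recall(tp, fn):
    return tp / (tp + fn)  # TP / total actual positive

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-shifted -> 50/150

# Hypothetical confusion counts, for illustration only.
print(accuracy(80, 90, 10, 20))             # 0.85
print(precision(80, 10), recall(80, 20))
```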
Confusion matrix (best case):
TP (True Positive) is the total number of detections with IoU greater than or equal to 0.5; FP (False Positive) is the total number of detections with IoU less than 0.5; and FN (False Negative) is the total number of undetected objects in the test set. Precision is a performance measure that evaluates the ability to accurately identify true positives (TP) among all positive predictions. The Average Precision (AP) is the average accuracy of the model, while the mean Average Precision (mAP) is the average of AP across all detected classes; K represents the number of categories. In this paper, mAP50 and mAP50-95 are used to evaluate the performance of different models. mAP50 refers to the average precision for all classes at an IoU of 0.5, while mAP50-95 refers to the average precision for all classes within a range of IoU from 0.5 to 0.95, with a step size of 0.05 [29].
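Given per-class average precisions, mAP is simply their mean over the K classes. The AP50 values below are hypothetical placeholders for the five behavior classes, not results from this report:

```python
# mAP is the mean of per-class average precision (AP) values.
# The AP50 numbers below are hypothetical, for illustration only.
ap50_per_class = {
    "focused": 0.91,
    "raising_hand": 0.78,
    "distracted": 0.83,
    "sleep": 0.88,
    "using_phone": 0.80,
}
map50 = sum(ap50_per_class.values()) / len(ap50_per_class)
print(round(map50, 3))  # 0.84
```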
The TP, FP, TN, and FN indicators have the following meanings:
TP (True Positive): total number of positive pattern-matching prediction cases.
TN (True Negative): total number of negative pattern-matching prediction cases.
FP (False Positive): total number of cases where observations belonging to the negative label are predicted as positive.
FN (False Negative): total number of cases where observations belonging to the positive label are predicted as negative [29].