VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
RESEARCH REPORT
BUILDING A STUDENT MONITORING SYSTEM IN THE
CLASSROOM BASED ON COMPUTER VISION
Advisors: PhD Kim Dinh Thai
PhD Ha Manh Hung
Team leader: Pham Anh Phuong
ID: 22070154 Class: AIT2022B
Code: CN.NC.SV.23_04
April 15, 2024
TEAM LEADER INFORMATION
- Program: Applied Information Security
- Address: Thach That, Ha Noi
- Phone no./Email: 0328265381 / anhphuong08032004@gmail.com
II Academic Results (from the first year to now)
Academic year | Overall score | Academic rating
1st semester, 1st year | 3.25 | Distinction
2nd semester, 1st year | 3.6 | High Distinction
1st semester, 2nd year | 3.44 | Distinction
III Other achievements:
1. "Technology Talent" Scholarship
2. "Light up Talent" Lazada Scholarship
3. Study encouragement scholarship in the 2nd semester of the first year
Advisor
(Sign and write full name)
Kim Dinh Thai
(Sign and write full name)
Pham Anh Phuong
Contents

Student profile
7.1 Data Cleaning and Standardization
7.2 MTCNN Model
7.3 Using FaceNet in the student attendance checking system
7.3.1 Triplet Selection
7.4 Attendance list update
7.4.1 Database Integration
List of Figures

1 Common topology of a convolutional neural network [9]
3 Some eigenfaces from AT&T Laboratories Cambridge [14]
4 3D model of a human face [14]
21 Model structure: this network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding [36]
22 Anchor, Positive, Negative [37]
28 Test image dataset
2 ACKNOWLEDGMENTS
We would like to express our deep gratitude to PhD Kim Dinh Thai and PhD Ha Manh Hung for their invaluable guidance and support throughout our research process. Their dedication to detail, expertise, and care were crucial in keeping us on track and completing our scientific research successfully.
Their guidance and support helped us navigate the complex research process with ease. They provided us with important insights and ideas in shaping the direction of our research. They were always available to answer our questions, give feedback, and offer constructive comments, helping us improve our work.
Without the contributions of PhD Kim Dinh Thai, PhD Ha Manh Hung, and other advisors, we would not have been able to complete this study. Their dedication, expertise, and commitment to excellence have been instrumental in helping us achieve our goals.
We are truly grateful for the important contributions of PhD Kim Dinh Thai and PhD Ha Manh Hung, and look forward to working with them on future projects. Their guidance and support have been invaluable to us, and we feel honored to have had the opportunity to work with people as talented and dedicated as they are. We hope to continue to cooperate with them in the future and learn from their vast experience and knowledge.
Student, Pham Anh Phuong
3.2 Team members:
Doan Thi Phuong Thao | BEL2022C | 22070018 | dthao18102004@gmail.com
Nguyen Khac Ton | AAI2022A | 22070277 | khacton2004@gmail.com
Nguyen Khac Truong | AAI2022 | 22070156 | Gnourt2004@gmail.com
Nguyen Ngoc Trung | AIT2022B | 22070167 | 22070167@vnu.edu.vn
3.3 Advisor(s):
Kim Dinh Thai, Faculty of Applied Sciences, PhD
Ha Manh Hung, Faculty of Applied Sciences, PhD
3.4 Abstract:
This study presents a novel system that utilizes computer vision techniques to automate attendance taking and student monitoring in the classroom. The proposed system leverages the FaceNet model for face recognition, combined with the MTCNN algorithm to accurately detect and locate student faces. Additionally, YOLOv8 object detection is employed to analyze student behavior, specifically focusing on assessing their concentration levels. The system's performance was evaluated using a comprehensive test set, and the results demonstrated a high level of accuracy, with a recorded average accuracy of 90% for face recognition using FaceNet. Moreover, the utilization of the MTCNN algorithm significantly contributed to precise face localization, ensuring reliable attendance tracking. In terms of behavior analysis, the YOLOv8 object detection model achieved an average accuracy rate of 85-90% in assessing students' concentration levels. This capability enables teachers to efficiently monitor students' engagement during class, facilitating timely interventions and enhancing the overall quality of instruction. By automating the attendance process, the proposed system alleviates the burden on teachers, saving valuable instructional time. Furthermore, the real-time monitoring of students' concentration levels allows educators to identify and address potential issues promptly, fostering a more engaging and productive learning environment.
The findings of this study highlight the potential of computer vision techniques in revolutionizing traditional classroom management practices. The integration of FaceNet, MTCNN, and YOLOv8 object detection enables accurate attendance tracking and behavior analysis, empowering educators with valuable insights into students' participation and focus levels. The system's high accuracy rates underscore its efficacy and practicality in real-world classroom settings. In conclusion, this system presents a robust and efficient solution for automating attendance taking and monitoring students' concentration levels in the classroom. The combination of FaceNet, MTCNN, and YOLOv8 object detection contributes to accurate face recognition, precise face localization, and behavior analysis. The system's performance demonstrates its potential to enhance teaching effectiveness, save instructional time, and create a conducive learning environment.
3.4.1 Keywords:
computer vision, FaceNet, MTCNN, object detection, YOLOv8
3.5 Rationale of the study
Student disengagement has become a significant challenge in modern educational settings. Factors such as lack of focus, tardiness, and absenteeism not only affect individual student performance but also disrupt the overall progress of the class. To address this issue, our research aims to develop a platform that supports educators and institutions in accurately monitoring and recording student attendance. By leveraging computer vision applications such as face recognition and activity tracking, our goal is to create a reliable and convenient system. This system will enable educators and organizations to capture essential information about student presence and activities during the learning process, while ensuring automatic and efficient compliance with classroom regulations. By accessing the platform, users can access reports and statistical data on student attendance, activities, and academic progress, empowering them to implement appropriate measures to support and enhance teaching quality.
In addition, monitoring student engagement through real-time object detection using YOLOv8 is an essential aspect of our research. In dense classrooms, where teachers may struggle to monitor individual students, it is crucial to closely track each student's level of focus and participation. Studies have consistently shown that engaged and active learners tend to outperform their peers. Therefore, teachers must closely monitor each student in the classroom and adapt their approaches to meet individual needs in order to capture and sustain their attention during instruction. However, this task becomes increasingly challenging in large class settings, where instructors may not always have full awareness of student attentiveness and engagement, significantly impacting their academic progress. Real-time visualization of each student's level of interest during lectures is necessary for instructors to adjust their teaching methods and enhance student engagement effectively. The development of a comprehensive monitoring system holds significant potential in the teaching profession and can improve student learning outcomes. By combining technology with innovative teaching methods, educators can gain valuable insights into student engagement levels and adjust their teaching strategies in a timely manner. This proactive approach not only fosters a dynamic and interactive learning environment but also ensures that students actively participate and are motivated to achieve success in their studies.
In summary, prioritizing student engagement and leveraging modern tools and techniques to enhance teaching practices can significantly contribute to the overall effectiveness of the educational experience and ultimately lead to improved academic outcomes for students.
3.6 Research questions
The use of computer vision is applied to student monitoring with the purpose of finding and detecting student behavior and attitudes during class, bringing about a new environment in which the quality of education can be improved. Knowing students' mistakes during class can help them pay attention to lectures and respect teachers more. In addition, the system can also record students' attendance daily, thereby saving time for teachers and improving learning performance. To attain these goals, the following research questions are addressed below.
• What does this study contribute to the quality of school education (attendance checking, student monitoring)?
• What does it contribute to students' awareness? Does reminding students make them more diligent?
3.7 Research Objectives
The research objective of the project to build a student monitoring system in the classroom based on computer vision is to develop a sophisticated and efficient system that can accurately track and monitor student behavior, engagement, and attendance during classroom sessions. This system aims to utilize computer vision technology to automatically detect and recognize students' faces, gestures, and movements, enabling real-time monitoring and analysis of their activities. The primary goal is to create a non-intrusive and privacy-aware system that can provide valuable insights to educators about student participation, attention levels, and overall classroom dynamics.
Additionally, the research seeks to explore the potential of integrating machine learning algorithms and artificial intelligence (AI) to interpret and analyze the captured visual data, leading to the development of predictive models for identifying patterns and trends in student behavior. The ultimate objective is to enhance the overall learning environment by providing educators with actionable information that can support personalized teaching strategies, early intervention for at-risk students, and the optimization of classroom management techniques.
Furthermore, the research aims to address ethical considerations and privacy concerns by implementing robust data protection measures and ensuring transparency in the use of student monitoring technology. Overall, the research objective is to contribute to the advancement of educational technology by leveraging computer vision for the development of a comprehensive student monitoring system that promotes a more engaging and supportive learning experience in the classroom.
3.8 Research Methodology
We have divided the project BUILDING A STUDENT MONITORING SYSTEM IN THE CLASSROOM BASED ON COMPUTER VISION into two smaller projects: the project "Building an automatic student face recognition and attendance system" and the project "Building an automatic monitoring system for recognizing student behaviors in the classroom". This project requires a combination of knowledge in image processing, machine learning, and the Python programming language to build an accurate and efficient face recognition system.
1. Project: Building an automatic student face recognition and attendance system
• This project aims to develop a system that automatically recognizes students' faces and records their attendance.
• The training process involves four stages: face detection using MTCNN, normalization and preprocessing of detected faces, feature extraction using the FaceNet model, and storing the extracted features in a database.
• Images of team members, totaling over 2000, are used for training, covering various facial expressions, angles, accessories, lighting conditions, and resolutions.
• MTCNN is employed for face detection, consisting of three stages: Proposal Network (P-Net), Refine Network (R-Net), and Output Network (O-Net).
• The FaceNet model, developed by Google, is utilized for feature extraction, producing a 128-dimensional embedding vector and trained with the triplet loss function.
• The combination of MTCNN for face detection and FaceNet for face recognition yields high accuracy in verifying students' identities.
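Once MTCNN has cropped a face and FaceNet has produced its embedding, verifying a student's identity reduces to a nearest-neighbor search over the stored embeddings. The following is a minimal sketch in plain Python; the names, the toy 3-dimensional vectors, and the distance threshold are all illustrative made-up values (real FaceNet embeddings are 128-dimensional):

```python
import math

# Hypothetical enrollment database mapping student names to embedding
# vectors. In the real system these would come from FaceNet.
DATABASE = {
    "Pham Anh Phuong": [0.9, 0.1, 0.0],
    "Nguyen Ngoc Trung": [0.0, 0.8, 0.6],
}

def euclidean(a, b):
    """L2 distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(embedding, threshold=0.7):
    """Return the closest enrolled student, or None if nobody is close enough."""
    name, dist = min(
        ((n, euclidean(embedding, e)) for n, e in DATABASE.items()),
        key=lambda pair: pair[1],
    )
    return name if dist <= threshold else None

print(identify([0.85, 0.15, 0.05]))  # close to the first enrolled embedding
print(identify([0.0, 0.0, -1.0]))    # far from every enrolled embedding
```

In the real pipeline the threshold would be tuned on a validation set: a value that is too loose misidentifies students, while one that is too strict marks present students as unknown.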
2. Project: Building an automatic monitoring system for recognizing student behaviors in the classroom
• This project focuses on building a system using YOLOv8 object detection for monitoring student behaviors in the classroom.
• The system provides benefits such as time savings compared to manual attendance, detailed attendance records, and detection of rule violations such as mobile phone usage or lack of focus.
• It helps teachers identify students with positive learning attitudes and those who may need adjustments in teaching methods or content to optimize learning performance.
3.9 Structure
This summary report covers the following main chapters:
• Chapter 1 introduces the foundations of the topic, including convolutional neural networks, behavior recognition, and face recognition. It presents the CNN, a deep neural network commonly used in computer vision, and explains the basic structure and operation of CNNs and how they are applied in object recognition and image classification.
• Chapter 2 focuses on student behavior identification. This chapter describes the problem of behavior identification in the classroom and the benefits of behavior monitoring. It presents methods and technologies to identify behavior through the use of computer vision and other types of algorithmic analysis. This chapter also provides details about the process of building an automated monitoring system, including data collection, data preprocessing, model training, and performance evaluation.
• Chapter 3 focuses on the automatic attendance system. This chapter addresses the problem of taking student attendance in the classroom and the benefits of building an automatic attendance system. It introduces methods and techniques for facial recognition and student identification. This chapter describes in detail the process of building an automated system, including data collection, preprocessing, model training, and performance evaluation.
• Finally, Chapter 4 presents a summary, limitations, and future work. This chapter synthesizes the results and components of building an automatic student monitoring system in the classroom. It also addresses the limitations of the system, along with the difficulties encountered during the research process. Finally, it proposes directions for future development and research, aimed at improving performance, expanding applications to other learning environments, and practical testing.
Through this structure, the report provides a comprehensive view of building an automatic student monitoring system in the classroom, from student behavior identification to the automatic attendance system. It also addresses limitations and future directions for further development and research.
4 LITERATURE REVIEW
Facial recognition presents significant challenges in image analysis and computer vision (Oloyede et al., 2020). The adoption of student monitoring systems in classrooms using computer vision has garnered notable attention for its potential in enhancing education and classroom management. Various studies have explored different aspects of such systems, from technical implementation to effectiveness in improving student engagement and behavior.
D. Bhavana and collaborators proposed an automatic attendance system employing the Local Binary Pattern algorithm for face and voice recognition of students in the classroom, achieving an accuracy of up to 85% [1]. This highlights the potential of using facial and voice recognition for automatic attendance.
Nguyen Thi Uyen Nhi and others have also shown that using MTCNN and FaceNet can achieve high accuracy in verifying the identity of students in the exam room, with accuracies of 88.4% and 92.1% on the FACE_STUDUE and Yale datasets, respectively [2]. This underscores the strength of CNN models combined with FaceNet's triplet loss function in facial recognition.
Partha Chakraborty and colleagues proposed a similar system using the Principal Component Analysis (PCA) algorithm, achieving an average identification rate of 80.22% [3]. Although this accuracy is lower compared to studies using MTCNN and FaceNet, it still demonstrates the potential of traditional methods in automatic attendance.
Lastly, Jose et al.'s study demonstrated that using FaceNet achieved a high accuracy of 97% in multi-camera facial recognition.
Based on the presented research, it can be observed that utilizing facial recognition algorithms and models such as MTCNN and FaceNet is becoming a common trend in developing automatic attendance systems for students. The studies have demonstrated that the combination of CNN models and facial recognition algorithms like FaceNet can effectively identify faces with high accuracy.
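FaceNet's triplet loss, referenced in the studies above, pulls an anchor face closer to a positive example (same person) than to a negative example (different person) by at least a margin. A minimal numeric sketch follows; it is an illustration of the loss formula only, not code from any of the cited systems, and the 2-dimensional embeddings and margin are made-up values:

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss for one (anchor, positive, negative)
    triple of embedding vectors: the positive must be closer to the anchor
    than the negative by at least `margin` (in squared L2 distance)."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)

# A well-separated triple incurs no loss...
print(triplet_loss([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]))
# ...while a negative that sits as close to the anchor as the positive
# is penalized by the full margin.
print(triplet_loss([0.0, 1.0], [0.1, 0.9], [0.1, 1.1]))
```

During training this loss is minimized over many such triples, which is what shapes the embedding space so that simple distance thresholds work at recognition time.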
The feasibility and urgency of facial recognition in real-life situations have been proven, motivating our research to develop facial recognition methods for taking attendance of students in class based on images, taking advantage of the effectiveness of the MTCNN and FaceNet models. Additionally, we also try to improve the accuracy of facial recognition in this context.
Behavior detection technology has made it possible to analyze student behavior in classroom videos; it can provide information on the classroom status and learning performance of students, making it an essential tool for teachers, administrators, students, and parents in schools [4].
By incorporating the BiFormer attention module and Wise-IoU into the YOLOv7 framework, the authors were able to enhance detection precision significantly. This enhancement resulted in an mAP@0.5 of 79%, surpassing the previous outcomes by 1.8%. The experimental findings demonstrate that their model surpasses the original YOLOv7 in terms of precision, mAP@0.5, and mAP@0.5:0.95.
Effective classroom instruction requires monitoring student participation and interaction during class and identifying cues to stimulate students' attention. The ability of teachers to analyze and evaluate students' classroom behavior is becoming a crucial criterion for quality teaching. Artificial intelligence (AI)-based behavior recognition techniques can help evaluate students' attention and engagement during classroom sessions [5]. The research paper presents the precision-recall curve of the model trained with YOLOv5s. The YOLOv5 models obtained an mAP@0.5 of 0.762, with the class "eating food" achieving the highest mAP@0.5 of 0.921. However, the class "reading book" exhibited relatively lower results, with an mAP of 0.689.
Automated learning analytics is becoming an essential topic in the educational area, which needs effective systems to monitor the learning process and provide feedback to the teacher. Recent advances in visual sensors and computer vision methods enable automated monitoring of the behavior and affective states of learners at different levels, from university to preschool [6]. The paper's analysis of the confusion matrix for student ID identification reveals a strong and accurate outcome, indicated by the prominently colored diagonal. Additionally, when utilizing their summarization algorithm, the authors achieved an F1-score of 82.81%, surpassing the score of 72% obtained without using the algorithm. Moreover, by manually labeling the unknown set of sequences generated by the summarization algorithm, a labeling technique referred to as "semi-assist", they were able to achieve an impressive F1-score of 99.23%.
Students’ action behavior performance is an important part of classroom teaching
evalua-tion To detect the action behavior of students in classroom teaching videos, and based on the
detection results, the action behavior sequence of individual students in the teaching time of
knowledge points is obtained and analyzed (7.
In this paper, a novel approach is presented for recognizing students’ action behaviors using
time-series images captured in a classroom setting The results demonstrate that the enhanced
AIA network proposed in this study exhibits stable convergence during the training process and
achieves an impressive accuracy of 92%
5 BACKGROUND
5.1 CNN Network
5.1.1 Overview
• CNN stands for Convolutional Neural Network, a specialized type of neural network commonly used for image and video processing tasks.
• Convolutional Neural Networks (CNNs) are similar to artificial neural networks (ANNs) in that they perform automatic optimization of their neurons during the learning process. Each neuron still receives an input and applies an operation (such as a dot product followed by a nonlinear function), the basis of a standard ANN [8].
5.1.2 CNN architecture
• Overall architecture
The Convolutional Neural Network architecture consists of three main parts:
Figure 1: Common topology of a convolutional neural network [9]
— Input Layer: This is the first part of the network where the input data is fed into
the CNN For images, the input layer usually has dimensions corresponding to the
width, height, and number of channels of the image
— Convolutional Layers: These layers perform convolution operations using filters
(kernels) to extract features from the input data Each convolutional layer generates
multiple feature maps that capture different patterns or features
— Fully Connected Layers: These layers are responsible for making predictions based
on the features extracted by the previous layers They connect every neuron from
the previous layer to every neuron in the current layer Fully connected layers are
commonly used in the final layers of the network for classification or regression
tasks [10].
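The convolution and pooling operations described above can be sketched in plain Python. This is a deliberately tiny, illustrative implementation; real CNNs learn the kernel values during training and run on optimized libraries:

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in most CNN
    libraries) of a 2-D image with a 2-D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over size-by-size windows (edge rows and
    columns that do not fill a whole window are dropped)."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

# A hand-written vertical-edge kernel applied to an image containing
# a sharp vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [[-1, 1], [-1, 1]]
fmap = conv2d(image, edge_kernel)  # strong response along the edge column
print(fmap)
print(max_pool(fmap))
```

Stacking such convolution and pooling stages, with learned rather than hand-written kernels, is exactly what the convolutional layers of Figure 1 do before the fully connected layers make the final prediction.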
Learning behavior classification covers behaviors such as paying attention to lectures, reading books, writing essays, or doing exercises. This can help teachers and schools assess the level of interaction and engagement of students during the learning process.
Cheating behavior can be detected during tests (e.g., using a phone). This enhances fairness and accuracy in assessing students.
CNNs can also analyze students' classroom participation behaviors (e.g., raising a hand). This provides information about the level of initiative and engagement of students during the learning process.
5.2 Behavior Recognition
Behavior recognition, also known as activity recognition, is the process of automatically identifying and understanding human or object behaviors from visual data, such as images or videos. It involves developing algorithms and systems that can analyze and classify different types of behaviors based on patterns and features extracted from the data.
Behavior recognition has gained significant attention in computer vision and artificial intelligence research due to its wide range of applications. By accurately recognizing behaviors, such systems can understand and interpret human actions, activities, interactions, or anomalies. The learned features then drive the final decision-making process, where the behavior is classified based on the learned patterns and discriminative information present in the features [11].
• Surveillance and Security: behavior recognition can identify suspicious or abnormal behaviors in real time. It can assist in crowd monitoring and crowd behavior analysis, helping ensure public safety at events or in crowded areas.
• Human-Computer Interaction (HCI): behavior recognition contributes to enhanced interaction between humans and computers. It enables gesture recognition, allowing users to control devices or interfaces through hand movements or body gestures. Behavior recognition can also be used in affective computing, where it helps detect and interpret human emotions or expressions, enabling more personalized and responsive user experiences.
• Healthcare and Assistive Technologies: In healthcare, behavior recognition plays a significant role in monitoring and assisting individuals. It can be used for fall detection, where it identifies sudden changes in body posture or movement that may indicate a fall, triggering alerts or assistance.
• Driver Monitoring Systems: Behavior recognition finds applications in driver monitoring systems, particularly in the context of driver safety and attention. By analyzing driver behavior, such as eye movements, head pose, or facial expressions, behavior recognition can detect signs of drowsiness, distraction, or inattentiveness. This allows for timely alerts or interventions to prevent accidents and improve road safety.
• Sports Analysis: Behavior recognition is utilized in sports analysis to track and analyze the movements and actions of athletes. It helps in identifying specific actions or gestures relevant to the sport, such as recognizing different types of shots in basketball or detecting specific poses in gymnastics. Behavior recognition can provide valuable insights for coaches, trainers, and sports analysts, aiding in performance evaluation, injury prevention, and strategic planning [12].
5.3 Face Recognition
5.3.1 Face Recognition problem
The field of facial recognition has undergone significant development since its inception in the early 1960s. Initially regarded as a simple computer application, today this technology has been widely integrated into consumer electronics devices, such as smartphones and robots.
Facial recognition is an automated computer application capable of identifying a person from a digital image or a frame of a video; it is part of the field of biometrics, which involves measuring and analyzing human physiological characteristics. A common method used to accomplish this is by comparing facial features from an image with facial data previously stored in a database.
While the accuracy of facial recognition systems may be lower compared to other technologies such as iris or fingerprint recognition, its major advantage lies in its non-contact process [14]. This makes it a valuable tool in many applications, from video surveillance to personnel management and even passenger screening.
Techniques for face recognition:
• Traditional [14]:
Traditional facial recognition algorithms typically work by applying feature extraction techniques to user facial images and then comparing the results with stored facial data. Here are two common approaches:
— Feature analysis: This method focuses on determining the position, size, and shape of important facial components such as the eyes, nose, and mouth. Subsequently, algorithms search for similar points within the stored data.
— Normalization and data compression: This method involves normalizing and compressing facial data to store only the most important information for recognition purposes.
• Human identification at a distance (HID) [14]:
Low-resolution facial images are often enhanced using face hallucination techniques [15], which are applied before the images are sent to the face recognition system.
Figure 3: Some eigenfaces from AT&T Laboratories Cambridge [14]
These techniques utilize example-based machine learning to replace pixels or use nearest-neighbor distribution indices.
For the face hallucination algorithm to work effectively, it needs to be trained on both masked and unmasked facial images. To fill in the occluded regions after removing the mask, this algorithm needs to accurately map the entire facial state. This can be challenging due to the facial expressions captured at the moment in low-resolution images.
• 3-dimensional recognition [14]:
Three-dimensional facial recognition technology utilizes 3D sensors to gather information about the facial structure, including features such as the contours of the eyes, nose, and chin. It is not affected by lighting variations and can identify the face from multiple angles. This technology has been developed thanks to advancements in complex sensors that project light onto the face.
Figure 4: 3D model of a human face [14]
• Thermal cameras [14]:
In this model, the camera disregards accessories such as hats, glasses, masks, etc., and only detects the shape of the head. Especially in low-light conditions and at night, without the need for a flash, the camera maintains a discreet position. However, current thermal facial recognition systems still face difficulties in reliably detecting faces in outdoor environments.
Therefore, in 2018, researchers at the U.S. Army Research Laboratory (ARL) developed a technique to align thermal images captured by thermal cameras with databases of images from conventional cameras. The method utilizes a non-linear regression model to map a thermal image onto a visible facial image and an optimization process to project the result back into the image space.
Figure 5: A pseudocolor image of two people taken in long-wavelength infrared (body-temperature thermal) light [14]
5.3.2 Applications:
• ID verification: Facial recognition technology is increasingly being utilized in ID verification services. It has become a popular form of biometric authentication across various computing platforms and devices [16].
• Face ID: Apple introduced Face ID on the iPhone X, replacing Touch ID with a facial recognition-based biometric authentication system [17]. Face ID uses infrared technology to project over 30,000 dots onto the user's face, analyzing the pattern to authenticate against the device owner's registered face in a secure enclave [17]. The system adapts to changes in appearance and works with accessories like hats, scarves, glasses, and various sunglasses [18]. Additionally, it functions in low-light conditions, using a dedicated infrared flash to capture facial points accurately [19].
• Healthcare: Facial recognition algorithms have been employed to potentially diagnose certain illnesses by analyzing distinct features present in various facial regions such as the nose, cheeks, and other areas [20]. Leveraging well-established datasets, machine learning techniques have been applied to detect genetic irregularities solely based on facial measurements [21]. Additionally, Facial Recognition Technology (FRT) has been utilized for patient authentication before surgical procedures.
6 STUDENT BEHAVIOR RECOGNITION
6.1 YOLOv8
YOLO, which stands for You Only Look Once, is an object detection model popular for its high speed and accuracy. First introduced in 2016 by Joseph Redmon and his colleagues, YOLO has been continuously improved through many versions, with YOLOv8 being the latest version today [23]. The unique feature of YOLO compared to previous object detection algorithms is that it uses an end-to-end neural network to predict both the bounding boxes and the class probabilities in a single pass. This difference helps YOLO achieve higher performance than previous methods, which often repurpose a classification model to perform the detection function [24]. YOLOv8 is an advanced version developed by Ultralytics (the developer of YOLOv5) and released in January 2023. The clear effectiveness of YOLOv8 is the motivation for the team to deploy and use this model [24]. Compared to previous models, YOLOv8 introduces changes such as anchor-free detection and mosaic augmentation. The head of YOLOv8 has been replaced with an anchor-free design instead of the anchor-based head of YOLOv5, as shown in Figure 6. This technique localizes an object by referring to the object's center and then predicting the distances from the center to the edges of the bounding box. It improves processing speed by minimizing the number of bounding-box predictions, which in turn simplifies Non-Maximum Suppression (NMS), a complex post-processing step typically performed after prediction. As for mosaic augmentation, it is a simple method: it concatenates four different images into a new image, which forms the input to the model. Although simple, it is effective in improving object detection performance, especially in cases with limited or low-quality training data.
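The two changes above can be illustrated in a few lines. This is a simplified sketch, not YOLOv8's actual implementation: decode_ltrb shows the anchor-free idea of turning a predicted center and four edge distances into a box, and mosaic tiles four equally sized images (real mosaic augmentation also randomly scales and crops the tiles and remaps their bounding-box labels):

```python
import numpy as np

def decode_ltrb(cx, cy, l, t, r, b):
    """Anchor-free decoding: turn a grid-cell center (cx, cy) plus predicted
    distances to the left/top/right/bottom edges into an (x1, y1, x2, y2) box."""
    return (cx - l, cy - t, cx + r, cy + b)

def mosaic(images):
    """Mosaic augmentation (simplified): tile four equally sized H x W x C
    images into one 2H x 2W x C image."""
    top = np.concatenate([images[0], images[1]], axis=1)     # top-left | top-right
    bottom = np.concatenate([images[2], images[3]], axis=1)  # bottom-left | bottom-right
    return np.concatenate([top, bottom], axis=0)             # stack the two rows

print(decode_ltrb(50, 40, 10, 5, 20, 15))  # (40, 35, 70, 55)

# Four dummy 2x2 RGB "images", each filled with its own index value.
tiles = [np.full((2, 2, 3), i, dtype=np.uint8) for i in range(4)]
print(mosaic(tiles).shape)  # (4, 4, 3)
```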
YOLOv8 delivers remarkable improvements in adaptability to a variety of datasets and scenarios. Diverse versions such as YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x meet diverse usage needs, suitable for many levels of computing resources [22]. Seamless integration with TensorFlow and PyTorch makes YOLOv8 easy to apply to existing computer vision workflows, simplifying deployment for developers and researchers. Compared with previous versions of YOLO and other advanced object detection models, YOLOv8 asserts its competitive position with impressive performance in both accuracy and speed [25]. In short, YOLOv8 is an ideal choice for object detection applications that require high efficiency and flexibility.
YOLOv8 is known for its speed and efficiency in object detection tasks. It processes images quickly while maintaining high accuracy, making it suitable for real-time applications. It also often achieves state-of-the-art performance on benchmark datasets, especially in terms of mean Average Precision (mAP). Its architecture is designed to balance speed and accuracy effectively. YOLOv8 offers flexibility in model configuration and hyperparameters, allowing users to customize the model according to their specific requirements and constraints. Pre-trained versions of YOLOv8 on large datasets are widely available and can be fine-tuned or used directly for specific object detection tasks, saving time and computational resources. YOLOv8 provides an end-to-end solution for object detection, including both training and inference stages, simplifying the development process for users.

Figure 6: The detailed network model architecture of YOLOv8
6.2 Datasets
Classroom teaching has always played a fundamental role in education, as it provides a direct and interactive learning environment for students. However, understanding and analyzing student behavior in the classroom is crucial for evaluating the effectiveness of education. Human behavior is inherently diverse and complex, making it a challenging task to accurately assess and interpret student actions.
In order to address this challenge, we conducted a comprehensive study using image data collected directly from actual classroom recordings. We utilized the SCB-dataset [26] and various online sources to gather a diverse range of classroom images. These images were then processed and labeled using Roboflow, a powerful annotation tool that enabled us to accurately identify and label specific student behaviors.
The labeled images were divided into three distinct sets: Training, Validation, and Testing. We allocated 75% of the data for training purposes, 15% for validation to fine-tune our models, and the remaining 10% for testing the performance of the developed system. This division ensured that our models were trained on a diverse range of data and evaluated on unseen instances, enhancing their generalization capabilities.

Our labeling process focused on five main classes of student behavior, as illustrated in Table 1. These classes encompassed a wide range of actions and activities commonly observed in a classroom setting. By accurately labeling and categorizing these behaviors, we aimed to develop a robust system capable of effectively recognizing and analyzing student actions in real time.
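The 75/15/10 split described above can be sketched as a simple shuffled partition. This is a minimal illustration of the splitting idea, not necessarily the exact procedure used in the report:

```python
import random

def split_dataset(items, train=0.75, val=0.15, seed=42):
    """Shuffle and split a list of labeled samples into train/val/test.

    Ratios follow the report's 75/15/10 split; everything left after
    the train and validation slices goes to the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 75 15 10
```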
Class              Train   Val   Test
focused             1715   331    246
raising_hand         301   130     82
distracted          1248   173    100
sleep                392    99     67
using_phone          952   293    203
Number of images    1835   366    241

Table 1: The number of instances and images of each class in the Train, Validation, and Test datasets
6.3 Evaluation Metrics
When constructing a classification model, it is crucial to assess the proportion of correctly predicted instances relative to the total number of instances. This ratio is known as accuracy. Accuracy serves as a metric to gauge the predictive effectiveness of a model on a dataset. A higher accuracy signifies a more precise model performance [28].
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
IoU (Intersection over Union) is a crucial performance metric that plays a significant role in assessing the accuracy of annotation, segmentation, and object detection algorithms. It provides a quantitative measure of the overlap between the predicted bounding box or segmented region and the ground-truth bounding box or annotated region derived from a dataset. By calculating the ratio of the intersection area to the union area of these regions, IoU offers valuable insights into the precision and reliability of these algorithms. It serves as a useful tool for evaluating and comparing different models, enabling researchers and practitioners to make informed decisions about the effectiveness of their approaches in various computer vision tasks. It can be explained as representing the percentage of overlap between two bounding boxes: the ground truth box (G) and the detection box (D) [22]. It is calculated using the following equation:
IoU = Area(G ∩ D) / Area(G ∪ D)    (2)

Precision: Precision addresses the question of how many predicted instances are truly positive. Essentially, it measures the accuracy of positive predictions, particularly focusing on the positive group (in this case, the "BAD" records). A higher precision indicates better performance in classifying the positive group, which in turn reflects the model's ability to accurately identify instances belonging to the "BAD" category [29].

Precision = TP / Total Predicted Positive    (3)
Recall: Recall assesses the ability to correctly identify positive instances among all samples that actually belong to the positive group. It quantifies the proportion of true positives that the model correctly identifies [29]. The formula for recall is as follows:

Recall = TP / Total Actual Positive    (4)
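The metrics above can be checked directly with a small sketch. Boxes are given as (x1, y1, x2, y2), and the confusion counts at the end are hypothetical values for illustration:

```python
def iou(g, d):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(g[0], d[0]), max(g[1], d[1])  # intersection corners
    ix2, iy2 = min(g[2], d[2]), min(g[3], d[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(g) + area(d) - inter
    return inter / union if union else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)  # TP / total predicted positive

def recall(tp, fn):
    return tp / (tp + fn)  # TP / total actual positive

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-shifted -> 50/150

# Hypothetical confusion counts, for illustration only.
print(accuracy(80, 90, 10, 20))             # 0.85
print(precision(80, 10), recall(80, 20))
```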
Confusion matrix (best case):
TP (True Positive) is the total number of detections with IoU greater than or equal to 0.5; FP (False Positive) is the total number of detections with IoU less than 0.5; and FN (False Negative) is the total number of undetected objects in the test set. Precision is a performance measure that evaluates the ability to accurately identify true positives (TP) among all positive predictions. The Average Precision (AP) is the average accuracy of the model, while the mean Average Precision (mAP) is the average of AP across all detected classes; K represents the number of categories. In this paper, mAP50 and mAP50-95 are used to evaluate the performance of different models. mAP50 refers to the average precision for all classes at an IoU of 0.5, while mAP50-95 refers to the average precision for all classes within a range of IoU from 0.5 to 0.95, with a step size of 0.05 [29].
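Given per-class average precisions, mAP is simply their mean over the K classes. The AP50 values below are hypothetical placeholders for the five behavior classes, not results from this report:

```python
# mAP is the mean of per-class average precision (AP) values.
# The AP50 numbers below are hypothetical, for illustration only.
ap50_per_class = {
    "focused": 0.91,
    "raising_hand": 0.78,
    "distracted": 0.83,
    "sleep": 0.88,
    "using_phone": 0.80,
}
map50 = sum(ap50_per_class.values()) / len(ap50_per_class)
print(round(map50, 3))  # 0.84
```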
The TP, FP, TN, and FN indicators have the following meanings:
TP (True Positive): total number of positive pattern-matching prediction cases.
TN (True Negative): total number of negative pattern-matching prediction cases.
FP (False Positive): total number of cases where observations belonging to the negative label are predicted as positive.
FN (False Negative): total number of cases where observations belonging to the positive label are predicted as negative [29].