7.1 Dataset
7.1.1 General Information
The image dataset was self-collected and processed by the research group, whose members come from the International School - Vietnam National University, Hanoi. The dataset comprises about 3000 images of 5 members, averaging roughly 400 distinct color images per person. The images of each individual are stored in a directory labeled with their student ID.
To ensure a high-quality image dataset, the images were collected under the following conditions: (1) captured with smartphone cameras at a resolution of 750 x 1334 pixels or higher to limit blur; (2) each image contains exactly one face; (3) the subject is centered in the frame, with the camera focused on the center of the face under proper lighting; (4) the distance from the subject to the camera is kept at 0.5-1 m so that the face is clearly visible.
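The report does not state how these conditions were enforced; a minimal sketch of an automated check, assuming Python with OpenCV and the third-party `mtcnn` package (both tool choices are assumptions, not named in the report), might look like this:

```python
import cv2
from mtcnn import MTCNN  # assumed detector; pip install mtcnn

detector = MTCNN()

def passes_collection_checks(path, min_res=(750, 1334)):
    """Check conditions (1) and (2): minimum resolution and exactly one face."""
    img = cv2.imread(path)
    if img is None:
        return False
    h, w = img.shape[:2]
    # Condition (1): 750 x 1334 pixels or higher, in either orientation.
    if min(h, w) < min(min_res) or max(h, w) < max(min_res):
        return False
    # Condition (2): the image must contain exactly one face.
    faces = detector.detect_faces(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    return len(faces) == 1
```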
7.1.2 Data Collection
The process of collecting student image data includes the following steps:
• Step 1 - Identifying the sampling subjects: members of the research group from the International School - Vietnam National University, Hanoi.
• Step 2 - Establishing a list of 13 facial expression states: normal, smiling, frowning, wearing glasses, wearing a mask, left eye wink, right eye wink, both eyes closed, turn left, turn right, tilt head up, tilt head down, and various other angles.
Figure 13: Example of attendance data
• Step 3 - Capturing facial expression samples: approximately 30 images are taken for each expression. Backgrounds can be chosen freely, and images must be captured under both strong and weak lighting, with focus on the center of the face. Information about each subject is also recorded: student ID, full name, date of birth, class, and phone number.
• Step 4 - Storing the image data and the collected student information.
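Step 4 is not shown in code; one plausible layout, assuming a per-student image directory plus a shared `students.csv` file (both names are hypothetical), is:

```python
import csv
import os

def register_student(student_id, full_name, dob, class_name, phone, root="raw"):
    """Create the student's image directory and append their record (Step 4)."""
    os.makedirs(os.path.join(root, student_id), exist_ok=True)
    with open("students.csv", "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([student_id, full_name, dob, class_name, phone])

# Example call with placeholder values (only the student ID comes from the dataset).
register_student("22070018", "Nguyen Van A", "2004-01-01", "IS-VNU", "0900000000")
```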
7.1.3 Data Cleaning and Standardization
The collected data consist of 2118 raw, unprocessed images of the 5 group members. To use the data effectively, they are cleaned and standardized as follows:
• Step 1 - Data classification and filtering: classifying images by student, storing them in separate directories, then checking for errors and cleaning the data by removing noisy and blurry images.
• Step 2 - Labeling each image directory with the student ID, as in Figure 14.
Figure 14: Raw image directories labeled by student ID (22070018, 22070154, 22070156, 22070167, 22070277)
• Step 3 - Extracting faces from the original image data and formatting the new image data: we developed an automated program for face detection and extraction. It aligns the faces, resizes the images, and stores the cropped face images in a newly created image data directory (the processed directories carry the same names as the original ones), as in Figure 15. All images are cropped to the face, converted to *.png format, and resized to 160x160 pixels, as in Figure 16; a minimal sketch of such a pipeline follows Figure 15.
Figure 15: Processed image data: a processed/ directory with one subdirectory per student ID (22070018, 22070154, 22070156, 22070167, 22070277) and a bounding_boxes directory
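The extraction program itself is not listed in the report; the following is a minimal sketch of such a pipeline, again assuming OpenCV and the `mtcnn` package, with an assumed variance-of-Laplacian cutoff standing in for the Step 1 blur filter:

```python
import os
import cv2
from mtcnn import MTCNN

detector = MTCNN()
BLUR_THRESHOLD = 100.0  # assumed cutoff for the Step 1 blur filter

def is_blurry(gray):
    """Step 1: flag low-detail images by the variance of the Laplacian."""
    return cv2.Laplacian(gray, cv2.CV_64F).var() < BLUR_THRESHOLD

def extract_faces(raw_root="raw", out_root="processed", size=(160, 160)):
    """Step 3: detect, crop, resize, and save each face as a 160x160 PNG."""
    for student_id in os.listdir(raw_root):
        src_dir = os.path.join(raw_root, student_id)
        dst_dir = os.path.join(out_root, student_id)  # same name as the original
        os.makedirs(dst_dir, exist_ok=True)
        for name in os.listdir(src_dir):
            img = cv2.imread(os.path.join(src_dir, name))
            if img is None or is_blurry(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)):
                continue  # skip unreadable or blurry images
            faces = detector.detect_faces(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
            if len(faces) != 1:
                continue  # keep only single-face images
            x, y, w, h = faces[0]["box"]
            x, y = max(x, 0), max(y, 0)
            face = cv2.resize(img[y:y + h, x:x + w], size)
            out_name = os.path.splitext(name)[0] + ".png"
            cv2.imwrite(os.path.join(dst_dir, out_name), face)
```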
7.2 MTCNN Model
First, we need to understand the CNN algorithm. A CNN (Convolutional Neural Network) is a type of deep learning neural network commonly applied in computer vision for tasks such as image classification and face detection. MTCNN (Multi-task Cascaded Convolutional Networks) is a variation of CNN improved to perform multiple tasks simultaneously: detecting faces, determining facial landmark points, and delimiting the facial region. Developed in 2016, MTCNN has become a popular and effective tool for face recognition [31].
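As a concrete illustration (the report does not name its implementation), the open-source `mtcnn` Python package exposes these tasks in a single call:

```python
import cv2
from mtcnn import MTCNN

detector = MTCNN()
# "student.jpg" is a hypothetical input file.
img = cv2.cvtColor(cv2.imread("student.jpg"), cv2.COLOR_BGR2RGB)

for face in detector.detect_faces(img):
    print(face["box"])         # [x, y, width, height] of the facial region
    print(face["confidence"])  # face / non-face score
    print(face["keypoints"])   # five landmarks: eyes, nose, mouth corners
```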
7.2.1 Structure of MTCNN
MTCNN consists of three stacked CNNs (Convolutional Neural Networks), called P-Net (Proposal Network), R-Net (Refine Network), and O-Net (Output Network) [31]. Each network performs a specific task, and they collaborate to achieve the final result [31].
- P-Net (Proposal Network)
P-Net is the first stage in the face detection process. It uses a small CNN to scan the entire image and generate candidate face positions. The goal of P-Net is not only to determine potential face positions but also to generate proposals of different sizes to cover a range of possibilities, as in Figure 17 [31].
Figure 17: P-Net (from the MTCNN paper): a 12x12x3 input passes through three 3x3 convolutions (with 3x3 max pooling after the first), producing 5x5x10, 3x3x16, and 1x1x32 feature maps; three output heads give face classification (1x1x2), bounding box regression (1x1x4), and facial landmark localization (1x1x10)
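The layer stack in Figure 17 can be restated as a short PyTorch module (a sketch for illustration, not the authors' code; the PReLU activations follow the original MTCNN paper):

```python
import torch.nn as nn

class PNet(nn.Module):
    """Fully convolutional P-Net: 12x12x3 input, three output heads."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=3), nn.PReLU(10),          # -> 10x10x10
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),  # -> 5x5x10
            nn.Conv2d(10, 16, kernel_size=3), nn.PReLU(16),         # -> 3x3x16
            nn.Conv2d(16, 32, kernel_size=3), nn.PReLU(32),         # -> 1x1x32
        )
        self.classifier = nn.Conv2d(32, 2, kernel_size=1)   # face / non-face (1x1x2)
        self.bbox = nn.Conv2d(32, 4, kernel_size=1)         # bounding box regression (1x1x4)
        self.landmarks = nn.Conv2d(32, 10, kernel_size=1)   # five landmark points (1x1x10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x), self.bbox(x), self.landmarks(x)
```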
- R-Net (Refine Network)
After P-Net generates proposals, R-Net steps in to refine them. R-Net eliminates inaccurate proposals, reduces the number of candidates, and improves the accuracy of the bounding boxes. R-Net also evaluates the proposals from P-Net to remove false positives, as in Figure
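Between the stages, overlapping candidates are conventionally pruned with non-maximum suppression; the report does not show this step, so the following is a minimal IoU-based sketch assuming boxes as (x1, y1, x2, y2, score) tuples:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, threshold=0.5):
    """Keep the highest-scoring box, drop neighbours overlapping above threshold."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)  # sort by score
    kept = []
    while boxes:
        best = boxes.pop(0)
        kept.append(best)
        boxes = [b for b in boxes if iou(best, b) < threshold]
    return kept
```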