LÊ ĐỨC HUY
AUTHENTICATION VIA DEEP LEARNING FACIAL RECOGNITION WITH AND WITHOUT MASK AND TIMEKEEPING IMPLEMENTATION AT WORKING SPACES
Major: Computer Science Major code: 8480101
MASTER’S THESIS
Supervisors: Assoc. Prof. Quản Thành Thơ, Dr. Nguyễn Tiến Thịnh
Examiner 1: Assoc. Prof. Đỗ Văn Nhơn
Examiner 2: Assoc. Prof. Bùi Hoài Thắng
This master’s thesis was defended at Ho Chi Minh City University of Technology, VNU-HCM, on July 10th, 2023.
Master’s Thesis Committee:
1. Chairman: Assoc. Prof. Phạm Trần Vũ
2. Secretary: Dr. Nguyễn Lê Duy Lai
3. Examiner 1: Assoc. Prof. Đỗ Văn Nhơn
4. Examiner 2: Assoc. Prof. Bùi Hoài Thắng
5. Commissioner: Dr. Mai Hoàng Bảo Ân
Approval of the Chairman of the Master’s Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM Independence – Freedom - Happiness
THE TASK SHEET OF MASTER’S THESIS
Full name: Lê Đức Huy
Student ID: 2170306
Date of birth: 22/12/1996
Place of birth: Ho Chi Minh City
Major: Computer Science
Major ID: 8480101
I. THESIS TITLE (in Vietnamese): HỆ THỐNG NHẬN DIỆN KHUÔN MẶT CÓ VÀ KHÔNG CÓ KHẨU TRANG DỰA TRÊN NỀN TẢNG HỌC SÂU TRONG THỊ GIÁC MÁY TÍNH VÀ ỨNG DỤNG TRONG VIỆC CHẤM CÔNG TẠI CÁC DOANH NGHIỆP HIỆN NAY
II. THESIS TITLE (in English): AUTHENTICATION VIA DEEP LEARNING FACIAL RECOGNITION WITH AND WITHOUT MASK AND TIMEKEEPING IMPLEMENTATION AT WORKING SPACES
III. TASKS AND CONTENTS:
- Conduct research on modern machine learning and deep learning architectures
- Conduct research on applied techniques involved in biometric authentication in all walks of life
- Establish a face recognition model to turn theory into reality
- Explore datasets in search of appropriate ones to train the model
- Conduct an end-to-end model training
- Evaluate the model based on evaluation metrics such as Loss and Precision
- Conceptualize the idea, draw the flow diagram and design the time-keeping application using the face recognition method
- Build up the application and put it under quality assurance
IV. THESIS START DATE: 06/02/2023
VI. SUPERVISORS: Assoc. Prof. Quản Thành Thơ, Dr. Nguyễn Tiến Thịnh
Ho Chi Minh City, date
SUPERVISOR 1 SUPERVISOR 2
(Full name and signature) (Full name and signature)
CHAIR OF PROGRAM COMMITTEE
(Full name and signature)
Quản Thành Thơ Nguyễn Tiến Thịnh
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
ACKNOWLEDGEMENTS
I am truly elated to express my deepest gratitude to my supervisors, Assoc. Prof. Quan Thanh Tho and Dr. Nguyen Tien Thinh, for all the advice and elucidation on the road to the final target of this Master of Computer Science program. Their professionalism and character paved the way and encouraged me to fulfill the needs of the research and to complete this thesis.
ABSTRACT
Face recognition has so far played a crucial role in authentication, where it is regarded as the most secure and effective biometric method. However, masked faces in the wake of Covid-19 brought a huge challenge to existing techniques: part of the face is covered, and occlusion has since become a heated research topic once again. Motivated to contribute to the security matter in the face recognition industry, I have conducted research along the issue path left uncovered by many state-of-the-art machine learning models and arrived at the following proposals to alleviate, and to some extent enhance, the procedure for face verification under masks, considering these specific elements:
• The first use of the Siamese Neural Network (SNN) for recognizing human faces while wearing masks, instead of reusing pre-trained state-of-the-art models. The training and testing data are collected from the MLFW (Masked Labeled Faces in the Wild) dataset so that the resulting SNN meets the accuracy expectations set from the outset.
• The advantage and effectiveness of using ensemble learning to separate model training by purpose: faces with masks and faces without masks. This also highlights its capability to rule out the security breaches of a single model trained on mixed datasets.
TÓM TẮT LUẬN VĂN
Nhận diện khuôn mặt cho đến nay đã đóng một vai trò vô cùng quan trọng trong việc xác thực và được xem là phương thức bảo mật sinh trắc học an toàn và hiệu quả nhất. Tuy nhiên, việc nhận diện những khuôn mặt đeo khẩu trang sau đại dịch Covid-19 đã mang đến một thách thức to lớn đối với các kỹ thuật hiện có và trở thành một đề tài được giới chuyên môn hết sức quan tâm. Nhằm gây dựng sự đóng góp cho vấn đề bảo mật trong bài toán nhận diện khuôn mặt, học viên đã thực hiện nghiên cứu theo định hướng giải quyết vấn đề còn tồn đọng của các mô hình học máy tiên tiến nhất hiện nay và đưa ra giải pháp để đáp ứng việc xác minh khuôn mặt kể cả khi đeo khẩu trang dựa trên các yếu tố cụ thể như sau:
• Lần đầu áp dụng Siamese Neural Network (SNN) cho bài toán nhận diện khuôn mặt có đeo khẩu trang thay vì sử dụng lại các mô hình hiện đại đã được huấn luyện từ trước. Các tập dữ liệu huấn luyện và kiểm tra được thu thập dựa trên tập MLFW (Masked Labeled Faces in the Wild) để mô hình SNN có thể cho ra kết quả xác thực với độ chính xác đáp ứng được kỳ vọng đã đặt ra ngay từ khi bắt đầu triển khai.
• Tận dụng ưu điểm và tính hiệu quả của việc áp dụng ensemble learning để phân chia nhiệm vụ huấn luyện cho các mô hình phục vụ nhu cầu các bài toán khác nhau: nhận diện được khuôn mặt khi có đeo hoặc không đeo khẩu trang riêng biệt. Điều này cũng giúp cho tính bảo mật được đảm bảo so với việc huấn luyện một mô hình chung cho tác vụ nhận diện khuôn mặt kể cả khi có đeo và không đeo khẩu trang.
THE COMMITMENT OF THE THESIS’S AUTHOR
I hereby confirm that this thesis and the work presented in it are entirely my own. Where I have consulted the work of others, this is always clearly stated. All statements taken literally from other writings or referred to by analogy are marked, and the source is always given. This paper has not yet been submitted to another examination office, either in the same or a similar form.

I agree that the present work may be verified with anti-plagiarism software.
THE THESIS’S AUTHOR
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION 1
1.1 Introduction 1
1.2 Problem statement 1
1.3 Objectives and missions 3
1.4 Scope of the thesis 4
1.5 Thesis contributions 5
1.6 Thesis structure 5
CHAPTER 2: BACKGROUND KNOWLEDGE 7
2.1 Face recognition 7
2.2 Convolutional Neural Network (CNN) 8
2.2.1 Convolution 11
2.2.2 Pooling 13
2.2.3 Cross-Entropy Loss 14
2.3 Siamese Neural Network (SNN) 15
2.3.1 Overall of Siamese Neural Network 15
2.3.2 Loss function of SNN 16
2.3.3 Discussion on SNN 17
2.3.4 Evaluation metrics 18
2.4 Ensemble learning 20
CHAPTER 3: RELATED WORKS 29
3.1 Global feature support 31
3.1.1 Appearance based 31
3.1.2 Model based 31
3.2.1 Learning based 32
3.2.2 Hand-crafted based 32
3.3 One-shot learning 33
3.4 Discussion 34
CHAPTER 4: THE PROPOSED MODEL AND IMPLEMENTATION 36
4.1 Reference model 36
4.2 Datasets and pre-process 37
4.2.1 Labeled Faces in the Wild (LFW) datasets 37
4.2.2 Masked Labeled Faces in the Wild (MLFW) datasets 38
4.3 Application architecture 40
4.3.1 New hire model training 40
4.3.2 Verification process 41
4.3.3 Admin portal 42
4.3.4 Database management server (DBMS) 43
4.3.5 Streamlit User Interface 44
4.3.6 Flask 45
4.4 Proposed Model 45
4.4.1 Motivation and idea 45
4.4.2 Parameters configuration 47
4.4.3 Experimental results and Discussion 48
4.5 Ablation study 52
4.6 Time-keeping application 53
4.7 Multiple models for recognizing employees 55
CHAPTER 5: CONCLUSION 56
TABLE OF FIGURES
Figure 1.1: The face recognition and time-keeping application pipeline architecture 2
Figure 2.1: The flowchart of Face Recognition 8
Figure 2.2: Human brain processes the image and recognizes 9
Figure 2.3: The process of extracting hidden attributes from the face 10
Figure 2.4: The sample calculation of convolution 11
Figure 2.5: The sample convolutional neural network for image classification 12
Figure 2.6: A depiction of shared weights in convolutional neural network 13
Figure 2.7: A sample calculation of max pooling 14
Figure 2.8: The sample Siamese Neural Network for face recognition 16
Figure 2.9: A confusion matrix and its actual denotation 19
Figure 2.10: Ensemble learning 21
Figure 2.11: Bagging concept 22
Figure 2.12: Boosting concept 25
Figure 2.13: Stacking concept 27
Figure 3.1: The technique taxonomy for Face Recognition 30
Figure 4.1: The complete reference model for Face Recognition 37
Figure 4.2: Labeled Faces in the Wild datasets 38
Figure 4.3: MLFW is constructed by adding masks to the images in LFW, with perturbation to achieve a diverse generation effect 40
Figure 4.4: Training model process 40
Figure 4.5: Verification process 41
Figure 4.6: The admin portal for time-keeping boards and visualizations 42
Figure 4.7: Timesheet in the application 54
TABLE OF TABLES
Table 1.1: The output use cases of the face recognition model 3
Table 2.1: The development of Bagging concept 23
Table 2.2: The development of Boosting concept 25
Table 3.1: The summary of recent works relating to one-shot learning 33
Table 4.1: The originally given parameters 47
Table 4.2: Parameters configuration 48
Table 4.3: Overview of training, validation and testing image set 49
Table 4.4: Summary of performance outcomes on different face recognition baselines. “#Models” is the number of models used in the method for evaluation 50
Table 4.5: Comparison of model-training and model-testing time in seconds per epoch for different face recognition models 51
CHAPTER 1: INTRODUCTION
1.1 Introduction
After the Covid-19 pandemic, the biggest hit to daily life for over three years now, the world is gradually healing, but the virus remains a vicious threat and no one can predict when a new variant will suddenly appear. With that challenge in mind, business enterprises are eager to adapt to post-Covid-19 social distancing to some extent, ranging from wearing masks to contactless authentication methods in public places¹. In the midst of the Covid-19 resurgence, we also faced the hindrance of timekeeping handled manually through online spreadsheets, which caused huge delays in regular reports². These two add to the existing difficulties that have urged scientists to dive deep into Artificial Intelligence (AI) and Machine Learning to mitigate them and, more positively, to contribute to the major accomplishments of AI in all walks of life. To tackle these issues, one can see the potential of face recognition using the canonical Siamese Neural Network [1], a biometric authentication method integrated with a time-tracking system; the problem occurs, however, when a subject is wearing a mask. Recent studies indicate promising results in both face mask detection and masked face recognition using DeepMaskNet [2]. At first glance this greatly serves the purpose of making these obstacles fade away, but nothing can be concluded without real experiments, and questions remain open about the true ability of the Siamese Neural Network to recognize the similarities between faces when masks are worn.
1.2 Problem statement
It is obvious that the input of the face recognition model is a human face, either with or without a mask, captured as an image (in PNG or JPG format) via the live camera system. The model returns an output consisting of two main components:

¹ https://hbr.org/2020/09/adapt-your-business-to-the-new-reality
• The face recognition result: True or False; if True, the name of the person is stated.
• Potentially, a with-mask or without-mask result.
The face recognition model is presented in detail below:
Figure 1.1: The face recognition and time-keeping application pipeline architecture
Take the below motivating example, starting with a human face image captured from the live camera and the output at each step:
Table 1.1: The output use cases of the face recognition model
Input | Output at step 1 | Output at step 2 | Output at step 3
Human face image (with mask) | Mr. John McCarthy: True (with mask) | Access Granted | Time-in: June 10th, 2023 07:49:05 AM
Human face image (without mask) | Mr. Andrew Le: True (without mask) | Access Granted | Time-in: June 10th, 2023 09:16:23 AM
Human face image (with/without mask) | Unable to recognize | Access Denied | Time-in: No record found
1.3 Objectives and missions
This thesis opted for a comprehension of face recognition models based on machine learning and deep learning, with details as follows:
• Understand the fundamental principles of machine learning and deep learning.
• Identify the problems with face recognition (especially with masks) and the ways to resolve them based on recent face recognition articles.
• Analyze all the procedures, assess the feasibility of each solution, and draw a conclusion on the pros and cons of the proposed solution.
• Research popular and appropriate human face datasets (with and without masks) and collect them in advance for later use.
• Put the face recognition model to a real test, then understand and suggest enhancements for better accuracy and performance.
opportunities for future research and potentially deploy the product for mass usage.
Tasks involved in the thesis:
o Collect selective papers on face recognition from recent years.
o Research past and current obstacles and unsolved cases in face recognition with masks.
o Conduct experiments with several face recognition methods, especially those for masked faces, and propose the most appropriate one for recognition both with and without masks, based on the feasibility of the timeline and scope.
o Identify the output and the expectations of the model to support the collection of relevant datasets.
o Establish the model using existing frameworks, libraries and tools, then test it and validate the results.
o Provide the final conclusion and the further steps to be taken in the long run.
1.4 Scope of the thesis
There is a long line of face recognition research and applications, hence the scope of the thesis has been defined as follows:
• Build the face recognition model using the Convolutional Neural Network to recognize human faces with and without masks (stating the name of the user).
• The datasets used have to be popular for evaluation and include a variety of facial components.
• Machine learning technology: representation using Euclidean distance; evaluation using Precision, Recall and F1; optimization using Adam and Stochastic Gradient Descent with the relevant parameter configuration.
1.5 Thesis contributions
In this thesis, the author proposes a solution with a machine learning model such that:
• It can be applied to face recognition with both masked and unmasked human face datasets.
• The model acts as a baseline for improving the security of face recognition with ensemble learning.
• The quality of the model behind a ubiquitous Python web application is proven to be advanced and to fulfill the needs of enterprises.
1.6 Thesis structure
The structure of the thesis consists of 5 chapters:
- Chapter 1: Introduction. The first chapter is dedicated to an overview of face recognition and its current implementation across different sectors of information technology. An important part of this chapter restates the issue with time-keeping at firms nowadays, which has pushed developers to take multiple steps towards advanced, newly introduced neural networks and deep learning methods. The plan, orientation, targets and milestones are also indispensable segments and are briefly established in this chapter.
- Chapter 2: Background knowledge. This is where the author gathers all related knowledge about neural networks and deep learning methods, as well as Python web-based development and the database management system involved, in order to fully support the project.
- Chapter 3: Related works. Other methods and models are also taken into consideration.
- Chapter 4: The proposed model and implementation. The most appropriate models and methods are proposed, with the reasons stated in this chapter, leaving room for breakthrough findings and improvements. Datasets: the datasets depend on how the model is established and trained, but a large number of records and characteristics inside the datasets is required to verify the accuracy of the model; detailed instructions on how to form and use the datasets are introduced in this chapter. Implementation: this comes as the most important part of the project, so a deep dive into the existing model equips the author with the mechanism and a thorough understanding towards the goal of the project; how the model is built through coding and compiling is captured in this part. Experimental results and discussion: this section reports the results of the applied model and compares it with other relevant models; loss functions and optimization are well defined and stated, opening up opportunities for further research.
- Chapter 5: Conclusion. The author does a retrospective of all phases of the
CHAPTER 2: BACKGROUND KNOWLEDGE
This thesis is anchored in two main theories: machine learning, which inspires the model to be built, and software development, specializing in web applications. Based on these fundamentals, the knowledge base of the thesis follows that trail: the definition of face recognition in general, the Convolutional Neural Network to extract latent features, the Siamese Neural Network [4] to tackle the problems of limited time, resources and variation during face recognition, and finally the loss functions for mathematical optimization and decision making.
2.1 Face recognition
A facial recognition system is a technology capable of matching a human face from a digital image or a video frame against a database of faces. Such a system is typically employed to authenticate users through ID verification services, and works by pinpointing and measuring facial features from a given image [5].
Facial recognition (Figure 2.1) [6] identifies a human face in a two-dimensional image. Firstly, the face has to be separated from a noisy background. Then the face is cropped to a desired size and converted to grayscale. This step is useful for local feature support, since it enables the accurate localization of landmarks before the final step of feature extraction, where multiple neural networks and filters are involved in representing a complete face that is compared against the existing data in the database.
a new type of machine learning methods called deep-learning architectures (which will be elaborately described in the next section).
Figure 2.1: The flowchart of Face Recognition
2.2 Convolutional Neural Network (CNN)
analysis, natural language processing, brain–computer interfaces, and financial time series.
Thanks to the nature of CNNs for latent feature extraction, they have been widely used for image classification problems. Take the example below to see how a human being uses the neurons in the brain to think, remember and process an image, compared with an artificial neural network.
Figure 2.2: Human brain processes the image and recognizes
Figure 2.2³ illustrates the way a machine can reproduce the process native to the human brain to identify the correct face among others. The input parameters consist of multiple components of the face, such as the eyes, nose and mouth. However, these do not play the most important role in deciding the truthfulness of the face; hidden attributes, ranging from the sinuses to the jaw or hairstyle, are the ones in charge. This takes machines a step further, since human beings over time may suffer from memory loss or keep their focus only on the main characteristics of the face. Therefore, the input parameters stated above, combined with the hidden attributes extracted by the machine learning model, make a strong step towards accuracy, authority and security, thus encouraging the effectiveness of face recognition in this thesis.
Figure 2.3: The process of extracting hidden attributes from the face
Figure 2.3 shows the basic learning process of an artificial neural network⁴. From the input values, the neural network performs processing operations to extract the latent features of the data. These hidden attributes can become the input parameters of subsequent layers until the final result shows up. There are also shared weights (described in detail in section 2.2.1) in the Convolutional Neural Network, which reduce the training time of the model instead of involving a huge number of parameters, thus bringing a huge advantage over fully-connected layers.
In a normal artificial neural network, a hidden layer is formed by different neural nodes in series, and through information processing creates the neural nodes of the next layer. This process comes with the aid of fully-connected layers, where each node of the current layer connects to each node of the next layer. However, a problem occurs when the number of input parameters grows rapidly, which in turn puts a burden on complexity and performance. For instance, when inputting an image of size 64x64x3 to the network, all pixels must be converted into nodes, meaning 64 x 64 x 3 = 12288 nodes for the input alone. In addition, multiplying the number of nodes by the width of the first hidden layer (take 1000 as an example) brings the number of weights well past 12 million. That is a huge number considering that the stated image is captured at a low resolution compared to today's standards, and that this network has only a single hidden layer. With that being said, specific mathematical operations need to get involved to mitigate the burden. Outstanding among them are convolution and pooling.
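The parameter-count arithmetic above is easy to verify directly; the hidden-layer width of 1000 is the example value from the text, not a recommendation:

```python
# Cost of one fully-connected layer on a 64x64 RGB image, as described above.
input_nodes = 64 * 64 * 3             # every pixel of every channel becomes a node
hidden_nodes = 1000                   # example width of the first hidden layer
weights = input_nodes * hidden_nodes  # one weight per input-to-hidden connection

print(input_nodes)  # 12288
print(weights)      # 12288000, i.e. over 12 million for a single layer
```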
2.2.1 Convolution
The Convolutional Neural Network (illustrated in Figure 2.5) starts with the idea of performing the convolution calculation depicted in the figure below:
Figure 2.4: The sample calculation of convolution
Trang 25𝑆𝑖𝑗 = (𝐼 ∗ 𝐾)𝑖𝑗 = ∑
𝑚
∑
𝑛
𝐼(𝑚, 𝑛)𝐾(𝑖 − 𝑚, 𝑗 − 𝑛) (2.1)
where 𝑖, 𝑗 address the position of the result element, 𝑚 and 𝑛 is the size of the input matrix
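As a sketch, equation (2.1) can be reproduced in a few lines of plain Python. The 4x4 input and 2x2 kernel below are illustrative values only; note the kernel flip implied by the K(i − m, j − n) term, which distinguishes convolution from cross-correlation:

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution following equation (2.1): the kernel is
    flipped in both axes before the sliding dot product."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + m][j + n] * flipped[m][n]
                           for m in range(kh) for n in range(kw)))
        out.append(row)
    return out

image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
kernel = [[1, 0],
          [0, -1]]
result = conv2d(image, kernel)  # each entry equals image[i+1][j+1] - image[i][j]
```

For this particular input, every output entry is 5, since each diagonal step in the image increases the pixel value by 5.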
Figure 2.5: The sample Convolutional Neural Network for image classification
One of the most recognizable features of the Convolutional Neural Network is the use of shared weights: the same weights are used for each kernel, and neurons in the first hidden layer precisely discover the similarities between different regions, in other words, the latent features of the input (as shown in Figure 2.6). This is done with the purpose of reducing the number of input parameters while encouraging the discovery of the main features of the image. For instance, consider the 7x7 image matrix in Figure 2.4 with 2 kernels
Figure 2.6: A depiction of shared weights in the Convolutional Neural Network
2.2.2 Pooling
As the convolutional layer outputs a matrix, the network needs to reduce its dimensions as far as noise is concerned, so pooling was introduced for this purpose. There are several pooling operations, such as average, max or sum, but max pooling has proved efficient in terms of noise reduction. Take the following example (Figure 2.7) as the steps for reproducing max pooling.

Figure 2.7: A sample calculation of max pooling
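The max-pooling step of Figure 2.7 can be sketched as follows; a 2x2 window with stride 2 is assumed (the common default), and the feature-map values are illustrative:

```python
def max_pool2d(x, size=2, stride=2):
    """Max pooling: keep only the largest activation in each window,
    halving each spatial dimension when size == stride == 2."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            row.append(max(x[a][b]
                           for a in range(i, i + size)
                           for b in range(j, j + size)))
        out.append(row)
    return out

feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [3, 8, 4, 5]]
pooled = max_pool2d(feature_map)  # 4x4 -> 2x2: [[6, 4], [8, 9]]
```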
2.2.3 Cross-Entropy Loss
The Convolutional Neural Network outputs a classification given the input, so the loss function is extremely important in assessing the closeness between an observation (𝑦, which is 0 or 1) and the outcome of the network itself (𝑦̂ = 𝜎(𝐰 ⋅ 𝐱 + 𝑏)); the loss ℒ is denoted as follows:
ℒ(𝑦̂, 𝑦) = How much 𝑦̂ differs from the true 𝑦 (2.2)
The most commonly used loss function that can be taken into consideration in classification problems is the Least Squared Error (LSE) [8], with the equation below:

\( \mathcal{L}(\hat{y}, y) = \tfrac{1}{2} (\hat{y} - y)^2 \)  (2.3)
\( p(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1-y} \)  (2.4)
Taking the log of both sides of equation (2.4) gives the log of the probability:

\( \log p(y \mid x) = \log\left[ \hat{y}^{\,y} (1 - \hat{y})^{1-y} \right] = y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \)  (2.5)
Flipping the sign on this log likelihood would give us the formula called the Binary Cross-Entropy loss which should be minimized to do the back propagation with parameters 𝑤, 𝑏 and receive a better outcome:
\( L_{\mathrm{CE}}(\hat{y}, y) = -\log p(y \mid x) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right] \)  (2.6)
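Equation (2.6) translates directly into code; the clamping of ŷ away from 0 and 1 below is a standard numerical precaution (an implementation detail, not part of the formula):

```python
import math

def binary_cross_entropy(y_hat, y):
    """Equation (2.6): -[y*log(y_hat) + (1 - y)*log(1 - y_hat)]."""
    eps = 1e-12                              # avoid log(0)
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# A confident correct prediction costs little; a confident wrong one is
# penalized heavily, which is what drives back-propagation:
low = binary_cross_entropy(0.99, 1)   # ~0.01
high = binary_cross_entropy(0.01, 1)  # ~4.6
```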
2.3 Siamese Neural Network (SNN)
2.3.1 Overall of Siamese Neural Network
A Siamese Neural Network (SNN) [4] is a neural network architecture containing two or more identical sub-networks (networks with the same configuration as well as the same parameters and weights), and any update of the parameters is immediately reflected in its sub-networks.

SNNs are mostly used to find the similarity of input data by comparing their feature vectors. Some popular applications of SNNs are worth mentioning: face recognition, signature verification, anti-spoofing, etc.
The steps of an SNN are reproduced as follows (as shown in Figure 2.8):
• Select a pair of images (or anything that needs to be classified) from the dataset.
• Bring each image through a sub-network of the SNN for processing; the output of each sub-network is an embedding vector.
• Calculate the Euclidean distance between those two embedding vectors.
• A sigmoid function can be applied over the distance to give a score within the range [0, 1], representing the similarity between the two embedding vectors. The closer the score is to 1, the more similar the two vectors are, and vice versa.
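The verification steps above can be sketched end-to-end. Mapping the distance to a [0, 1] score can be done in several ways; the choice below (a rescaled sigmoid of the distance, so that identical embeddings score exactly 1) is an illustrative assumption, since in practice the mapping is usually a small learned layer:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(emb1, emb2):
    """Map the distance d >= 0 to a score in (0, 1]:
    2 * (1 - sigmoid(d)) equals 1 at d = 0 and decays towards 0."""
    d = euclidean(emb1, emb2)
    return 2.0 * (1.0 - 1.0 / (1.0 + math.exp(-d)))

same = similarity([0.1, 0.9], [0.1, 0.9])   # identical embeddings -> 1.0
diff = similarity([0.1, 0.9], [5.0, -3.0])  # distant embeddings -> near 0
```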
Figure 2.8: The sample Siamese Neural Network for face recognition
2.3.2 Loss function of SNN
• Triplet Loss function
The idea of Triplet Loss is to use a set of three input data including: Anchor (𝐴),
Positive (𝑃) and Negative (𝑁) where the distance from 𝐴 to 𝑃 is minimized while the distance from 𝐴 to 𝑁 is maximized during training
\( \mathcal{L}(A, P, N) = \max\left( \lVert f(A) - f(P) \rVert^2 - \lVert f(A) - f(N) \rVert^2 + \alpha,\; 0 \right) \)  (2.7)
where \( A \) is an anchor input, \( P \) is a positive input of the same class as \( A \), \( N \) is a negative input of a different class from \( A \), \( \alpha \) is a margin between positive and negative pairs, and \( f \) is the function producing the embedding vector.
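A minimal sketch of equation (2.7), using squared Euclidean distances on toy 2-D embeddings (a real f(·) would be the sub-network output):

```python
def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Equation (2.7): hinge on the gap between the anchor-positive and
    anchor-negative squared distances, with margin alpha."""
    d_ap = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_an = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(d_ap - d_an + alpha, 0.0)

# Well-separated triplet: positive close, negative far -> zero loss.
loss_good = triplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 2.0])
# Violated triplet: positive farther than negative -> positive loss.
loss_bad = triplet_loss([0.0, 0.0], [1.0, 1.0], [0.1, 0.0])
```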
• Contrastive Loss function
The idea of Contrastive Loss is similar to Triplet Loss, but it differs in using only a pair of inputs, either of the same class or of different classes. If they are of the same class, the distance between their feature vectors is minimized; if of different classes, the distance between their feature vectors is maximized during training [11].
\( \mathcal{L} = (1 - Y)\, \tfrac{1}{2} D_w^2 + Y\, \tfrac{1}{2} \max(0,\, m - D_w)^2 \)  (2.8)
where \( D_w \) is the Euclidean distance between the input embeddings, \( Y \) is the pair label, and \( m \) is the margin.
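Equation (2.8) in code; the label convention assumed here is Y = 0 for a "same" pair and Y = 1 for a "different" pair, matching the placement of the two terms above, and the margin m = 1 is an illustrative default:

```python
import math

def contrastive_loss(e1, e2, y, margin=1.0):
    """Equation (2.8): (1 - y) * 0.5 * d^2 + y * 0.5 * max(0, m - d)^2."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, margin - d) ** 2

same_pair = contrastive_loss([0.0, 0.0], [0.2, 0.0], y=0)  # pulls the pair together
diff_pair = contrastive_loss([0.0, 0.0], [0.2, 0.0], y=1)  # pushes the pair apart
```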
2.3.3 Discussion on SNN
• Advantages:
The amount of data required to train an SNN is relatively small compared to other neural networks; the technique behind that miracle is One-Shot Learning [12] or Few-Shot Learning.
Data-imbalance issues are hardly a concern.
An SNN has a strong sense of uniqueness, hence a hybrid model is possible when combining an SNN with other classification models.
It learns from semantic similarity, since the SNN focuses on learning features in deeper layers, where similar features are placed close to each other.
• Disadvantages:
Training takes longer, since the network scores the similarity of each pair.
The probability of each class distribution is unknown.
So far the only concern for the SNN is performance, but as the project goes on, once accuracy has been guaranteed (the metrics used to evaluate the model are described in the section below), optimization should be considered for an optimal solution.
2.3.4 Evaluation metrics
To evaluate the Siamese Neural Network (SNN), which compares the similarity between each pair of inputs, one can use either:
• Extrinsic evaluation [13]: compare the output of the SNN model with other models in terms of accuracy, e.g. matched/not matched face.
• Intrinsic evaluation: measure and assess the outcome of the SNN model at each training epoch via validation checkpoints and the verification threshold (the threshold is a decisive factor).
However, among these types of evaluation it is still ambiguous which method is applicable to this model or to other models. This is the main reason why the standard classification metrics below are considered.

Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while Recall is the fraction of relevant instances that were retrieved [14]. Both precision and recall are therefore based on relevance. For classification tasks, the terms true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) compare the results of the classifier with trusted external judgments (Figure 2.9 shows the confusion matrix of a statistical classification problem). The terms positive and negative refer to the classifier’s prediction (sometimes known as the expectation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the actual observation). The formulas for Precision, Recall and Accuracy (derived from the same counts) are:

\( \text{Precision} = \dfrac{TP}{TP + FP} \)  (2.9)

\( \text{Recall} = \dfrac{TP}{TP + FN} \)  (2.10)

\( \text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN} \)  (2.11)
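Equations (2.9) to (2.11) computed from confusion-matrix counts; the counts below are illustrative, e.g. a verifier that accepts 40 genuine pairs, rejects 50 impostor pairs, wrongly accepts 10, and misses none:

```python
def precision_recall_accuracy(tp, tn, fp, fn):
    """Equations (2.9)-(2.11) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

p, r, a = precision_recall_accuracy(tp=40, tn=50, fp=10, fn=0)
# p = 0.8 (the 10 false accepts hurt precision), r = 1.0 (no misses), a = 0.9
```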
2.4 Ensemble learning
As machine learning has been widely used to solve multiple problems such as image classification, face recognition and natural language processing, the development of advanced, sophisticated models has had to trend upward to keep up with demand. Generally speaking, the goal of each model is to perform at its best to produce the output given the datasets. However, a question soon arises about practicality, since any individual model has its own pros and cons, and it can be time-consuming to sort out and discover the model that best fits a problem. Under these circumstances, an attractive approach called ensemble learning comes up: it stands for the combination of various models that make the final output based on the referendum of each model.
To prove the effectiveness of ensemble learning in solving different matters, de Condorcet [15] expressed a mechanism first used in politics: “if the probability of each voter being correct is above 0.5 and the voters are independent, then the addition of more voters increases the probability of the majority vote being correct until it approaches 1”. Although not native to machine learning, this idea brings a very intuitive way to overcome the issue of looking for a specific model when dealing with a complicated problem. The reasons for the success of ensemble learning include: statistical, computational and representation learning [16], bias-variance decomposition [17] and strength-correlation [18].
A comprehensive review of the ensemble methods and their challenges is given in [22].
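Condorcet's argument quoted above can be checked numerically. Assuming independent voters that are each correct with probability p, the majority of n voters (n odd, so there are no ties) is correct with probability given by a binomial tail sum:

```python
from math import comb

def majority_correct(p, n):
    """Probability that a majority of n independent voters, each correct
    with probability p, votes correctly (n odd, so no ties)."""
    k = n // 2 + 1  # smallest majority
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_single = 0.6
p_three = majority_correct(p_single, 3)   # 0.648: three voters already beat one
p_many = majority_correct(p_single, 21)   # more voters -> closer still to 1
```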
Figure 2.10: Ensemble learning
Figure 2.10 illustrates the basic principle of ensemble learning, where multiple models (1, 2, ..., n) are involved in processing a single input in order to capture all aspects of the datasets and produce the most relevant output for the problem's needs. In the past few years, plenty of work has been conducted on the practicality of ensemble learning; as a result, it is precisely categorized into three major methods: bagging, boosting and stacking.
• Bagging
Among these three strategies, bagging (bootstrap aggregating) is the fundamental approach to improving the performance of a classification problem. Starting from a single dataset, bagging draws sample datasets (the "bags", sampled with replacement) and puts each one through a corresponding model in preparation for the final amalgamation of predictions (Figure 2.11). This ensemble prediction generates a better result than a single prediction on average.
Figure 2.11: Bagging concept
Table 2.1: The development of the Bagging concept

Reference   Contribution
[23]        The idea of Bagging proposed
[30]        Case study of bagging, boosting and basic ensembles
[31]        Theoretical analysis of bagging
[18]        Bagging with random subspace decision trees and ensembling outputs via majority voting
[32]        Study of Bayesian regularization, early stopping and Bagging
[24]        Bagging with SVMs and ensembling outputs via SVMs, majority voting and least squares estimation
[33]        Theoretical justification of Bagging; proposed subbagging and half subagging
[25]        Bagging with decision trees and ensembling outputs via Kaplan–Meier curves
[27]        Theoretical and experimental analysis of online bagging and boosting
[26]        Proposed asymmetric bagging with SVMs and ensembling outputs via SVMs
[34]        Roughly balanced bagging on decision trees and ensembling outputs via majority voting
[29]        Bagging with neural networks
[35]        Neighbourhood balanced bagging, ensembling outputs via majority voting
Despite the sophistication of ensemble learning and prediction, the bagging method allows the computation to run in parallel within the limits of the hardware, thus saving time and cost while maintaining or even improving the precision and quality of the ensembled output.
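The bag-then-vote pipeline above can be sketched end to end. The example below is a minimal illustration, not a production implementation: the base learner is a toy one-dimensional threshold stump, and the dataset is a hypothetical separable sample invented for the demonstration.

```python
import random

def bootstrap_sample(data):
    """Draw one bootstrap 'bag': len(data) points sampled with replacement."""
    return [random.choice(data) for _ in data]

def train_stump(bag):
    """Toy base learner: pick the threshold t minimising training error
    of the rule 'predict 1 iff x > t'. Items are (x, label) pairs."""
    best_t, best_err = None, float("inf")
    for t, _ in bag:
        err = sum((x > t) != bool(y) for x, y in bag)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagged_predict(stumps, x):
    """Amalgamate the bags' predictions by majority vote."""
    votes = sum(x > t for t in stumps)
    return int(votes * 2 > len(stumps))

random.seed(0)
# hypothetical 1-D dataset: class 1 has larger feature values
data = [(x / 10, int(x > 5)) for x in range(11)]
stumps = [train_stump(bootstrap_sample(data)) for _ in range(25)]
print(bagged_predict(stumps, 0.9), bagged_predict(stumps, 0.1))
```

Because each bag is drawn and fitted independently, the loop over bags is exactly the part that can be parallelized, which is the time-and-cost advantage noted above.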
• Boosting
The second method, as its name states, was established to enhance the learning model in a positive way: it attempts to build a strong classifier out of originally weak classifiers. The technique proceeds serially, with each model continuing to learn from the errors of the previous, adjacent model (Figure 2.12). This process is repeated until a stopping criterion is met.
Figure 2.12: Boosting concept
Table 2.2: The development of the Boosting concept

Reference   Contribution
[38]        Boosted deep belief network (DBN) as base classifiers for facial expression recognition
[39]        Decision trees as base classifiers for binary classification problems
[40]        Decision trees as base classifiers for multiclass classification problems
[41]        Ensemble of CNN and boosted forest for edge detection, object proposal generation, pedestrian and face detection
[43]        CNN boosting applied to bacteria cell images and crowd counting
[44]        Boosted deep independent embedding model for online scenarios
[45]        Transfer learning based deep incremental boosting
[46]        Boosting-based CNN with incremental approach for facial action unit recognition
[47]        Deep boosting for image denoising with dense connections
[47]        Deep boosting for image restoration and image denoising
[48]        Hierarchical boosted deep metric learning with hierarchical label embedding
[49]        Snapshot boosting
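The serial error-correcting loop described above can be made concrete with an AdaBoost-style sketch. This is a minimal illustration under assumed conditions (one-dimensional data, decision stumps as the weak classifiers, a tiny hypothetical training set), not the exact algorithm of any of the cited works: each round fits a stump to the current sample weights, then re-weights so the next stump focuses on the previous one's mistakes.

```python
import math

def train_adaboost(data, n_rounds=5):
    """Minimal AdaBoost sketch: data is a list of (x, y), y in {-1, +1}."""
    n = len(data)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []                          # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        # pick the weighted-error-minimising stump: pol * sign(x > t)
        best = None
        for t, _ in data:
            for pol in (+1, -1):
                err = sum(wi for wi, (x, y) in zip(w, data)
                          if pol * (1 if x > t else -1) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = max(err, 1e-10)              # guard against log(1/0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # re-weight: samples this stump got wrong gain weight for the next round
        w = [wi * math.exp(-alpha * y * pol * (1 if x > t else -1))
             for wi, (x, y) in zip(w, data)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps, each weighted by its alpha."""
    score = sum(a * p * (1 if x > t else -1) for a, t, p in ensemble)
    return 1 if score > 0 else -1

# hypothetical separable training set
data = [(0.1, -1), (0.2, -1), (0.4, -1), (0.6, 1), (0.8, 1), (0.9, 1)]
ensemble = train_adaboost(data)
print([predict(ensemble, x) for x in (0.05, 0.95)])
```

The re-weighting line is the "learning from the past errors of the previous model" step: misclassified samples receive larger weights, so the next weak classifier is forced to attend to them.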
• Stacking
Stacking in ensemble learning is an integration technique in which the outputs of the baseline models are piled up as inputs to a further model that learns the best final output (Figure 2.13). This allows the combined network to gain significant improvement in performance without requiring many training datasets.
Figure 2.13: Stacking concept
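The two-level structure of Figure 2.13 can be sketched numerically. In this minimal illustration (a hypothetical 2-D dataset and two deliberately weak single-feature base models, both invented for the example), the base predictions are stacked as new features and a least-squares meta-model is fitted on top of them; neither the data nor the meta-model is drawn from the cited literature.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical 2-D data: label is 1 only when BOTH features exceed 0.5
X = rng.uniform(0, 1, size=(200, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(float)

# level-0: two weak base models, each seeing only one feature
def base_0(X): return (X[:, 0] > 0.5).astype(float)
def base_1(X): return (X[:, 1] > 0.5).astype(float)

def stack_features(X):
    """Pile up the base models' outputs (plus a bias column)."""
    return np.column_stack([base_0(X), base_1(X), np.ones(len(X))])

# level-1: least-squares meta-model fitted on the stacked predictions
w, *_ = np.linalg.lstsq(stack_features(X), y, rcond=None)

def stacked_predict(X):
    return (stack_features(X) @ w > 0.5).astype(float)

acc_base = (base_0(X) == y).mean()
acc_stack = (stacked_predict(X) == y).mean()
print(acc_base, acc_stack)
```

Neither base model alone can express the AND of the two features, but the meta-model, seeing both base outputs at once, recovers it, which is the intuition behind stacking's performance gain.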
• Unweighted Model Averaging
Ensemble learning treats the individual models as voters in the final decision, so the most intuitive way to wrap up the result is to take the average of their predictions. This is in fact the most popular way to address concerns about bias, variance, etc., and it generalizes performance by reducing the imbalance between models. The average is taken either over class probabilities obtained via softmax or directly over the raw outputs, respectively:
\[
P_{ij} = \mathrm{softmax}_j(O_i) = \frac{\exp(O_{ij})}{\sum_{k=1}^{K} \exp(O_{ik})}
\]

where \(P_{ij}\) is the probability model \(i\) assigns to class \(j\), \(O_i\) is the raw output (logit) vector of model \(i\), and \(K\) is the number of classes.
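Unweighted model averaging is then a single mean over the per-model probabilities. The sketch below uses hypothetical logits for n = 3 models over K = 4 classes, invented purely to demonstrate the computation:

```python
import numpy as np

def softmax(o):
    """Row-wise softmax: P_ij = exp(O_ij) / sum_k exp(O_ik),
    shifted by the row max for numerical stability."""
    e = np.exp(o - o.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical raw outputs (logits) of 3 models over 4 classes
logits = np.array([[2.0, 1.0, 0.1, 0.5],
                   [1.5, 1.8, 0.2, 0.3],
                   [2.2, 0.9, 0.4, 0.1]])

P = softmax(logits)        # per-model class probabilities P_ij
p_bar = P.mean(axis=0)     # unweighted average over the n models
print(p_bar, p_bar.argmax())
```

Note that the second model slightly prefers class 1, but the unweighted average still settles on class 0, illustrating how averaging smooths out individual models' disagreements.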