

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY FACULTY OF COMPUTER SCIENCE & ENGINEERING

——————– * ———————

BACHELOR THESIS

Study and Improve Few-shot Learning Techniques in Computer Vision Application

Major: Computer Engineering

Council: Computer Engineering 1

Supervisors: Dr. Le Thanh Sach, Dr. Nguyen Ho Man Rang

Reviewer: Dr. Nguyen Duc Dung

—o0o—

Student: Nguyen Duc Khoi (1752302)

HO CHI MINH CITY, 8/2021


VIETNAM NATIONAL UNIVERSITY, HCMC        SOCIALIST REPUBLIC OF VIETNAM

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY        Independence - Freedom - Happiness

FACULTY: Computer Science & Engineering        GRADUATION THESIS ASSIGNMENT

DEPARTMENT: Computer Science        Note: the student must attach this sheet to the first page of the thesis report

Student ID: 1752302        Full name: NGUYEN DUC KHOI        Major: Computer Engineering

1. Thesis title:

EN: A study on few-shot learning for computer vision applications

VN: Nghiên cứu và cải tiến kỹ thuật học với số ít mẫu được làm nhãn cho các ứng dụng trong thị giác máy tính

2. Tasks (requirements on content and initial data):

• Study deep learning, and do a literature review of few-shot learning;

• Propose a learning technique for training deep models (in computer vision) with popular datasets on the Internet;

• Apply few-shot learning to an application in computer vision, from training and tuning to deploying the trained model on embedded systems supported by NVIDIA's technologies.

3. Thesis assignment date: 01/01/2021

4. Completion date: 01/08/2021

Supervisors: 1) Lê Thành Sách (co-supervisor)        2) Nguyễn Hồ Mẫn Rạng (co-supervisor)

The content and requirements of the thesis have been approved by the Department.

Date: __/__/2021

(Signature and full name)        (Signature and full name)

Lê Thành Sách

FOR THE FACULTY AND DEPARTMENT:

Preliminary reviewer: _

Unit: _

Defense date: _

Final grade: _


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY        SOCIALIST REPUBLIC OF VIETNAM

EN: A study on few-shot learning for computer vision applications

VN: Nghiên cứu và cải tiến kỹ thuật học với số ít mẫu được làm nhãn cho các ứng dụng trong thị giác máy tính

3. Supervisor: Dr. Lê Thành Sách

4. Overview of the report:

- Number of hand-drawn figures: _        Number of computer-drawn figures: _

6. Main strengths of the thesis:

• The author masters the different techniques required for designing deep learning models, and for training, tuning, and deploying models to GPU cards with NVIDIA's technologies.

• The thesis consists of a science task and an engineering task related to deep learning, as follows:

(a) Science: improve a selected few-shot learning technique for computer vision. The author has proposed an idea based on episodic training and dense convolution. The proposed idea has been evaluated on popular datasets reserved for this research field and gains some improvements. The research results have been submitted to an international conference and await the reviewers' conclusions.

(b) Engineering: apply few-shot learning to train a selected computer vision task and then deploy the trained model to an embedded-system GPU card. To this end, the author selected the application "drowsiness detection". He utilized few-shot learning to train YOLOv5 and then deployed the trained model to an NVIDIA Jetson TX2 successfully. The demo application can run and detect drowsiness live.

7. Main shortcomings of the thesis:

• The publication was not yet available at the time of the defense, as planned.

8. Recommendation: Eligible for defense ☑        Needs revision before defense ☐        Not eligible for defense ☐

9. Three questions the student must answer before the Council:

10. Overall evaluation (in words: excellent, good, average): 10 (ten)

Signature (full name)


Lê Thành Sách

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY        SOCIALIST REPUBLIC OF VIETNAM

Independence - Freedom - Happiness

August 1, 2021

THESIS DEFENSE GRADING SHEET

(For supervisors/reviewers)

1. Student name: Nguyễn Đức Khôi

Student ID: 1752302        Major: Computer Engineering

2. Topic: Research and Apply Few-shot Learning Techniques in Drowsiness Detection

3. Supervisor/reviewer: Nguyễn Đức Dũng

4. Overview of the report:

- Number of hand-drawn figures: _        Number of computer-drawn figures: _

6. Main strengths of the thesis:

The thesis focuses on detecting drowsiness from the human face using deep learning approaches. The team proposed using ResNet blocks instead of the normal convolutional blocks in the YOLOv5 network to improve detection accuracy. The team also deployed this model to an embedded system (Jetson TX2) for real-time performance. The results show some improvement in detection accuracy.

7. Main shortcomings of the thesis:

The replacement of ResNet blocks in the network has been utilized for a while, which makes this contribution a bit weak. The drowsiness detection problem, however, can be solved better by other vision techniques, which can be very fast and real-time. The choice of the current approach is very biased and needs to be reconsidered in the future. The few-shot learning scheme is irrelevant to the main topic under discussion.

8. Recommendation: Eligible for defense ☐        Needs revision before defense ☐        Not eligible for defense ☐

9. Three questions the student must answer before the Council:

a. Why don't you use other vision algorithms to detect drowsiness, even if they would give much better performance than YOLO?

b. Explain why few-shot learning matters. The discussion needs to be improved.

c.

10. Overall evaluation (in words: excellent, good, average): Excellent        Grade: 9/10

Signature (full name)

Nguyễn Đức Dũng


We hereby declare that this thesis titled 'Research and Apply Few-shot Learning Techniques in Computer Vision Application' and the work presented in it are our own. We confirm that:

• This work was done wholly or mainly while in candidature for a degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where we have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely our own work.

• We have acknowledged all main sources of help.

• Where the thesis is based on work done by ourselves jointly with others, we have made clear exactly what was done by others and what we have contributed ourselves.

Acknowledgments

First and foremost, I am tremendously grateful to my advisers, Dr. Sach Le Thanh and Dr. Rang Nguyen Ho Man, for their continuous support and guidance throughout my project, and for giving me the freedom to work on a variety of problems. Second, I take this opportunity to express gratitude to all members of the Faculty of Computer Science and Engineering for their help and support. I also thank my parents for their unceasing encouragement, support, and attention.

Abstract

Artificial intelligence for driving is receiving increasing attention. Drowsiness detection is one of the smaller tasks for improving the driving experience. A drowsiness detector can detect and warn drivers when they fall asleep, preventing accidents caused by drivers' drowsiness. A simple approach is to consider the drowsiness detection problem as an object detection problem. In this thesis, we adopt a powerful object detector called YOLOv5, one of the most popular publicly released frameworks for object detection. In our experiments, the YOLOv5 framework can achieve excellent detection performance with abundant supervised data. In terms of speed, we deploy the trained model to the Jetson TX2 using TensorRT, which significantly outperforms the released PyTorch implementation. In practice, we are not always able to access an abundant amount of labeled data. A limited number of training examples can lead to severely deficient performance, as shown in our experiments. We propose to pretrain the model with other datasets to improve the overall performance without introducing any inference cost. We introduce a pretraining method from few-shot learning that achieves state-of-the-art results on widely used few-shot learning benchmarks. We conduct extensive experiments with several pretraining methods to analyze their transfer performance on object detection tasks.


1.1 Motivation 1

1.2 The Scope of the Thesis 2

1.3 Organization of the Thesis 2

2 Foundations 4

2.1 Probabilities and Statistics Basics 4

2.1.1 Random Variables 4

2.1.2 Probability Distributions 4

2.1.3 Discrete Random Variables - Probability Mass Function 4

2.1.4 Continuous Random Variables - Probability Density Function 4

2.1.5 Marginal Probability 5

2.1.6 Conditional Probability 5

2.1.7 Expectation and Variance 5

2.1.8 Sample 5

2.1.9 Confidence Intervals 6

2.2 Machine Learning Basics 7

2.2.1 Supervised Learning 7

2.2.2 Unsupervised Learning 9

2.2.3 Semi-supervised learning 9

2.3 Few-shot Learning 9

2.4 Object Detection 10

3 Related work 11

3.1 Few-shot Learning 11

3.1.1 Meta-Learning 11

3.1.2 Metrics-Learning 13

3.1.3 Boosting few-shot visual learning with self-supervision 13

3.2 Object Detection 16

3.2.1 Two-stage Detectors 16

3.2.2 One-stage Detectors 18

4 Methods 21

4.1 Problem formulation 21

4.2 Bag of freebies 22

4.3 A Strong Baseline for Few-Shot Learning 22

4.3.1 Joint Training of Episodic and Standard Supervised Strategies 23

4.3.2 Revisiting Pooling Layer 24

4.4 YOLOv5 25


4.4.1 YOLOv5 architecture 25

4.4.2 ResNet-50-YOLOv5 25

5 Experiments 28

5.1 Datasets 28

5.2 Results of Training ResNet-50-YOLOv5 from Scratch with Abundant Annotations 29

5.2.1 Implementation Details 29

5.2.2 Quantitative Results 29

5.2.3 Qualitative Results 30

5.3 Performance of Deploying ResNet-50-YOLOv5 with TensorRT 30

5.3.1 Comparison between TensorRT and PyTorch 30

5.3.2 Effect of image resolution on performance 31

5.4 Result of Baseline on Few-shot Benchmarks 31

5.4.1 Implementation Details 31

5.4.2 Results 32

5.5 Results of Training ResNet-50-YOLOv5 with Limited Annotations 33

5.5.1 Results of Training ResNet-50-YOLOv5 from Scratch with Limited Annotations 34

5.5.2 Results of Training Pretrained ResNet-50-YOLOv5 with Limited Annotations 34

6 Conclusion 42

7 Appendix 43

7.1 Network architecture terminology 43

7.2 Jetson TX2 46

7.2.1 Jetson TX2 Developer Kit 46

7.2.2 JetPack SDK 47

7.3 TensorRT 48

7.3.1 Developing and Deploying with TensorRT 48


List of Tables

2.1 Confusion matrix 8

2.2 Quantities in Confusion Matrix of Testing for Coronavirus 9

5.1 Evaluation of training ResNet-50-YOLOv5 from scratch 29

5.2 Performance of deploying trained ResNet-50-YOLOv5 into Jetson TX2 31

5.3 Comparison to prior work on CIFAR-FS and FC100 34

5.4 Comparison with previous works on mini-ImageNet 35

5.5 Evaluation of training ResNet-50-YOLOv5 from scratch 36

5.6 Comparison between Our Baseline and Standard Supervised Training on mini-ImageNet benchmark 36

5.7 Performance of mini-ImageNet-pretrained ResNet-50-YOLOv5 36

5.8 Performance of ImageNet-pretrained ResNet-50-YOLOv5 39


List of Figures

1.1 Giraffe 1

2.1 The graph of the standard normal distribution 6

2.2 Mnist digit dataset 8

3.1 MAML algorithms 11

3.2 LEO algorithms 12

3.3 Meta-SGD algorithms 13

3.4 [9] 13

3.5 Relation Network 15

3.6 Attention-based weight generator 16

3.7 R-CNN 17

3.8 Fast R-CNN 17

3.9 Faster R-CNN 18

3.10 You only look once model 19

3.11 Single Shot MultiBox Detector 20

4.1 Problem formulation 21

4.2 Overall development of our system 22

4.3 a) Different kernel sizes of pooling layers applied to feature maps b) Adapted pooling layer 24

4.4 YOLOv5 Architecture 26

4.5 resnet50xYOLOv5 architecture 27

5.1 Dataset samples 28

5.2 mini-ImageNet sample images 29

5.3 Qualitative results of training ResNet-50-YOLOv5 from scratch with abundant annotations 30

5.4 Performance on different image sizes 32

5.5 Evaluation on different kernel sizes of the last pooling layer 33

5.6 Precision of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set 37

5.7 Recall of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set 37

5.8 mAP@0.5 of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set 38

5.9 mAP@0.5:0.95 of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set 38

5.10 Precision of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set 40

5.11 Recall of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set 40

5.12 mAP@0.5 of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set 41

5.13 mAP@0.5:0.95 of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set 41


7.1 Jetson TX2 Developer Kit components 46

7.2 NVIDIA SDK Manager 47

Chapter 1

Introduction

1.1 Motivation

Humans are very good at grasping new concepts and adapting to unforeseen circumstances. For instance, humans can recognize a giraffe from just a single picture. This ability is a hallmark of human intelligence. The secret behind this ability is that humans can leverage prior experience to reinforce new concepts. In contrast, a traditional classification model learns an object from scratch and requires massive labeled training data. For example, ResNet [13] was trained on the 1.28 million training images of ImageNet [32] to achieve human-level classification accuracy.

Figure 1.1: Giraffe. Humans can recognize a giraffe after viewing one a single time. In contrast, an object recognition system like ResNet [13] has to train with a far greater number of examples to achieve the human level of accuracy.

In some areas, labeled data is simply too expensive. For instance, the process of collecting medical data is very complicated: it consumes time and resources, and even requires consent from the patients. This causes practical difficulties for traditional machine learning systems to handle such situations. However, the available labeled or unlabeled data from other distributions is enormous. The question is: can we transfer knowledge from available data to new tasks? Can we train a model with available data such that a small amount of annotation from new tasks can produce a high-performing model?

Modern deep learning systems are usually implemented with general-purpose deep learning frameworks, e.g., PyTorch or TensorFlow. These libraries provide great flexibility to construct loss functions at training time, build and modify the models, etc. A deep learning model is usually trained on data-center GPUs with great processing capacity. However, at the deployment stage, inference is sometimes required to run on cheaper devices with smaller capacities. Inference using the same general-purpose deep learning libraries is relatively slow for some applications. Typically, drowsiness detection requires low latency for an immediate response to drowsiness. Hence, there is a need to optimize the trained deep learning model for embedded devices. The most popular approach for NVIDIA embedded devices is to use NVIDIA TensorRT.

1.2 The Scope of the Thesis

We consider drowsiness detection as an object detection problem. More specifically, we consider the problem of detecting drowsiness from the human face. The problem consists of localizing the human face on camera and classifying the drowsy expression from it. We analyze the difficulty of limited data in training a modern object detector. To this extent, we review the recent few-shot learning and object detection literature. We analyze how those few-shot techniques can be applied when transferred to object detection tasks. We then demonstrate the performance of several state-of-the-art pretraining techniques in transferring to object detection tasks. We propose a new object detection network called ResNet-50-YOLOv5, which adopts ResNet-50 as the backbone of the YOLOv5 architecture. The combined network is native to a lot of available unsupervised and few-shot learning methods. Finally, we deploy the proposed model to the Jetson TX2 using the leading framework for processing time, i.e., TensorRT.

1.3 Organization of the Thesis

• In chapter 2, we briefly describe few-shot learning and object detection settings. We also provide some basic math and machine learning concepts that might help the reader understand the rest of the text.

• Chapter 3 summarizes prior research in these areas in a consistent way.

• In chapter 4, we briefly provide object detection settings. We introduce our few-shot learning method, which achieves state-of-the-art results on popular benchmarks, and we show how to apply it to improve a particular object detector. We describe one of the most powerful models for object detection, i.e., YOLOv5. Finally, we provide a detailed description of the network architecture of the proposed ResNet-50-YOLOv5.

• In chapter 5, we first describe the drowsiness detection problem specifically, which includes the dataset, performance metrics, etc. We demonstrate the results of a naive approach to drowsiness detection with full annotations using ResNet-50-YOLOv5. We also report the overall performance of the model when deployed on the Jetson TX2. We then investigate the performance of training ResNet-50-YOLOv5 on limited amounts of training data. Finally, we conduct extensive experiments on pretraining the backbone of ResNet-50-YOLOv5.


• Finally, in chapter 6, we conclude by summarizing what we have done so far and discussing the pros and cons of the work.


Chapter 2

Foundations

In the first part of this chapter, we provide basic math and machine learning concepts that are useful for further discussion in the field. We describe essential subfields of machine learning: supervised and unsupervised learning. Each of them has been studied extensively in the recent few-shot learning literature and still has room for improvement.

In the second part, we provide formal definitions of few-shot learning, object recognition, object detection, and their terminologies.

2.1 Probabilities and Statistics Basics

2.1.1 Random Variables

A random variable is a variable that can take on different values, each occurring with some probability. If the set of such values is discrete, the random variable is said to be discrete; otherwise, it is continuous.

For example, the outcome of tossing an unbiased coin can be modeled with a random variable x ∈ {0, 1}, where 0 indicates the outcome "head" and 1 indicates the outcome "tail"; each case occurs with probability 1/2.

2.1.2 Probability Distributions

2.1.3 Discrete Random Variables - Probability Mass Function

When describing a probability distribution associated with some discrete random variable, we use a probability mass function. A probability mass function takes a value of the specified random variable as input and outputs the corresponding probability.

A typical probability mass function P of a random variable x must satisfy these properties:

• The domain of P must be the set of all possible values of x.

• ∀x, 0 ≤ P(x) ≤ 1.

• Σ_x P(x) = 1.
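As a quick illustration, these properties can be checked numerically for the fair-coin example above (a minimal Python sketch; the helper name `is_valid_pmf` is ours, not part of the thesis):

```python
# PMF for the fair-coin example: x in {0, 1}, each with probability 1/2.
pmf = {0: 0.5, 1: 0.5}  # 0 = "head", 1 = "tail"

def is_valid_pmf(p):
    """Check the PMF properties: 0 <= P(x) <= 1 for all x, and sum_x P(x) = 1."""
    in_range = all(0.0 <= v <= 1.0 for v in p.values())
    sums_to_one = abs(sum(p.values()) - 1.0) < 1e-9
    return in_range and sums_to_one

print(is_valid_pmf(pmf))               # True
print(is_valid_pmf({0: 0.7, 1: 0.7}))  # False: probabilities sum to 1.4
```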

2.1.4 Continuous Random Variables - Probability Density Function

When describing the probability distribution of a continuous random variable, we use a probability density function.

A typical probability density function p of a random variable x must satisfy these properties:

• The domain of p must be the set of all possible values of x.

• ∀x, p(x) ≥ 0 (note that p(x) is not required to be at most 1).

• ∫ p(x) dx = 1.

2.1.7 Expectation and Variance

The expectation of some function f(x) with respect to a probability mass function P(x) is defined as

E[f(x)] = Σ_x P(x) f(x)

The variance measures how much the values of f(x) vary around this expectation:

var(f(x)) = E[(f(x) − E[f(x)])²]   (2.6)

2.1.8 Sample

Statistical inference is concerned with making decisions about a population based on theinformation in a random sample drawn from that population

Random sample. The random variables X_1, X_2, …, X_n are a random sample of size n if the X_i's are independent random variables and every X_i has the same probability distribution.

Statistic. A statistic is any function of the observations in a random sample.

Some important statistics are the sample mean X̄, the sample variance S², and the sample standard deviation S. Because observations vary as samples are randomly drawn, a statistic will also vary. As a result, a statistic is a random variable associated with some probability distribution. The probability distribution of a statistic is called a sampling distribution. The sample mean is

X̄ = (X_1 + X_2 + ··· + X_n) / n

Central Limit Theorem. If X_1, X_2, …, X_n is a random sample of size n taken from a population (either finite or infinite) with mean μ and finite variance σ², and if X̄ is the sample mean, the limiting form of the distribution of

Z = (X̄ − μ) / (σ/√n)

as n → ∞ is the standard normal distribution.

Figure 2.1: The graph of the standard normal distribution

The t distribution. Let X_1, X_2, …, X_n be a random sample from a normal distribution with unknown mean μ and unknown variance σ². The random variable

T = (X̄ − μ) / (S/√n)

has a t distribution with n − 1 degrees of freedom.

Confidence interval on the population mean, variance known. If x̄ is the sample mean of a random sample of size n from a normal population with known variance σ², a 100(1 − α)% confidence interval on μ is given by

x̄ − z_{α/2} σ/√n ≤ μ ≤ x̄ + z_{α/2} σ/√n

where z_{α/2} is the upper 100·α/2 percentage point of the standard normal distribution.

Confidence interval on the population mean, variance unknown. If x̄ and s are the mean and standard deviation of a random sample from a normal distribution with unknown variance σ², a 100(1 − α)% confidence interval on μ is given by

x̄ − t_{α/2, n−1} s/√n ≤ μ ≤ x̄ + t_{α/2, n−1} s/√n
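The known-variance interval can be computed directly. The sketch below is illustrative only; the function name and the value z = 1.96 (for a 95% interval) are our choices:

```python
import math

def ci_mean_known_variance(xs, sigma, z=1.96):
    """100(1 - alpha)% CI for the population mean with known variance;
    z = z_{alpha/2} (1.96 gives a 95% interval)."""
    n = len(xs)
    xbar = sum(xs) / n
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

lo, hi = ci_mean_known_variance([4.8, 5.1, 4.9, 5.2, 5.0], sigma=0.2)
print(lo, hi)  # roughly (4.825, 5.175) around the sample mean 5.0
```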


2.2 Machine Learning Basics

Many works described in this text are deep learning techniques, which are a particular type of machine learning. Understanding basic machine learning concepts is crucial for discussing deep learning as well as few-shot learning algorithms.

A machine learning algorithm is an algorithm that is able to learn from data. A particular task in machine learning can be defined by two sets, the training and test sets, corresponding to two stages. In the first stage, or training time, the training set is given to the model. The model aims to learn from the training set so that in the second stage, or test time, it performs well on the test set. A performance metric is used to evaluate how good the model is for a particular dataset. Sometimes we encounter the term 'validation set' in the literature. This set is used to evaluate the model's performance before it is actually tested on the test set. Feedback from the validation set also helps tune the hyperparameters of a model and avoid overfitting.

2.2.1 Supervised Learning

In supervised learning problems, the training set is a collection of pairs of an input data point and its associated target or label. Supervised learning aims to learn a function that accurately predicts the targets for novel data points.

Formally, given a training dataset D = {(x_i, y_i), i = 1, 2, …}, the task is to learn a model that produces a prediction y_predict for an unseen data point x* as accurately as possible. The accuracy of a model is defined by a loss function L(y_predict, y_true) of the prediction y_predict and a 'ground truth' target y_true associated with x*.

In some cases, the predictions are based on a conditional distribution p(y | x, D). The task is then to model the conditional distribution. This is referred to as probabilistic supervised learning.

Classification

Classification is one of the most widespread problems of supervised learning. The task is to classify an input data point into one of the given categories. The set of categories is discrete. Formally, each data point is associated with a target drawn from the set {1, 2, …, k} of k categories. The model is asked to predict the category of the given input. From a probabilistic perspective, the model produces a probability distribution over categories given the training set and the specified data point. An example of classification is recognizing the digit in an image of a handwritten number [16].

Accuracy. Accuracy is one of the ways to evaluate the performance of a classification model. Accuracy is the number of correct predictions divided by the total number of predictions. For instance, let y_predict = [1, 1, 1, 0] be the predictions for four samples whose labels are y_true = [0, 0, 1, 1]. Then the accuracy is 25%, since there is one correct prediction. Accuracy gives a good view of the model's performance. However, it gives no insight into how the underlying model performs in predicting each class.

Figure 2.2: MNIST digit dataset [16].
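The accuracy computation for this example can be written in a few lines (an illustrative sketch; the helper name is ours):

```python
def accuracy(y_predict, y_true):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == t for p, t in zip(y_predict, y_true))
    return correct / len(y_true)

print(accuracy([1, 1, 1, 0], [0, 0, 1, 1]))  # 0.25, i.e., the 25% from the text
```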

Confusion Matrix. The confusion matrix gives more details about a model's performance. More specifically, it records how each class is predicted by the model. For example, y_predict = [1, 1, 1, 0] and y_true = [0, 0, 1, 1] give the following confusion matrix:

                 Predicted as 0    Predicted as 1
True class 0           0                 2
True class 1           1                 1

Table 2.1: Confusion matrix with y_predict = [1, 1, 1, 0] and y_true = [0, 0, 1, 1].

The accuracy can be derived as the sum of the numbers on the diagonal of the confusion matrix divided by the sum of all the numbers in the matrix. Other metrics can also be derived from the confusion matrix. Typically, one may have more interest in one class than in the others. In such a case, we define that class as the positive class, whereas all the other classes are defined as negative. For example, for a coronavirus test, if the test predicts that a person has coronavirus, then the test is 'positive'. Otherwise, it is 'negative'. We have the following quantities:

• The test correctly predicts that a person has coronavirus (True Positive).

• The test incorrectly predicts that a person has coronavirus (False Positive).

• The test correctly predicts that a person does not have coronavirus (True Negative).

• The test incorrectly predicts that a person does not have coronavirus (False Negative).

Table 2.2 shows these quantities in a confusion matrix. Precision and recall are two metrics used to evaluate this test. More specifically, precision is defined as TP / (TP + FP) and recall is defined as TP / (TP + FN). Obviously, precision and recall range from 0 to 1. It is crucial not to miss any infected person, so we need the test to have high recall. On the other hand, wrongly classifying a person as infected wastes medical resources, so we also want the precision to be high. A good test has high precision and high recall, which is not always achievable in practice. There is sometimes a trade-off between precision and recall, depending on the underlying task we are solving.
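These quantities and the two metrics can be computed directly from the predictions (an illustrative sketch using the earlier example; the helper name is ours):

```python
def confusion_counts(y_predict, y_true, positive=1):
    """Count TP, FP, TN, FN for a chosen positive class."""
    tp = fp = tn = fn = 0
    for p, t in zip(y_predict, y_true):
        if p == positive and t == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif t == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts([1, 1, 1, 0], [0, 0, 1, 1])
precision = tp / (tp + fp)   # 1/3
recall = tp / (tp + fn)      # 1/2
print(tp, fp, tn, fn)        # 1 2 0 1
```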

                                       Tested positive         Tested negative
The person has coronavirus             True Positive (TP)      False Negative (FN)
The person does not have coronavirus   False Positive (FP)     True Negative (TN)

Table 2.2: Quantities in the confusion matrix of testing for coronavirus. Illustration of the different quantities in the confusion matrix.

Regression

This class of task is quite similar to classification, except that the output is now a real value. The model has to produce an approximation of the corresponding target given some input data point. Formally, the learning algorithm is asked to produce a function f : ℝⁿ → ℝ, where n is the dimension of the input x. From a probabilistic perspective, the learning algorithm might output a probability density function over y.

2.2.2 Unsupervised Learning

In unsupervised learning, the data points are provided without corresponding targets. The model aims to extract compact information from the data. By extracting information, we mean learning from the distribution of the given data. Examples of unsupervised learning tasks are density estimation, anomaly detection, clustering, and generating new data points from the distribution. Given a dataset D = {x_i, i = 1, 2, …}, probabilistic unsupervised learning tends to model the distribution p(x) over data points.

2.3 Few-shot Learning

In few-shot learning, we are given a base dataset D_b containing a large number of labeled samples of N_b base classes, and a support set D_n^s containing N novel classes, each having K samples. Given D_b and D_n^s, the aim of few-shot learning is to adapt the model to D_n^s after being well trained on D_b. Another set D_n^q of query samples drawn from the novel classes is used to evaluate the generalization ability of a few-shot learning model. Such a configuration is called an N-way K-shot task; D_n^q is called the query set.
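The N-way K-shot setup can be made concrete with a small sampling routine (a sketch under our own naming; the thesis does not prescribe this code):

```python
import random

def sample_episode(novel_data, n_way, k_shot, n_query):
    """Sample one N-way K-shot task from novel classes.

    novel_data: dict mapping a class label to its list of examples.
    Returns a support set (N*K labeled samples, the D_n^s analogue) and
    a query set (N*n_query labeled samples, the D_n^q analogue)."""
    classes = random.sample(sorted(novel_data), n_way)
    support, query = [], []
    for c in classes:
        picked = random.sample(novel_data[c], k_shot + n_query)
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query

# Toy data: 10 novel classes with 20 (fake) examples each.
data = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=15)
print(len(support), len(query))  # 5 75
```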

Few-shot learning was proposed to tackle the problem of data scarcity, and meta-learning is the most promising approach. Most meta-learning algorithms can be classified into black-box adaptation, optimization-based inference, and non-parametric methods. We briefly describe meta-learning and some terminologies.

Meta-learning. Machine learning algorithms deal with tasks, e.g., classification, regression, anomaly detection, or sampling from a distribution. A task usually consists of a training set, a test set, and a performance measure. For example, in the object recognition task, the training set and the test set are sets of images and labels. The two sets have no common images. The labels of the test set are used for evaluating machine learning algorithms. The performance is the accuracy of predicting categories for the test set. Denote the training set and test set as D^tr and D^ts, respectively. In this text, we mostly consider the supervised task where D^tr = {(x_i, y_i), i = 1, 2, …, k} and D^ts = {(x_i, y_i), i = 1, 2, …, l}.

Meta-learning does not treat tasks independently. Instead, meta-learning accumulates experience from prior tasks to quickly adapt to a new one. The process mimics human behavior, as we do not learn tasks from scratch. Formally, given meta-training data D_meta-train = {(D_i^tr, D_i^ts), i = 1, 2, …, k}, the meta-learning model learns to perform well on meta-test data D_meta-test = {(D_i^tr, D_i^ts), i = 1, 2, …, l}. There is an analogy between meta-learning and standard supervised learning: supervised learning learns to predict the target y given the data point x, whereas meta-learning learns to perform well on D^ts given D^tr.

In the paradigm of meta-learning, a component called the "learner" learns the new tasks, and another component called the "meta-learner" trains the learner. As a result, there are two optimizations corresponding to the two components. Some literature refers to the optimization of the meta-learner as the outer loop and to the optimization of the learner as the inner loop, or adaptation.

2.4 Object Detection

In some contexts, we want to correctly classify images into predefined classes and also want to precisely localize the regions that contain one or multiple objects within the considered image. The recent evolution of deep learning started a new research wave on this topic, which is now well defined as object detection. Object detection can be further divided into two sub-topics, namely general object detection and its applications. General object detection tends to explore new general approaches, algorithms, or techniques toward improving the model's localizing and classifying abilities. Research into applications spans a large number of areas, such as pedestrian detection, car detection, face detection, etc. Object detection, along with image classification, is one of the most critical problems in modern visual understanding. Their applications are related to many downstream tasks, including autonomous driving, object tracking, medical imaging, etc.

Object detection consists of two subtasks, namely object localization and object classification. Given an image of objects, an object detection model aims to predict the bounding boxes which localize the underlying objects in the image and to associate them with class predictions. In object detection, the model is usually trained using data with bounding boxes and class labels. For a single task, there is typically a specific set of underlying object categories. The bounding boxes of the objects within an image are defined by their offsets relative to the size of the image. The class label for an object is encoded by a numerical index.

Chapter 3

Related work

3.1 Few-shot Learning

3.1.1 Meta-Learning

In the recent few-shot learning literature, gradient-based meta-learning methods were referred to as meta-learning, whereas metric learning was classified as a non-parametric meta-learning technique. For more detail on meta-learning, see [8].

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

In [7], the authors proposed to learn a set of model parameters such that a small number of gradient descent iterations will produce good performance on a new task. Intuitively, the model is trained to a good initial state, and for each new task, good performance can be obtained by fine-tuning on a small amount of data.

Figure 3.1: MAML algorithm (figure from [7]). During meta-training, a large number of batches, each consisting of multiple tasks sampled from p(T), is generated. For each task Ti in the batch, the set θi of task-specific parameters is derived by fine-tuning the model parameters θ on the training set of that task. The task-specific parameters are then evaluated on the task's test set. Finally, the loss accumulated over the batch is used to update the model parameters θ.


In effect, for a parametric model f_θ, MAML aims to learn a single set of parameters θ* that is 'close' to all the task-specific parameters. Starting from such a set of parameters, the model is able to quickly adapt to novel tasks within a few optimization steps.

The work is compatible with a range of models trained with gradient descent and applicable to many applications. However, models that learn to adapt with MAML tend to be unstable and hard to train to good performance [2].
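The two-loop optimization can be sketched on a toy scalar regression family. This is a first-order simplification (it omits the second-order terms of full MAML), and the quadratic toy task distribution and all hyperparameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task():
    """A toy regression task: fit y = w*x with a task-specific w."""
    w = rng.uniform(-2.0, 2.0)
    x_train, x_test = rng.normal(size=5), rng.normal(size=5)
    return (x_train, w * x_train), (x_test, w * x_test)

def grad(theta, x, y):
    # gradient of the mean squared error 0.5*(theta*x - y)^2 for a scalar model
    return np.mean((theta * x - y) * x)

theta = 0.0                       # meta-learned initialization
inner_lr, outer_lr = 0.1, 0.01
for step in range(2000):
    meta_grad = 0.0
    for _ in range(4):            # a meta-batch of tasks
        (xs, ys), (xq, yq) = make_task()
        theta_i = theta - inner_lr * grad(theta, xs, ys)   # inner loop: adapt
        meta_grad += grad(theta_i, xq, yq)                 # first-order outer gradient
    theta -= outer_lr * meta_grad / 4                      # outer loop: update the init
```

The inner loop fine-tunes on each task's support set; the outer loop moves the shared initialization so that those one-step adaptations do well on the tasks' query sets.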

Meta-learning with latent embedding optimization

In MAML, the learner space and the meta-learner space are the same. As a result, the inner loop as well as the outer loop has to be performed in a high-dimensional parameter space, which is inefficient. In Meta-learning with latent embedding optimization (LEO) [33], the authors relax this limitation by introducing a latent space.

The authors argue that it is beneficial to relax the assumption that there exists a single optimal θ* and to replace such a θ with a generative model over a low-dimensional space; more specifically, a task-dependent conditional probability distribution over θ. The adaptation steps are now performed in the low-dimensional latent space, and task-specific parameters are sampled from the generative model.

Figure 3.2: LEO algorithm (figure from [33]). Instead of maintaining high-dimensional parameters θ to capture information from the task distribution as in MAML, LEO proposes a low-dimensional generative model, conditioned on the underlying tasks through an encoding technique. The adaptation steps are now performed in this latent space. Finally, the task-specific parameters are sampled from the generative model.

Although LEO has relaxed one difficulty of MAML and other parametric meta-learning algorithms, namely the high-dimensional optimization space, it is still hard to train to good performance. In the LEO work, the authors in fact had to tune the model hyperparameters carefully.

Meta-SGD: Learning to Learn Quickly for Few-Shot Learning

Meta-SGD [18] is closely related to MAML, but it has 'higher capacity'. Meta-SGD not only learns the initial parameters of the model but also learns the direction and learning rate of the adaptation procedure. In the beginning, the model parameters and the learning rate of the adaptation steps are randomly initialized. The inner loop is mostly the same as in MAML. However, in the outer loop, the parameters as well as the learning rate are updated according to the loss of the batch.


Figure 3.3: Meta-SGD algorithm (figure from [18]). The learning rate of the adaptation (inner-loop) steps is updated along with the model parameters during meta-training.
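The key difference from MAML can be sketched on the same kind of toy scalar task: besides the initialization θ, the inner-loop learning rate α is itself meta-learned. This is a first-order simplification, and the toy task family and hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(theta, x, y):
    # gradient of the mean squared error 0.5*(theta*x - y)^2 for a scalar model
    return np.mean((theta * x - y) * x)

theta, alpha = 0.0, 0.1       # meta-learned initialization AND inner learning rate
outer_lr = 0.01
for step in range(1000):
    w = rng.uniform(-2.0, 2.0)                        # sample a toy task y = w*x
    xs, xq = rng.normal(size=5), rng.normal(size=5)   # support / query inputs
    ys, yq = w * xs, w * xq
    g_s = grad(theta, xs, ys)
    theta_i = theta - alpha * g_s       # inner loop with the learned rate
    g_q = grad(theta_i, xq, yq)         # query gradient at the adapted parameters
    theta -= outer_lr * g_q             # outer update of the initialization
    alpha -= outer_lr * (-g_s * g_q)    # outer update of the rate: dL/dalpha = -g_s*g_q
```

Since θ_i = θ − α·g_s, the chain rule gives dL/dα = −g_s·g_q, so the learning rate receives a meta-gradient of its own, which is what gives Meta-SGD its 'higher capacity'.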

3.1.2 Metric Learning

3.1.3 Boosting few-shot visual learning with self-supervision

The work in [9] is an add-on module that is, in principle, applicable to various methods. Observing that a model trained on the base categories tends to overfit, the authors introduce a self-supervised task alongside the classification task, as a form of auxiliary loss.

Figure 3.4: (figure from [9]) At training time, the network is trained with standard supervised learning to predict object labels as well as rotation labels for images.

The best described combination is as follows. At training time, the model is trained in a standard supervised manner. Specifically, batches of images are sampled from the base training set Db. For each batch B = {(xi, yi), i = 1, 2, ..., m}, the four rotations in R = {0°, 90°, 180°, 270°} are applied to each image. The resulting batch then contains every rotated copy of each image, paired with its object label and its rotation label.


The training loss is now the sum of the standard classification loss and the self-supervised rotation-prediction loss.

At test time, the classification weights for the novel categories are computed as in 3.5.
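The rotation-based augmentation described above can be sketched as follows; the data layout and function names are assumptions for illustration.

```python
import numpy as np

def augment_batch(batch):
    """Expand a batch of (image, class_label) pairs with the four rotations;
    each example gains a rotation label in {0, 1, 2, 3} (i.e., k * 90 degrees)."""
    out = []
    for img, y in batch:
        for k in range(4):
            out.append((np.rot90(img, k), y, k))  # (image, class, rotation)
    return out

batch = [(np.arange(9, dtype=float).reshape(3, 3), 0)]
aug = augment_batch(batch)
print(len(aug))  # → 4
```

The network is then trained to predict both labels of each augmented example: the object class (supervised loss) and the rotation index (self-supervised auxiliary loss).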

Prototypical networks for few-shot learning

Prototypical networks [36] aim to solve the few-shot classification problem. This is a non-parametric method, i.e., the adaptation steps are 'computing steps' rather than being 'optimized'. Prototypical networks aim to learn a feature extractor f_θ which generalizes well at test time. For an N-way K-shot support set Ds ⊂ I_N × Y_N, where I_N is the set of images for the N specified classes and Y_N = {1, 2, ..., N}, the algorithm first feeds all the images into the feature extractor f_θ. The 'prototype' for each class i ∈ Y_N is computed as the average of the extracted features of the support images of that class:

p_i = (1/K) Σ_{x ∈ I_i} f_θ(x),   (3.5)

where I_i = {x | (x, y) ∈ Ds, y = i}.

The normalized score for each class i, given a query image x_q, is then

C_i(f_θ(x_q), Ds) = exp(γ · sim(f_θ(x_q), p_i)) / Σ_{j ∈ Y_N} exp(γ · sim(f_θ(x_q), p_j)),   (3.6)

where sim(·,·) is a similarity function and γ is the inverse temperature. The authors investigate the use of cosine similarity and of the negative squared Euclidean distance as the similarity function in 3.6.

The loss at training time for each batch is then the negative log-likelihood of the true class under 3.6, averaged over the query examples.
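Equations 3.5 and 3.6 can be sketched directly; the toy 2-D embeddings below stand in for the output of the feature extractor f_θ, and negative squared Euclidean distance is used as the similarity.

```python
import numpy as np

def prototypes(feats, labels, n_classes):
    """Class prototypes: mean embedding of each class's support examples (eq. 3.5)."""
    return np.stack([feats[labels == i].mean(axis=0) for i in range(n_classes)])

def proto_scores(query, protos, gamma=1.0):
    """Softmax over negative squared Euclidean distance to each prototype (eq. 3.6)."""
    d2 = ((protos - query) ** 2).sum(axis=1)   # squared distance to each prototype
    logits = -gamma * d2
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

# 2-way 2-shot toy support set in a 2-D embedding space
feats = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(feats, labels, 2)
probs = proto_scores(np.array([0., 0.5]), protos)
print(probs.argmax())  # → 0
```

Note that 'adaptation' here is just the prototype computation: no gradient step is taken at test time, which is what makes the method non-parametric in the above sense.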


Figure 3.5: Relation Network (figure from [37]). The overall model consists of two modules, namely an embedding module and a relation module. The embedding module extracts features from input images into a fixed-size vector. The relation module acts as a learnable similarity function. The classification is performed in a nearest-neighbor manner.

Learning to compare: Relation network for few-shot learning

The overall algorithm of the Relation network [37] is quite similar to that of the Prototypical network. The authors propose to use a deep distance function rather than the standard squared Euclidean distance. The deep distance metric is learned along with the feature extractor f_θ during meta-training. More formally, let f_θ and g_φ be the feature extractor and the 'relation module', respectively. For an episode Ds ⊂ I_N × Y_N, the relation score of a query image x_q with respect to a class i is

C_i(f_θ(x_q), Ds) = g_φ(f_θ(x_q), p_i).   (3.8)

The authors propose to regress the relation score to the ground truth in a one-hot coding scheme, using a mean squared error loss.
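A minimal sketch of a relation module: a small MLP that scores a (query feature, prototype) pair. The weights here are random stand-ins for what would be meta-learned, and the two-layer architecture and dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP as the relation module g_phi: scores the concatenation of a
# 2-D query embedding and a 2-D class prototype (weights are untrained).
W1 = rng.normal(size=(8, 4)) * 0.5
W2 = rng.normal(size=(1, 8)) * 0.5

def relation_score(query_feat, proto):
    z = np.concatenate([query_feat, proto])      # pair representation
    h = np.maximum(0.0, W1 @ z)                  # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h)[0]))    # sigmoid score in [0, 1]

score = relation_score(np.array([0.1, 0.2]), np.array([0.1, 0.3]))
```

Replacing a fixed distance with this learnable scorer is the whole difference from Prototypical networks: the comparison itself is trained end-to-end with the embedding.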

Dynamic few-shot visual learning without forgetting

In [10], the authors do not compute prototypes for the base categories during meta-training. Instead, the authors maintain a set of base weights Wb = {w_1, ..., w_{N_b}}. At meta-training time, the base weights are substituted for the set of prototypes p_i and are trained along with the feature extractor f_θ. The score of a query image x_q with respect to a class i is

C_i(f_θ(x_q); Wb) = exp(γ · cos(f_θ(x_q), w_i)) / Σ_{w ∈ Wb} exp(γ · cos(f_θ(x_q), w)).   (3.10)


Figure 3.6: Attention-based weight generator (figure from [10]). The model consists of a feature extractor, a classifier, and a few-shot classification weight generator. All are trained simultaneously. At test time, the classification weight generator takes the base categories and novel images from the support set as input and produces the classification weights for the novel classes.

At test time, the prototypes or weights for the novel categories are derived from an attention-based weight inference procedure. The detailed form of the weight for a typical novel class is as follows:

w = φ_avg ⊙ w_avg + φ_att ⊙ w_att,   (3.11)

where w_avg is computed as the prototype in 3.5; w_att is an attention-based weight pooled from the base weights; φ_avg and φ_att are learnable weight vectors. Finally, the category inference step is the same as in the Prototypical network.

The authors argue that the attention mechanism can help the network explicitly capture prior knowledge and transfer it to the novel classes at test time, hence improving the performance.

3.2 Object Detection

There are multiple ways to divide recent object detection approaches into groups. One could split these approaches into regression-based detectors and sliding-window detectors. Another way is to classify object detection methods based on their number of stages. In this text, we consider the latter.

3.2.1 Two-stage Detectors

Rich feature hierarchies for accurate object detection and semantic segmentation

This work from [12], also known as R-CNN, is one of the most popular deep-learning approaches to object detection. At the time, convolutional neural networks (CNNs) were attracting huge attention, since [15] had achieved state-of-the-art results on image classification benchmarks using a CNN. R-CNN is the result of combining region proposals with CNNs.

Instead of exhaustively investigating all bounding boxes within an image, they use a search algorithm to produce a reasonable number (about 2000) of regions. These regions are called region proposals.

Figure 3.7: R-CNN (figure from [12]). In R-CNN, an embedding network extracts a feature map for each of the ~2000 region proposals within the underlying image. An SVM is then adopted to classify the class label for each feature vector.

A feature extractor then takes each proposal as input and produces a corresponding feature map. Finally, an SVM is adopted to classify the regions.

Fast R-CNN

In [11], the authors propose a method called Fast Region-based Convolutional Network, or Fast R-CNN, to tackle the object detection problem. The work focuses on improving training and testing speed, being nine times and 213 times faster than R-CNN at the training and testing stages, respectively. The authors point out three significant drawbacks of R-CNN: its multi-stage training pipeline, the cost of training, and detection speed.

Figure 3.8: Fast R-CNN (figure from [11]). A convolutional network takes an entire image and several object proposals as input. The resulting feature map is then ROI-pooled to produce a set of fixed-size vectors, which are then post-processed into class probabilities and bounding-box offsets.

Fast R-CNN first feeds an image and a set of object proposals to a convolution-based feature extractor. The output feature maps are then ROI-pooled into a fixed-size feature vector for each object proposal. Finally, a few fully connected layers output class probabilities and bounding-box offsets.
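The ROI pooling step can be sketched for a single-channel feature map. This is a simplified illustration: real implementations handle channels, batching, and the spatial scale between the image and the feature map.

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Max-pool the region `roi = (x0, y0, x1, y1)` of a 2-D feature map
    into a fixed out_size x out_size grid, regardless of the region's size."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    out = np.zeros((out_size, out_size))
    ys = np.linspace(0, h, out_size + 1).astype(int)  # row bin edges
    xs = np.linspace(0, w, out_size + 1).astype(int)  # column bin edges
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i+1], xs[j]:xs[j+1]].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
pooled = roi_pool(fmap, (0, 0, 4, 4))
print(pooled.shape)  # → (2, 2)
```

Because every proposal, whatever its size, is mapped to the same fixed grid, the downstream fully connected layers can consume proposals of arbitrary shape.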

Faster R-CNN: Towards real-time object detection with region proposal networks

[31] comes up with an even faster version of R-CNN in which the region proposal stage is learnable. Prior works utilize a region proposal algorithm and build deep models on top of it.


However, it turns out that the region proposal algorithm becomes a bottleneck.

Figure 3.9: Faster R-CNN (figure from [31]). The region proposals are produced from a learnable module, which significantly improves time performance.

In this paper, the authors propose to leverage a shared convolution-based model for region proposals and feature extraction. More specifically, an embedding network extracts features from an input image. The resulting feature map is then fed to a region predictor, producing region proposals for the image. Finally, the original feature map is ROI-pooled according to the region proposals from the predictor.

3.2.2 One-stage Detectors

You only look once: Unified, real-time object detection

YOLO [28] introduces the use of a single CNN model for object detection. The authors formulate object detection as a regression problem in which bounding-box offsets and class probabilities are inferred directly in one step. Specifically, the input image is first divided into an S × S grid. A patch is said to be responsible for detecting an object if the center of the object lies within the patch. Each patch can predict B bounding boxes at the same time; each of them is defined by five values indicating center coordinates, width, height, and confidence score, respectively. The confidence score is defined as P(Object) × IOU^truth_pred, where P(Object) is the probability that the bounding box contains an object, and IOU^truth_pred is the IOU of the predicted bounding box and the ground-truth one. The patches also predict conditional class probabilities, given that the underlying patch contains an object. As a result, the output of the network has a size of S × S × (B × 5 + C). A limitation of this framework is that it can only predict one object per grid patch: if two or more objects have centers falling within the same patch, the network will be confused.
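The output-size formula S × S × (B × 5 + C) can be checked with a small sketch; the values S = 7, B = 2, C = 20 correspond to the paper's PASCAL VOC setting.

```python
import numpy as np

def yolo_output_shape(S, B, C):
    """Size of the YOLO prediction tensor: for each of the S*S grid patches,
    B boxes of 5 values (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)

def decode_patch(patch_vec, B, C):
    """Split one patch's prediction vector into its boxes and class scores."""
    boxes = patch_vec[:B * 5].reshape(B, 5)   # each row: x, y, w, h, conf
    class_probs = patch_vec[B * 5:]           # conditional class probabilities
    return boxes, class_probs

S, B, C = 7, 2, 20
shape = yolo_output_shape(S, B, C)
print(shape)  # → (7, 7, 30)
```

The single-object-per-patch limitation is visible here: each patch carries only one set of C class probabilities, shared by all of its B boxes.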


Figure 3.10: YOLO model (figure from [28]). The input image is divided into an S × S grid. Each grid patch predicts B bounding boxes by producing their offsets and confidence scores, and predicts C conditional class probabilities. Finally, non-maximal suppression is utilized to reduce multiple detections of the same object.

YOLO9000: Better, Faster, Stronger

[29] introduces an incrementally improved version of YOLO, i.e., YOLOv2. More specifically, in this work the authors present a series of small modifications that lead to improvements over the first version of YOLO. Some key changes are: a high-resolution classifier, convolutional prediction with anchor boxes, dimension clusters, direct location prediction, fine-grained features, and multi-scale training. They also propose a joint training scheme that can leverage weakly supervised information from classification datasets to further improve the model's capacity.

SSD: Single shot multibox detector

[20] introduces a set of default boxes with different shapes and aspect ratios into the detection procedure, rather than inferring region proposals. The network is trained to predict the presence of the given classes within each default box and to produce adjustments for accurate final bounding boxes. The final decisions are produced by a non-maximum suppression step.
