VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY FACULTY OF COMPUTER SCIENCE & ENGINEERING
——————– * ———————
BACHELOR THESIS
Study and Improve Few-shot Learning Techniques in Computer Vision Application
Major: Computer Engineering
Council: Computer Engineering 1 Supervisor: Dr Le Thanh Sach
Dr Nguyen Ho Man Rang Reviewer: Dr Nguyen Duc Dung
—o0o—
Student: Nguyen Duc Khoi (1752302)
HO CHI MINH CITY, 8/2021
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science & Engineering GRADUATION THESIS ASSIGNMENT
DEPARTMENT: Computer Science. Note: the student must attach this sheet to the first page of the thesis report
Student ID: 1752302 Student name: NGUYEN DUC KHOI Major: Computer Engineering
1 Thesis title:
EN: A study on few-shot learning for computer vision applications
VN: Nghiên cứu và cải tiến kỹ thuật học với số ít mẫu được làm nhãn cho các ứng dụng trong thị giác máy tính
2 Tasks (requirements on content and initial data):
• Study deep learning and review the literature on few-shot learning;
• Propose a learning technique for training deep models (in computer vision) with popular datasets available on the Internet;
• Apply few-shot learning to a computer vision application, from training and tuning to deploying the trained model on embedded systems supported by NVIDIA’s technologies
3 Thesis assignment date: 01/01/2021
4 Completion date: 01/08/2021
1) Lê Thành Sách, co-supervisor 2) Nguyễn Hồ Mẫn Rạng, co-supervisor
The content and requirements of the thesis have been approved by the Department
Date (day, month, year): 2021
(Signature and full name) (Signature and full name)
Lê Thành Sách
FOR FACULTY AND DEPARTMENT USE:
Approved by (preliminary grading):
Unit: _
Defense date: _
Final grade: _
UNIVERSITY OF TECHNOLOGY SOCIALIST REPUBLIC OF VIETNAM
EN: A study on few-shot learning for computer vision applications
VN: Nghiên cứu và cải tiến kỹ thuật học với số ít mẫu được làm nhãn cho các ứng dụng trong thị giác máy tính
3 Supervisor's name: Dr. Lê Thành Sách
4 Overview of the thesis report:
- Number of hand-drawn figures: Number of computer-drawn figures:
6 Main strengths of the thesis:
• The author has mastered the techniques required for designing deep learning models and for training, tuning, and deploying them to GPU cards with NVIDIA’s technologies.
• The thesis consists of a science task and an engineering task related to deep learning, as follows: (a) Science: improve a selected few-shot learning technique for computer vision. The author has proposed an idea based on episodic training and dense convolution. The proposed idea has been evaluated on popular datasets used in the research field and gains some improvements. The research results have been submitted to an international conference and are awaiting the reviewers’ conclusions.
(b) Engineering: apply few-shot learning to train a model for a selected computer vision task and then deploy the trained model to an embedded GPU system. To this end, the author selected the application “drowsiness detection”. He used few-shot learning to train YOLOv5 and then successfully deployed the trained model to the NVIDIA Jetson TX2. The demo application runs and detects drowsiness live.
7 Main shortcomings of the thesis:
• The publication is not available at the time of the defense, as had been planned.
8 Recommendation: Approved for defense [x] Revise before defense [ ] Not approved for defense [ ]
9 Three questions the student must answer before the Council:
10 Overall evaluation (in words: excellent, good, average): 10 (ten)
Signature (full name)
Lê Thành Sách
UNIVERSITY OF TECHNOLOGY SOCIALIST REPUBLIC OF VIETNAM
August 1, 2021
THESIS DEFENSE EVALUATION FORM
(For the supervisor/reviewer)
1 Student name: Nguyễn Đức Khôi
Student ID: 1752302 Major (specialization): Computer Engineering
2 Topic: Research and Apply Few-shot Learning Techniques in Drowsiness Detection
3 Supervisor/reviewer name: Nguyễn Đức Dũng
4 Overview of the thesis report:
- Number of hand-drawn figures: Number of computer-drawn figures:
6 Main strengths of the thesis:
The thesis focuses on detecting drowsiness from the human face using deep learning approaches. The team proposed using ResNet blocks instead of normal convolutional blocks in the YOLOv5 network to improve the detection accuracy. The team also deployed this model to an embedded system (Jetson TX2) for real-time performance. The results show some improvement in the detection accuracy.
7 Main shortcomings of the thesis:
Replacing network blocks with ResNet blocks has been common practice for a while, which makes this contribution a bit weak. The drowsiness detection problem, however, can be solved better by other vision techniques, which can be very fast and run in real time. The choice of the current approach is very biased and needs to be reconsidered in the future. The few-shot learning scheme is irrelevant to the main topic under discussion.
8 Recommendation: Approved for defense [ ] Revise before defense [ ] Not approved for defense [ ]
9 Three questions the student must answer before the Council:
a Why don't you use other vision algorithms to detect drowsiness, even if they would give much better performance than YOLO?
b Explain why few-shot learning matters. The discussion needs to be improved.
c
10 Overall evaluation (in words: excellent, good, average): Excellent Score: 9/10
Signature (full name)
Nguyễn Đức Dũng
We hereby declare that this thesis titled ‘Research and Apply Few-shot Learning Techniques in Computer Vision Application’ and the work presented in it are our own. We confirm that:
• This work was done wholly or mainly while in candidature for a degree at this University.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where we have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely our own work.
• We have acknowledged all main sources of help.
• Where the thesis is based on work done by ourselves jointly with others, we have made clear exactly what was done by others and what we have contributed ourselves.
Acknowledgments
First and foremost, I am tremendously grateful to my advisers, Dr. Sach Le Thanh and Dr. Rang Nguyen Ho Man, for their continuous support and guidance throughout my project, and for giving me the freedom to work on a variety of problems. Second, I take this opportunity to express my gratitude to all members of the Faculty of Computer Science and Engineering for their help and support. I also thank my parents for their unceasing encouragement, support, and attention.
Abstract
Artificial intelligence for driving is receiving more attention. Drowsiness detection is one of the smaller tasks that improve the driving experience. A drowsiness detector can detect when drivers fall asleep and warn them, preventing accidents caused by drowsiness. A simple approach is to treat drowsiness detection as an object detection problem. In this thesis, we adopt a powerful object detector called YOLOv5, one of the most popular publicly released frameworks for object detection. In our experiments, the YOLOv5 framework achieves excellent detection performance with abundant supervised data. In terms of speed, we deploy the trained model to the Jetson TX2 using TensorRT, which significantly outperforms the released PyTorch implementation. In practice, we are not always able to access an abundant amount of labeled data. A limited number of training examples can lead to severely deficient performance, as shown in our experiments. We propose to pretrain the model on other datasets to improve the overall performance without introducing any additional inference cost. To pretrain the model, we introduce a pretraining method from few-shot learning that achieves state-of-the-art results on widely used few-shot learning benchmarks. We conduct extensive experiments with several pretraining methods to analyze their transfer performance on object detection tasks.
1.1 Motivation
1.2 The Scope of the Thesis
1.3 Organization of the Thesis
2 Foundations
2.1 Probabilities and Statistic Basics
2.1.1 Random Variables
2.1.2 Probability Distributions
2.1.3 Discrete Random Variables - Probability Mass Function
2.1.4 Continuous Random Variables - Probability Density Function
2.1.5 Marginal Probability
2.1.6 Conditional Probability
2.1.7 Expectation and Variance
2.1.8 Sample
2.1.9 Confidence Intervals
2.2 Machine Learning Basics
2.2.1 Supervised Learning
2.2.2 Unsupervised Learning
2.2.3 Semi-supervised learning
2.3 Few-shot Learning
2.4 Object Detection
3 Related work
3.1 Few-shot Learning
3.1.1 Meta-Learning
3.1.2 Metrics-Learning
3.1.3 Boosting few-shot visual learning with self-supervision
3.2 Object Detection
3.2.1 Two-stage Detectors
3.2.2 One-stage Detectors
4 Methods
4.1 Problem formulation
4.2 Bag of freebies
4.3 A Strong Baseline for Few-Shot Learning
4.3.1 Joint Training of Episodic and Standard Supervised Strategies
4.3.2 Revisiting Pooling Layer
4.4 YOLOv5
4.4.1 YOLOv5 architecture
4.4.2 ResNet-50-YOLOv5
5 Experiments
5.1 Datasets
5.2 Results of Training ResNet-50-YOLOv5 from Scratch with Abundant Annotations
5.2.1 Implementation Details
5.2.2 Quantitative Results
5.2.3 Qualitative Results
5.3 Performance of Deploying ResNet-50-YOLOv5 with TensorRT
5.3.1 Comparison between TensorRT and PyTorch
5.3.2 Effect of image resolution on performance
5.4 Result of Baseline on Few-shot Benchmarks
5.4.1 Implementation Details
5.4.2 Results
5.5 Results of Training ResNet-50-YOLOv5 with Limited Annotations
5.5.1 Results of Training ResNet-50-YOLOv5 from Scratch with Limited Annotations
5.5.2 Results of Training Pretrained ResNet-50-YOLOv5 with Limited Annotations
6 Conclusion
7 Appendix
7.1 Network architecture terminology
7.2 Jetson TX2
7.2.1 Jetson TX2 Developer Kit
7.2.2 JetPack SDK
7.3 TensorRT
7.3.1 Developing and Deploying with TensorRT
List of Tables
2.1 Confusion matrix
2.2 Quantities in Confusion Matrix of Testing for Coronavirus
5.1 Evaluation of training ResNet-50-YOLOv5 from scratch
5.2 Performance of deploying trained ResNet-50-YOLOv5 into Jetson TX2
5.3 Comparison to prior work on CIFAR-FS and FC100
5.4 Comparison with previous works on mini-ImageNet
5.5 Evaluation of training ResNet-50-YOLOv5 from scratch
5.6 Comparison between Our Baseline and Standard Supervised Training on mini-ImageNet benchmark
5.7 Performance of mini-ImageNet-pretrained ResNet-50-YOLOv5
5.8 Performance of ImageNet-pretrained ResNet-50-YOLOv5
List of Figures
1.1 Giraffe
2.1 The graph of the standard normal distribution
2.2 Mnist digit dataset
3.1 MAML algorithms
3.2 LEO algorithms
3.3 Meta-SGD algorithms
3.4 [9]
3.5 Relation Network
3.6 Attention-based weight generator
3.7 R-CNN
3.8 Fast R-CNN
3.9 Faster R-CNN
3.10 You only look once model
3.11 Single Shot MultiBox Detector
4.1 Problem formulation
4.2 Overall development of our system
4.3 a) Different kernel sizes of pooling layers applied to feature maps b) Adapted pooling layer
4.4 YOLOv5 Architecture
4.5 resnet50xYOLOv5 architecture
5.1 Dataset samples
5.2 mini-ImageNet sample images
5.3 Qualitative results of training ResNet-50-YOLOv5 from scratch with abundant annotations
5.4 Performance on different image sizes
5.5 Evaluation on different kernel sizes of the last pooling layer
5.6 Precision of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.7 Recall of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.8 mAP@0.5 of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.9 mAP@0.5:0.95 of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.10 Precision of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.11 Recall of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.12 mAP@0.5 of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.13 mAP@0.5:0.95 of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
7.1 Jetson TX2 Developer Kit components
7.2 NVIDIA SDK Manager
Humans are very good at grasping new concepts and adapting to unforeseen circumstances. For instance, humans can recognize a giraffe from just a single picture. This ability is a hallmark of human intelligence. The secret behind this ability is that humans can leverage prior experience to reinforce new concepts. In contrast, a traditional classification model learns an object from scratch and requires massive labeled training data. For example, ResNet [13] was trained on the 1.28 million training images of ImageNet [32] to achieve human-level classification accuracy.
Figure 1.1: Giraffe. Humans can recognize a giraffe after viewing one a single time. In contrast, an object recognition system like ResNet [13] has to be trained on far more examples to achieve human-level accuracy.
In some areas, labeled data is simply too expensive. For instance, collecting medical data is very complicated: it consumes time and resources, and requires consent from the patients. This makes such situations practically difficult for traditional machine learning systems to handle. However, the amount of available labeled or unlabeled data from other distributions is enormous. The question is: can we transfer knowledge from available data to new tasks? Can we train a model on available data such that a small amount of annotation from a new task yields a high-performing model?
Modern deep learning work is usually implemented with general-purpose deep learning frameworks, e.g., PyTorch and TensorFlow. These libraries provide great flexibility to construct loss functions at training time, build and modify models, etc. A deep learning model is usually trained on data center GPUs with great processing capacity. However, at the deployment stage, inference is sometimes required to run on cheaper devices with smaller capacities. Inference with the same general-purpose deep learning libraries is relatively slow for some applications. In particular, drowsiness detection requires low latency so that the system can respond to drowsiness immediately. Hence, there is a need to optimize the trained deep learning model for embedded devices. The most popular approach for NVIDIA embedded devices is to use NVIDIA TensorRT.
1.2 The Scope of the Thesis
We consider drowsiness detection as an object detection problem. More specifically, we consider the problem of detecting drowsiness from the human face. The problem consists of localizing the human face in the camera image and classifying the drowsy expression on it. We analyze the difficulty of training a modern object detector with limited data. To this end, we review the recent few-shot learning and object detection literature. We analyze how few-shot techniques can be applied when transferring to object detection tasks. We then demonstrate the performance of several state-of-the-art pretraining techniques when transferred to object detection tasks. We propose a new object detection network called ResNet-50-YOLOv5, which adopts ResNet-50 as the backbone of the YOLOv5 architecture. The combined network is directly compatible with many available unsupervised and few-shot learning methods. Finally, we deploy the proposed model to the Jetson TX2 using TensorRT, the leading framework in terms of processing time.
1.3 Organization of the Thesis
• In chapter 2, we briefly describe the few-shot learning and object detection settings. We also provide some basic math and machine learning concepts that might help the reader understand the rest of the text.
• Chapter 3 summarizes prior research in these areas in a consistent way.
• In chapter 4, we briefly describe the object detection setting. We introduce our few-shot learning method, which achieves state-of-the-art results on popular benchmarks, and we show how to apply it to improve a particular object detector. We describe one of the most powerful models for object detection, i.e., YOLOv5. Finally, we provide a detailed description of the network architecture of the proposed ResNet-50-YOLOv5.
• In chapter 5, we first describe the drowsiness detection problem in detail, including the dataset, performance metrics, etc. We demonstrate the results of a naive approach to drowsiness detection with full annotations using ResNet-50-YOLOv5. We also report the overall performance of the model when deployed on the Jetson TX2. We then investigate the performance of training ResNet-50-YOLOv5 on a limited amount of training data. Finally, we conduct extensive experiments on pretraining the backbone of ResNet-50-YOLOv5.
• Finally, in chapter 6, we conclude by summarizing what we have done so far and discussing the pros and cons of the work.
Chapter 2
Foundations
In the first part of this chapter, we provide basic math and machine learning concepts that are useful for further discussion in the field. We describe essential subfields of machine learning: supervised and unsupervised learning. Each of them has been studied extensively in the recent few-shot learning literature and still has room for improvement.
In the second part, we provide formal definitions of few-shot learning for object recognition and of object detection, together with their terminology.
2.1 Probabilities and Statistic Basics
2.1.1 Random Variables
A random variable is a variable that can take on different values, each occurring with some probability. If the set of such values is discrete, the random variable is said to be discrete; otherwise, it is continuous.
For example, the outcome of tossing an unbiased coin can be modeled with a random variable $x \in \{0, 1\}$, where 0 indicates the outcome “head” and 1 indicates the outcome “tail”; each case occurs with probability $\frac{1}{2}$.
2.1.2 Probability Distributions
2.1.3 Discrete Random Variables - Probability Mass Function
To describe the probability distribution of a discrete random variable, we use a probability mass function. A probability mass function takes a value of the random variable as input and outputs the corresponding probability.
A probability mass function $P$ of a random variable $x$ must satisfy these properties:
• The domain of $P$ must be the set of all possible values of $x$.
• $\forall x,\ 0 \le P(x) \le 1$.
• $\sum_x P(x) = 1$.
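The properties above can be checked mechanically for a finite distribution. The following is a minimal sketch, not part of the thesis code: the helper name `is_valid_pmf` and the dictionary representation are assumptions made for illustration, and exact fractions are used so that the sum-to-one test does not suffer from floating-point rounding.

```python
from fractions import Fraction


def is_valid_pmf(pmf):
    """Check the PMF properties for a dict mapping each value x to P(x).

    Illustrative helper (hypothetical name): the dict keys play the role of
    the domain of P, so only the two numeric properties need to be tested.
    """
    probs = list(pmf.values())
    in_range = all(0 <= p <= 1 for p in probs)  # 0 <= P(x) <= 1 for every x
    sums_to_one = sum(probs) == 1               # sum over x of P(x) equals 1
    return in_range and sums_to_one


# Unbiased coin from Section 2.1.1: 0 = head, 1 = tail
fair_coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
# A table whose probabilities sum to 5/6, hence not a PMF
not_a_pmf = {0: Fraction(1, 2), 1: Fraction(1, 3)}

print(is_valid_pmf(fair_coin), is_valid_pmf(not_a_pmf))
```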
2.1.4 Continuous Random Variables - Probability Density Function
To describe the probability distribution of a continuous random variable, we use a probability density function.
A probability density function $p$ of a random variable $x$ must satisfy these properties:
• The domain of $p$ must be the set of all possible values of $x$.
2.1.7 Expectation and Variance
The expectation of some function $f(x)$ with respect to a probability mass function $P(x)$ is defined as
$$\mathbb{E}_{x \sim P}[f(x)] = \sum_x P(x)\, f(x)$$
The variance of $f(x)$ measures how far its values spread around this expectation:
$$\mathrm{Var}(f(x)) = \mathbb{E}\left[\left(f(x) - \mathbb{E}[f(x)]\right)^2\right] \qquad (2.6)$$
2.1.8 Sample
Statistical inference is concerned with making decisions about a population based on the information in a random sample drawn from that population.
Random sample. The random variables $X_1, X_2, \ldots, X_n$ are a random sample of size $n$ if the $X_i$'s are independent random variables and every $X_i$ has the same probability distribution.
Statistic. A statistic is any function of the observations in a random sample.
Some important statistics are the sample mean $\bar{X}$, the sample variance $S^2$, and the sample standard deviation $S$. Because the observations vary as samples are randomly drawn, a statistic also varies. As a result, a statistic is a random variable associated with some probability distribution. The probability distribution of a statistic is called a sampling distribution.
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
Central Limit Theorem. If $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ taken from a population (either finite or infinite) with mean $\mu$ and finite variance $\sigma^2$, and if $\bar{X}$ is the sample mean, then the limiting form of the distribution of
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
as $n \to \infty$, is the standard normal distribution.
Figure 2.1: The graph of the standard normal distribution
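The Central Limit Theorem can be illustrated with a small simulation. The sketch below is our own illustration rather than an experiment from the thesis: it assumes a Uniform(0, 1) population (mean 1/2, variance 1/12), a sample size of 30, and 5000 repetitions, all chosen arbitrarily. If the theorem applies, the standardized sample means should have mean close to 0, standard deviation close to 1, and roughly 95% of them should fall within ±1.96.

```python
import math
import random
import statistics


def standardized_sample_mean(n, rng):
    """Draw n values from Uniform(0, 1) and return Z = (x_bar - mu) / (sigma / sqrt(n))."""
    mu = 0.5                       # mean of Uniform(0, 1)
    sigma = math.sqrt(1.0 / 12.0)  # standard deviation of Uniform(0, 1)
    xs = [rng.random() for _ in range(n)]
    x_bar = sum(xs) / n
    return (x_bar - mu) / (sigma / math.sqrt(n))


rng = random.Random(0)  # fixed seed so the run is reproducible
zs = [standardized_sample_mean(30, rng) for _ in range(5000)]

z_mean = statistics.fmean(zs)
z_std = statistics.stdev(zs)
frac_within_1_96 = sum(abs(z) <= 1.96 for z in zs) / len(zs)
print(z_mean, z_std, frac_within_1_96)
```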
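The t distribution. Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$. The random variable
$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$
has a t distribution with $n - 1$ degrees of freedom.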
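Confidence interval on the population mean, variance known. If $\bar{x}$ is the sample mean of a random sample of size $n$ from a normal population with known variance $\sigma^2$, a $100(1 - \alpha)\%$ confidence interval on $\mu$ is given by
$$\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
where $z_{\alpha/2}$ is the upper $100\alpha/2$ percentage point of the standard normal distribution.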
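As a worked instance of the known-variance interval, the sketch below computes the interval with the Python standard library. The function name `mean_confidence_interval` and the numbers (a sample mean of 10 from 25 observations with known standard deviation 2) are hypothetical values chosen for illustration only.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(x_bar, sigma, n, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval on the mean when sigma is known.

    Illustrative helper (hypothetical name). z is the upper 100*alpha/2
    percentage point of the standard normal distribution.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical numbers: n = 25 observations, sample mean 10, known sigma = 2
low, high = mean_confidence_interval(x_bar=10.0, sigma=2.0, n=25, alpha=0.05)
print(round(low, 3), round(high, 3))
```

For a 95% interval the multiplier is about 1.96, so the half-width here is roughly 1.96 × 2 / 5 ≈ 0.784.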
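Confidence interval on the population mean, variance unknown. If $\bar{x}$ and $s$ are the mean and standard deviation of a random sample of size $n$ from a normal distribution with unknown variance $\sigma^2$, a $100(1 - \alpha)\%$ confidence interval on $\mu$ is given by
$$\bar{x} - t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$
where $t_{\alpha/2,\,n-1}$ is the upper $100\alpha/2$ percentage point of the t distribution with $n - 1$ degrees of freedom.
2.2 Machine Learning Basics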
Many works described in this text are deep learning techniques, which are a particular type of machine learning. Understanding basic machine learning concepts is crucial for discussing deep learning as well as few-shot learning algorithms.
A machine learning algorithm is an algorithm that is able to learn from data. A particular task in machine learning can be defined by two sets, a training set and a test set, corresponding to two stages. In the first stage, training time, the training set is given to the model. The model aims to learn from the training set so that in the second stage, test time, it performs well on the test set. A performance metric is used to evaluate how good the model is on a particular dataset. The literature also uses the term ’validation set’. This set is used to evaluate the model's performance before it is actually tested on the test set. Feedback from the validation set also helps tune the hyperparameters of a model and avoid overfitting.
2.2.1 Supervised Learning
In supervised learning problems, the training set is a collection of pairs, each consisting of an input data point and its associated target, or label. Supervised learning aims to learn a function that accurately predicts the targets of novel data points.
Formally, given a training dataset $D = \{(x_i, y_i),\ i = 1, 2, \ldots\}$, the task is to learn a model that produces a prediction $y_{predict}$ for an unseen data point $x^*$ as accurately as possible. The quality of a prediction is measured by a loss function $L(y_{predict}, y_{true})$ of the prediction $y_{predict}$ and the ’ground truth’ target $y_{true}$ associated with $x^*$.
In some cases, the predictions are based on a conditional distribution $p(y \mid x, D)$. The task is then to model this conditional distribution. This is referred to as probabilistic supervised learning.
Classification
Classification is one of the most widespread problems in supervised learning. The task is to classify an input data point into one of the given categories. The set of categories is discrete. Formally, each data point is associated with a target drawn from the set $\{1, 2, \ldots, k\}$ of $k$ categories, where the integers serve only as category indices. The model is asked to predict the category of the given input. From a probabilistic perspective, the model produces a probability distribution over categories given the training set and the specified data point. An example of classification is recognizing the digit in an image of a handwritten number [16].
Accuracy. Accuracy is one way to evaluate the performance of a classification model. It is the number of correct predictions divided by the total number of predictions. For instance, let $y_{predict} = [1, 1, 1, 0]$ be the predictions for four samples whose labels are $y_{true} = [0, 0, 1, 1]$. Then the accuracy is 25%, since there is one correct prediction. Accuracy gives a quick overall view of the model’s performance. However, it gives no insight into how the model performs on each class.
Figure 2.2: MNIST digit dataset [16].
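The accuracy example can be reproduced in a few lines. This is a minimal sketch with a hypothetical helper name, `accuracy`; it simply counts positions where the prediction equals the label.

```python
def accuracy(y_predict, y_true):
    """Fraction of predictions that match their labels (illustrative helper)."""
    if len(y_predict) != len(y_true) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(y_predict, y_true))
    return correct / len(y_true)


# The running example: only the third sample is predicted correctly
acc = accuracy([1, 1, 1, 0], [0, 0, 1, 1])
print(acc)  # 0.25
```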
Confusion Matrix. A confusion matrix gives more detail about a model’s performance. More specifically, it records how the samples of each class are predicted by the model. For example, $y_{predict} = [1, 1, 1, 0]$ and $y_{true} = [0, 0, 1, 1]$ give the following confusion matrix:
            Predicted as 0   Predicted as 1
Actual 0          0                2
Actual 1          1                1
Table 2.1: Confusion matrix. Confusion matrix with $y_{predict} = [1, 1, 1, 0]$ and $y_{true} = [0, 0, 1, 1]$.
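The counts for the running example can be computed directly from the two label vectors. The sketch below assumes the convention that rows index the true class and columns index the predicted class; the function name `confusion_matrix` is our own and does not refer to any particular library.

```python
def confusion_matrix(y_predict, y_true, num_classes):
    """Build a num_classes x num_classes table of counts.

    Assumed convention: matrix[t][p] counts samples whose true class is t
    and whose predicted class is p.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(y_predict, y_true):
        matrix[t][p] += 1
    return matrix


cm = confusion_matrix([1, 1, 1, 0], [0, 0, 1, 1], num_classes=2)

# Accuracy is the diagonal sum divided by the sum of all entries
diagonal = sum(cm[i][i] for i in range(2))
total = sum(sum(row) for row in cm)
print(cm, diagonal / total)
```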
The accuracy can be derived as the sum of the numbers on the diagonal of the confusion matrix divided by the sum of all the numbers in the matrix. Other metrics can also be derived from the confusion matrix. Typically, one is more interested in one class than in the others. In such a case, we define that class as the positive class, whereas all the other classes are defined as negative. For example, for a coronavirus test, if it predicts that a person has coronavirus, then the test is ‘positive’. Otherwise, it is ‘negative’. We have the following quantities:
• The test correctly predicts that a person has coronavirus (True Positive).
• The test incorrectly predicts that a person has coronavirus (False Positive).
• The test correctly predicts that a person does not have coronavirus (True Negative).
• The test incorrectly predicts that a person does not have coronavirus (False Negative).
Table 2.2 shows these quantities in a confusion matrix. Precision and recall are two metrics used to evaluate this test. More specifically, precision is defined as $\frac{TP}{TP + FP}$ and recall is defined as $\frac{TP}{TP + FN}$. Both precision and recall range from 0 to 1. It is crucial not to miss any infected person, so we need the test to have high recall. On the other hand, wrongly classifying a person as infected wastes medical resources, so we also want the precision to be high. A good test has both high precision and high recall, which is not always achievable in practice. There is often a trade-off between precision and recall, and the right balance depends on the underlying task.
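To make the two definitions concrete, the sketch below evaluates them on the earlier toy predictions $y_{predict} = [1, 1, 1, 0]$ and $y_{true} = [0, 0, 1, 1]$, treating class 1 as the positive class. The helper names are hypothetical, and returning 0.0 when a denominator is zero is a convention we assume here, not something the definitions prescribe.

```python
def count_outcomes(y_predict, y_true, positive=1):
    """Count true positives, false positives, and false negatives."""
    tp = sum(p == positive and t == positive for p, t in zip(y_predict, y_true))
    fp = sum(p == positive and t != positive for p, t in zip(y_predict, y_true))
    fn = sum(p != positive and t == positive for p, t in zip(y_predict, y_true))
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """precision = TP / (TP + FP), recall = TP / (TP + FN).

    Assumed convention: a metric is reported as 0.0 if its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


tp, fp, fn = count_outcomes([1, 1, 1, 0], [0, 0, 1, 1], positive=1)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, precision, recall)
```

Here one of the three positive predictions is correct (precision 1/3), and one of the two actual positives is found (recall 1/2).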
                                        The person tested positive   The person tested negative
The person has coronavirus              True Positive (TP)           False Negative (FN)
The person does not have coronavirus    False Positive (FP)          True Negative (TN)
Table 2.2: Quantities in Confusion Matrix of Testing for Coronavirus. Illustration of the different quantities in a confusion matrix.
Regression
This class of tasks is quite similar to classification, except that the output is now a real value. The model has to approximate the target corresponding to a given input data point. Formally, the learning algorithm is asked to produce a function $f: \mathbb{R}^n \to \mathbb{R}$, where $n$ is the dimension of the input $x$. From a probabilistic perspective, the learning algorithm might output a probability density function over $y$.
2.2.2 Unsupervised Learning
In unsupervised learning, the data points are provided without corresponding targets. The model aims to extract compact information from the data. By extracting information, we mean learning from the distribution of the given data. Examples of unsupervised learning tasks are density estimation, anomaly detection, clustering, and generating new data points from the data distribution. Given a dataset $D = \{x_i,\ i = 1, 2, \ldots\}$, probabilistic unsupervised learning aims to model the distribution $p(x)$ over data points.
a large number of labeled samples of $N_b$ base classes and $D_n^s$ contains $N$ novel classes, each having $K$ samples. Given $D_b$ and $D_n^s$, the aim of few-shot learning is to adapt the model to $D_n^s$ after it has been well trained on $D_b$. Another set $D_n^q$ of query samples drawn from the novel classes is used to evaluate the generalization ability of a few-shot learning model. Such a configuration is called an N-way K-shot task. $D_n^q$ is called the query set.
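An N-way K-shot task can be built by sampling classes and then splitting the samples of each class into a support part and a query part. The sketch below is a generic illustration under our own assumptions: the function name `sample_episode`, the `n_query` parameter (number of query samples per class), and the toy dataset of string identifiers are all hypothetical and not taken from the thesis code.

```python
import random


def sample_episode(samples_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task from a dict {class_label: [samples]}.

    Returns (support, query) as lists of (sample, label) pairs. The support
    and query samples of a class are disjoint.
    """
    classes = rng.sample(sorted(samples_by_class), n_way)  # pick N novel classes
    support, query = [], []
    for label in classes:
        picked = rng.sample(samples_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]  # K samples per class
        query += [(x, label) for x in picked[k_shot:]]    # held out for evaluation
    return support, query


# Toy data: 10 classes with 20 placeholder "images" each (hypothetical names)
toy_novel_data = {
    f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)
}
rng = random.Random(0)
support_set, query_set = sample_episode(
    toy_novel_data, n_way=5, k_shot=1, n_query=3, rng=rng
)
print(len(support_set), len(query_set))
```

With `n_way=5` and `k_shot=1` this produces a 5-way 1-shot task: five support samples and, under the assumed `n_query=3`, fifteen query samples.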
Few-shot learning was proposed to tackle the problem of data scarcity, and meta-learning is the most promising approach. Most meta-learning algorithms can be classified into black-box adaptation, optimization-based inference, and non-parametric methods. We briefly describe meta-learning and some of its terminology.
Meta-learning. Machine learning algorithms deal with tasks, e.g., classification, regression, anomaly detection, or sampling from a distribution. A task usually consists of a training set, a test set, and a performance measure. For example, the training set and test set of an object recognition task are sets of images and labels. The two sets have no images in common. The labels of the test set are used for evaluating machine learning algorithms. The performance measure is the accuracy of predicting categories for the test set. Let us denote the training set and the test set as $D^{tr}$ and $D^{ts}$, respectively. In this text, we mostly consider supervised tasks, where $D^{tr} = \{(x_i, y_i),\ i = 1, 2, \ldots, k\}$ and $D^{ts} = \{(x_i, y_i),\ i = 1, 2, \ldots, l\}$.
Meta-learning does not treat tasks independently. Instead, meta-learning accumulates experience from prior tasks to quickly adapt to a new one. The process mimics human behavior, as we do not learn tasks from scratch. Formally, given meta-training data $D_{meta\text{-}train} = \{(D_i^{tr}, D_i^{ts}),\ i = 1, 2, \ldots, k\}$, the meta-learning model learns to perform well on meta-test data $D_{meta\text{-}test} = \{(D_i^{tr}, D_i^{ts}),\ i = 1, 2, \ldots, l\}$. There is an analogy between meta-learning and standard supervised learning. Supervised learning learns to predict the target $y$ given the data point $x$, whereas meta-learning learns to perform well on $D^{ts}$ given $D^{tr}$.
In the paradigm of meta-learning, a component called the “learner” learns the new tasks, and another component called the “meta-learner” trains the learner. As a result, there are two optimizations, corresponding to the two components. Some literature refers to the optimization of the meta-learner as the outer loop and the optimization of the learner as the inner loop, or adaptation.
2.4 Object Detection
In some contexts, we want not only to correctly classify images into predefined classes but also to precisely localize the regions that contain one or multiple objects within the image. The recent evolution of deep learning has sparked a wave of research on this topic, which is now well defined as object detection. Object detection can be further divided into two sub-topics, namely general object detection and its applications. General object detection explores new general approaches, algorithms, and techniques for improving a model’s localization and classification ability. Research into applications spans a large number of areas, such as pedestrian detection, car detection, and face detection. Object detection, along with image classification, is one of the most critical problems in modern visual understanding. Their applications are related to many downstream tasks, including autonomous driving, object tracking, and medical applications.
Object detection consists of two subtasks, namely object localization and object classification. Given an image of objects, the object detection model aims to predict the bounding boxes that localize the objects in the image and to associate them with class predictions.
In object detection, the model is usually trained using data annotated with bounding boxes and class labels. A single task typically has a specific set of object categories. The bounding box of an object within an image is defined by its offsets relative to the size of the image. The class label of an object is encoded as a numerical index.
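The annotation encoding described above can be sketched as follows. We assume one common convention, used for example by YOLO-style label files, in which a box is stored as (x_center, y_center, width, height) normalized by the image size; the thesis text itself only states that boxes are relative to the image size, so the exact layout is an assumption. The class names and the pixel coordinates are hypothetical.

```python
def to_relative_box(x_min, y_min, x_max, y_max, image_width, image_height):
    """Convert a pixel-space corner box to (x_center, y_center, width, height).

    All four outputs are fractions of the image size, so they lie in [0, 1]
    for a box that is fully inside the image. (Assumed YOLO-style layout.)
    """
    x_center = (x_min + x_max) / 2.0 / image_width
    y_center = (y_min + y_max) / 2.0 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return x_center, y_center, width, height


# Hypothetical category list: the class label is stored as its numerical index
CLASS_NAMES = ["awake", "drowsy"]

# A face annotated at pixels (160, 120)-(480, 360) in a 640x480 image
label = (CLASS_NAMES.index("drowsy"),) + to_relative_box(160, 120, 480, 360, 640, 480)
print(label)  # (1, 0.5, 0.5, 0.5, 0.5)
```

Storing boxes relative to the image size keeps the annotation valid when the image is resized, which is convenient when training at several resolutions.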
3.1.1 Meta-Learning
In the recent few-shot learning literature, gradient-based meta-learning methods are referred to as meta-learning, whereas metric learning is classified as a non-parametric meta-learning technique. For more detail on meta-learning, see [8].
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
In [7], the authors proposed to learn a set of model parameters such that a small number of gradient descent iterations on a new task produces good performance on that task. Intuitively, the model is trained to an initial state from which, for each new task, good performance can be obtained by fine-tuning on a small amount of data.
Figure 3.1: MAML algorithms (figure from [7]). During meta-training, a large number of batches is generated, each consisting of multiple tasks sampled from $p(\mathcal{T})$. For each task $\mathcal{T}_i$ in the batch, the set $\theta_i$ of task-specific parameters is derived by fine-tuning the model parameters $\theta$ on the training set of that task. The task-specific parameters are then evaluated on the task’s test set. Finally, the loss accumulated over the batch is used to update the model parameters $\theta$.
In effect, for a parametric model $f_\theta$, MAML aims to learn a single set of parameters $\theta^*$ that is 'close' to all the task-specific parameters. Starting from such a set of parameters, the model is able to quickly adapt to novel tasks with a few optimization steps.
The work is compatible with a range of models trained with gradient descent and applicable to many applications. However, models that learn to adapt with MAML tend to be unstable, and it is hard to achieve good performance with them [2].
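The inner and outer loops described above can be sketched on a toy problem. The example below assumes a scalar model $y = wx$, a hypothetical task distribution of random slopes, a single inner gradient step, and hand-picked learning rates; none of these come from [7]. For this model the gradient through the inner step can be written in closed form, so no automatic differentiation is needed.

```python
import random


def loss(w, data):
    """Mean squared error of the scalar model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def loss_grad(w, data):
    """Derivative of the mean squared error w.r.t. w."""
    return sum(2.0 * x * (w * x - y) for x, y in data) / len(data)


def hessian(data):
    """Second derivative of the loss w.r.t. w (constant for this model)."""
    return sum(2.0 * x * x for x, _ in data) / len(data)


def sample_task(rng, n=5):
    """Hypothetical task distribution p(T): y = a * x with a random slope a."""
    a = rng.uniform(0.5, 1.5)
    points = [(x, a * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(2 * n))]
    return points[:n], points[n:]  # (training set, test set) of the task


def maml_step(w, tasks, inner_lr=0.1, outer_lr=0.05):
    meta_grad = 0.0
    for train, test in tasks:
        # Inner loop: one gradient step on the task's training set.
        w_i = w - inner_lr * loss_grad(w, train)
        # Outer gradient through the inner step (chain rule):
        # d L_test(w_i) / d w = (1 - inner_lr * H_train) * L_test'(w_i)
        meta_grad += (1.0 - inner_lr * hessian(train)) * loss_grad(w_i, test)
    # Outer loop: update the shared initialization with the batch gradient.
    return w - outer_lr * meta_grad / len(tasks)


rng = random.Random(0)
w = 0.0
for _ in range(500):
    w = maml_step(w, [sample_task(rng) for _ in range(4)])
print(round(w, 2))
```

With deep networks the factor `(1 - inner_lr * H_train)` becomes a Hessian-vector product, which is what makes the full method costly.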
Meta-learning with latent embedding optimization
In MAML, the learner space and the meta-learner space are the same. As a result, the inner loop as well as the outer loop has to be performed in a high-dimensional parameter space, which is inefficient. In Meta-learning with latent embedding optimization (LEO) [33], the authors relax this limitation by introducing a latent space.
The authors argue that it is beneficial to relax the assumption that there exists a single optimal $\theta^*$ and to replace such a $\theta$ with a generative model in a low-dimensional space, more specifically, a task-dependent conditional probability distribution over $\theta$. The adaptation steps are now performed in the low-dimensional latent space, and the task-specific parameters are sampled from the generative model.
Figure 3.2: LEO algorithm (figure from [33]). Instead of maintaining high-dimensional parameters $\theta$ to capture information from the task distribution as in MAML, LEO proposes a low-dimensional generative model, conditioned on the underlying task through an encoding technique. The adaptation steps are now performed in this latent space. Finally, the task-specific parameters are sampled from the generative model.
Although LEO relaxes one difficulty of MAML and other parametric meta-learning algorithms, namely the high-dimensional optimization space, it is still unstable, and good performance is hard to achieve in practice. In the LEO work, the authors have to tune the model hyperparameters carefully.
Meta-SGD: Learning to Learn Quickly for Few-Shot Learning
Meta-SGD [18] is closely related to MAML, but it has 'higher capacity'. Meta-SGD learns not only the initial parameters of the model but also the update direction and the learning rate of the adaptation procedure. In the beginning, the model parameters and the learning rate of the adaptation steps are randomly initialized. The inner loop is mostly the same as in MAML. However, in the outer loop, the parameters as well as the learning rate are updated according to the loss of the batch.
Figure 3.3: Meta-SGD algorithm (figure from [18]). The learning rate of the adaptation (inner-loop) steps is updated along with the model parameters during meta-training.
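To make the difference from MAML concrete, the following sketch meta-learns both an initialization `w` and an inner-loop learning rate `alpha`. It assumes a scalar model $y = wx$, a hypothetical task distribution, and a scalar `alpha` (Meta-SGD learns one rate per parameter); the initial values are fixed rather than random for reproducibility. It illustrates the update rule only and is not the implementation of [18].

```python
import random


def grad(w, data):
    """Derivative of the mean squared error of y = w * x w.r.t. w."""
    return sum(2.0 * x * (w * x - y) for x, y in data) / len(data)


def curvature(data):
    """Second derivative of the same loss w.r.t. w."""
    return sum(2.0 * x * x for x, _ in data) / len(data)


def sample_task(rng, n=5):
    """Hypothetical task: y = a * x with a random slope a."""
    a = rng.uniform(0.5, 1.5)
    points = [(x, a * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(2 * n))]
    return points[:n], points[n:]  # (training set, test set)


def meta_sgd_step(w, alpha, tasks, outer_lr=0.05):
    g_w, g_alpha = 0.0, 0.0
    for train, test in tasks:
        g_train = grad(w, train)
        w_i = w - alpha * g_train            # inner step with the learned rate
        g_test = grad(w_i, test)
        # Chain rule through the inner step:
        g_w += (1.0 - alpha * curvature(train)) * g_test   # d L_test / d w
        g_alpha += -g_train * g_test                       # d L_test / d alpha
    n = len(tasks)
    # Outer loop: both the initialization and the learning rate are updated.
    return w - outer_lr * g_w / n, alpha - outer_lr * g_alpha / n


rng = random.Random(0)
w, alpha = 0.0, 0.01
for _ in range(300):
    w, alpha = meta_sgd_step(w, alpha, [sample_task(rng) for _ in range(4)])
print(round(w, 2), round(alpha, 2))
```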
3.1.2 Metrics-Learning
3.1.3 Boosting few-shot visual learning with self-supervision
In principle, the work in [9] is an add-on module that is applicable to various methods. Because a model trained on base categories tends to overfit, the authors introduce a self-supervised task alongside the classification task as a form of auxiliary loss.
Figure 3.4: Figure from [9]. At training time, the network is trained with standard supervised learning to predict object labels as well as rotation labels for images.
The best described combination is as follows. At training time, the model is trained in a standard supervised manner. Specifically, a number of batches of images is sampled from the base training set $D_b$. For each batch $B = \{(x_i, y_i),\ i = 1, 2, \dots, m\}$, the four rotations in $R = \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$ are applied to each image. The resulting batch is then
The training loss is now
At test time, the classification weights for novel categories are computed as in (3.5).
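The rotation step used at training time can be sketched as follows. Images are represented as nested lists, the rotation direction is taken to be counter-clockwise, and the tuple layout `(image, class label, rotation index)` is a hypothetical choice for illustration; the auxiliary loss itself is not reproduced here.

```python
def rot90(img):
    """Rotate a 2D image (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def rotation_batch(batch):
    """Expand (image, label) pairs into (rotated image, label, rotation index)."""
    out = []
    for img, y in batch:
        rotated = [list(row) for row in img]
        for r in range(4):  # r = 0, 1, 2, 3 stands for 0, 90, 180, 270 degrees
            out.append((rotated, y, r))
            rotated = rot90(rotated)
    return out


batch = [([[1, 2], [3, 4]], 7)]
for image, label, rotation in rotation_batch(batch):
    print(rotation, label, image)
```

Each original sample yields four training samples, so a batch of $m$ images grows to $4m$, and the rotation index serves as the self-supervised label.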
Prototypical networks for few-shot learning
Prototypical networks [36] aim to solve the few-shot classification problem. This is a non-parametric method, i.e., the adaptation steps are 'computing steps' rather than 'optimization steps'. Prototypical networks aim to learn a good feature extractor $f_\theta$ which has good generalization performance at test time. For an N-way K-shot support set $D_s \subset I_N \times Y_N$, where $I_N$ is the set of images for the $N$ specified classes and $Y_N = \{1, 2, \dots, N\}$, the algorithm first feeds all the images into the feature extractor $f_\theta$. The 'prototype' for each class $i \in Y_N$ is computed as the average of the extracted features of the sample images of that class:
$$p_i = \frac{1}{K} \sum_{x \in I_i} f_\theta(x), \tag{3.5}$$
where $I_i = \{x \mid (x, y) \in D_s,\ y = i\}$.
The normalized score for each class $i$, given a query image $x_q$, is then
$$C_i(f_\theta(x_q), D_s) = \frac{\exp(\gamma\, \mathrm{sim}(f_\theta(x_q), p_i))}{\sum_{j \in Y_N} \exp(\gamma\, \mathrm{sim}(f_\theta(x_q), p_j))}, \tag{3.6}$$
where $\mathrm{sim}(\cdot, \cdot)$ is a similarity function and $\gamma$ is the inverse temperature. The authors investigate the use of cosine similarity and negative squared Euclidean distance as the similarity function in (3.6).
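A small sketch of the prototype and score computations is given below. It assumes that the features have already been produced by the feature extractor (toy 2-D vectors stand in for them) and uses the negative squared Euclidean distance as the similarity; the function names are illustrative.

```python
import math


def prototypes(support):
    """Average the feature vectors of each class; support is [(feature, label)]."""
    sums, counts = {}, {}
    for feat, y in support:
        acc = sums.setdefault(y, [0.0] * len(feat))
        for d, v in enumerate(feat):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def neg_sq_euclidean(a, b):
    return -sum((u - v) ** 2 for u, v in zip(a, b))


def class_scores(query, protos, gamma=1.0, sim=neg_sq_euclidean):
    """Softmax over gamma * sim(query, prototype) across all classes."""
    logits = {y: gamma * sim(query, p) for y, p in protos.items()}
    m = max(logits.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(v - m) for y, v in logits.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


# Toy 2-way 2-shot support set of already-extracted features.
support = [([0.0, 0.0], 1), ([0.0, 2.0], 1), ([4.0, 4.0], 2), ([6.0, 4.0], 2)]
protos = prototypes(support)
scores = class_scores([1.0, 1.0], protos)
print(protos, scores)
```

No parameter is optimized at adaptation time: a new class only requires averaging its support features.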
The loss at training time for each batch is then
Figure 3.5: Relation Network (figure from [37]). The overall model consists of two modules, namely the embedding module and the relation module. The embedding module maps input images to fixed-size feature vectors. The relation module acts as a learnable similarity function. The classification is performed in a nearest-neighbor manner.
Learning to compare: Relation network for few-shot learning
The overall algorithm of the Relation network [37] is largely the same as that of the Prototypical network. The authors propose to use a deep distance function rather than the standard squared Euclidean distance. The deep distance metric is learned along with the feature extractor $f_\theta$ during meta-training. More formally, let $f_\theta$ and $g_\phi$ be the feature extractor and the 'relation module', respectively. For an episode $D_s \subset I_N \times Y_N$, the relation score of a query image $x_q$ with respect to a class $i$ is
$$C_i(f_\theta(x_q), D_s) = g_\phi(f_\theta(x_q), p_i). \tag{3.8}$$
The authors propose to regress the relation score to the ground truth in a one-hot coding scheme. The loss is then
Dynamic few-shot visual learning without forgetting
In [10], the authors do not compute prototypes for base categories during meta-training. Instead, they maintain a set of base weights $W_b = \{w_1, \dots, w_{N_b}\}$. At meta-training time, the base weights take the place of the prototypes $p_i$ and are trained along with the feature extractor $f_\theta$. The score of a query image $x_q$ with respect to a class $i$ is
$$C_i(f_\theta(x_q); W_b) = \frac{\exp(\gamma \cos(f_\theta(x_q), w_i))}{\sum_{w \in W_b} \exp(\gamma \cos(f_\theta(x_q), w))}. \tag{3.10}$$
Figure 3.6: Attention-based weight generator (figure from [18]). The model consists of a feature extractor, a classifier, and a few-shot classification weight generator. All are trained simultaneously. At test time, the classification weight generator takes base categories and novel images from the support set as input and produces the classification weights for novel classes.
At test time, the prototypes or weights for novel categories are derived from an attention-based weight inference procedure. The detailed form of the weight for a typical novel class is as follows:
$$w = \phi_{avg} \odot w_{avg} + \phi_{att} \odot w_{att}, \tag{3.11}$$
where $\odot$ denotes the element-wise product; $w_{avg}$ is computed as the prototype in (3.5); $w_{att}$ is an attention-based weight pooled from the base weights; $\phi_{avg}$ and $\phi_{att}$ are learnable weight vectors. Finally, the category inference step is the same as in the Prototypical network.
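The combination in (3.11) and the cosine-based score in (3.10) can be illustrated with toy vectors. In the sketch below the attention-pooled weight `w_att` is simply given as an input, since the attention mechanism itself is not reproduced; all numerical values, the value of `gamma`, and the choice to score the query against the base weights extended with the new weight are assumptions for illustration.

```python
import math


def cosine(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    return dot / (norm_a * norm_b)


def novel_weight(w_avg, w_att, phi_avg, phi_att):
    """Element-wise combination of the averaged and attention-based weights."""
    return [pa * a + pt * t for a, t, pa, pt in zip(w_avg, w_att, phi_avg, phi_att)]


def cosine_scores(feature, weights, gamma=10.0):
    """Softmax over gamma * cos(feature, w) for every class weight w."""
    logits = [gamma * cosine(feature, w) for w in weights]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


base_weights = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-in for W_b
w_new = novel_weight(w_avg=[1.0, 1.0], w_att=[0.5, 0.0],
                     phi_avg=[1.0, 1.0], phi_att=[2.0, 2.0])
scores = cosine_scores([1.0, 0.9], base_weights + [w_new])
print(w_new, scores)
```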
The authors argue that the attention mechanism can help the network explicitly capture prior knowledge and transfer it to novel classes at test time, hence improving the performance.
3.2 Object Detection
There are multiple ways to divide recent object detection approaches into groups. One could split these approaches into regression-based detectors and sliding-window detectors. Another way is to classify object detection methods by their number of stages. In this text, we adopt the latter.
3.2.1 Two-stage Detectors
Rich feature hierarchies for accurate object detection and semantic segmentation
This work [12], also known as R-CNN, is one of the most popular deep learning approaches to object detection. At the time, convolutional neural networks (CNNs) had attracted huge attention since [15] achieved state-of-the-art results on image classification benchmarks using a CNN. R-CNN is the result of combining region proposals with CNNs.
Instead of exhaustively investigating all bounding boxes within an image, the authors use a search algorithm to produce a manageable number of about 2000 regions. These regions are called region proposals. A feature extractor then takes each proposal as input and produces a corresponding feature map. Finally, an SVM is adopted to classify the regions.
Figure 3.7: R-CNN (figure from [12]). In R-CNN, an embedding network extracts features for each of the 2000 region proposals within the underlying image. An SVM is then adopted to predict the class label for each feature vector.
Fast R-CNN
In [11], the authors propose a method called Fast Region-based Convolutional Network, or Fast R-CNN, to tackle object detection. The work focuses on improving training and testing speed, and is reported to be nine times faster than R-CNN at training and 213 times faster at testing. The authors point out three significant drawbacks of R-CNN: its multi-stage training pipeline, the complexity of training, and its slow detection speed.
Figure 3.8: Fast R-CNN (figure from [11]). A convolutional network takes an entire image and several object proposals as input. The resulting feature map is then ROI pooled to produce a set of fixed-size vectors, which are then post-processed into class probabilities and bounding box offsets.
Fast R-CNN first feeds an image and a set of object proposals to a convolution-based feature extractor. The output feature map is then ROI pooled into a fixed-size feature vector for each object proposal. Finally, a few fully connected layers output class probabilities and bounding box offsets.
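The ROI pooling step can be sketched on a single-channel feature map. The snippet below assumes that the region is already given in feature-map coordinates with exclusive end indices, and that each output bin takes the maximum over its (possibly overlapping) sub-window; real implementations operate on multi-channel tensors and also rescale the proposals from image coordinates.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (r0, c0, r1, c1), end-exclusive, of a 2D
    feature map into a fixed out_h x out_w grid."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Bin i covers rows floor(i*h/out_h) .. ceil((i+1)*h/out_h) of the ROI.
        rs = r0 + (i * h) // out_h
        re = r0 + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = c0 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(roi_max_pool(fmap, (0, 0, 4, 4)))
print(roi_max_pool(fmap, (1, 1, 4, 4)))
```

Whatever the size of the proposal, the output has the same shape, which is what allows the subsequent fully connected layers to be shared across proposals.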
Faster R-CNN: Towards real-time object detection with region proposal networks
[31] proposes an even faster version of R-CNN in which the region proposal stage is learnable. Prior works utilize a region proposal algorithm and build deep models on top of it. However, the region proposal algorithm turns out to be a bottleneck.
Figure 3.9: Faster R-CNN (figure from [11]). The region proposals are produced by a learnable module, which significantly improves time performance.
In this paper, the authors propose to leverage a shared convolution-based model for region proposals and feature extraction. More specifically, an embedding network extracts features from an input image. The resulting feature map is then fed to a region predictor, producing region proposals for the image. Finally, the original feature map is ROI pooled according to the region proposals from the predictor.
3.2.2 One-stage Detectors
You only look once: Unified, real-time object detection
YOLO [28] introduces the use of a single CNN model for object detection. The authors formulate object detection as a regression problem in which bounding box offsets and class probabilities are inferred directly in one step. Specifically, the input image is first divided into an $S \times S$ grid. A patch is said to be responsible for detecting an object if the center of the object lies within the patch. Each patch predicts $B$ bounding boxes at the same time; each of them is defined by five values indicating the center coordinates, width, height, and confidence score. The confidence score is defined as $P(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$, where $P(\text{Object})$ is the probability that the bounding box contains an object and $\text{IOU}^{\text{truth}}_{\text{pred}}$ is the IOU of the predicted bounding box and the ground-truth one. Each patch also predicts $C$ conditional class probabilities, given that the patch contains an object. As a result, the output of the network has a size of $S \times S \times (B \times 5 + C)$. The limitation of this framework is that it can only predict one object per grid patch. If the centers of two or more objects fall within the same patch, the network will be confused.
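The quantities above can be made concrete with a short sketch. The corner-format boxes, the helper names, and the sample values ($S = 7$, $B = 2$, $C = 20$ on a 448 by 448 image) are chosen here for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(box, img_w, img_h, S):
    """Grid cell (row, col) that contains the center of the box."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def output_shape(S, B, C):
    """Each cell predicts B boxes with 5 values each, plus C class probabilities."""
    return (S, S, B * 5 + C)


truth = (100.0, 100.0, 300.0, 300.0)
pred = (150.0, 150.0, 350.0, 350.0)
confidence_target = 1.0 * iou(pred, truth)  # P(Object) = 1 in the responsible cell
print(responsible_cell(truth, 448, 448, S=7), output_shape(7, 2, 20), confidence_target)
```

Two objects whose centers map to the same `(row, col)` compete for the same cell, which is the limitation noted above.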
Figure 3.10: YOLO model (figure from [28]). The input image is divided into an $S \times S$ grid. Each grid cell predicts $B$ bounding boxes by producing their offsets and confidence scores, and predicts $C$ conditional class probabilities. Finally, non-maximum suppression is used to reduce multiple detections of the same object.
YOLO9000: Better, Faster, Stronger
[29] introduces an incrementally improved version of YOLO, i.e., YOLOv2. More specifically, in this work, the authors present a series of small modifications that lead to improvements over the first version of YOLO. Some key changes are a high-resolution classifier, convolutional with anchor boxes, dimension clusters, direct location prediction, fine-grained features, and multi-scale training. They also propose a joint training scheme that can leverage weakly supervised information from classification datasets to further improve the model's capacity.
SSD: Single shot multibox detector
[20] introduces a set of default boxes with different shapes and aspect ratios into the detection procedure rather than inferring region proposals. The network is trained to predict the presence of given classes within each default box and to produce adjustments for accurate final bounding boxes. The final decisions are produced by a non-maximum suppression step.
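A greedy form of non-maximum suppression can be sketched as follows. The IoU threshold of 0.5, the corner-format boxes, and the class-agnostic treatment are assumptions made for this illustration; detectors typically apply the step separately for each class.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by decreasing score and keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep  # indices of the retained boxes, highest score first


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```

In this toy input the second box overlaps the first with an IoU of 81/119 and is therefore suppressed, while the distant third box is kept.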