Graduation Thesis: Exploiting Language Models for Few-Shot Object Detection


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

FACULTY OF COMPUTER SCIENCE

PHAM NHAT HOANG

NGUYEN TRAN TIEN

BACHELOR OF SCIENCE IN COMPUTER SCIENCE

HONOR PROGRAM

HO CHI MINH CITY, 2024


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

FACULTY OF COMPUTER SCIENCE

PHAM NHAT HOANG - 20520052

NGUYEN TRAN TIEN - 20521011

BACHELOR OF SCIENCE IN COMPUTER SCIENCE

HONOR PROGRAM

THESIS ADVISOR

Dr NGUYEN VINH TIEP

HO CHI MINH CITY, 2024


The Assessment Committee is established under Decision ... of the President of the University of Information Technology:

1. ... - Chairman

2. ... - Secretary

3. ... - Member

4. ... - Member


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY    SOCIALIST REPUBLIC OF VIETNAM

UNIVERSITY OF INFORMATION TECHNOLOGY    Independence - Freedom - Happiness

Ho Chi Minh City, date ... month ... year ...

THESIS EVALUATION

(BY THE THESIS ADVISOR)

Thesis title:

EXPLOITING LANGUAGE MODELS FOR FEW-SHOT OBJECT DETECTION

Students:    Advisor:

Phạm Nhật Hoàng 20520052    Dr. Nguyễn Vinh Tiệp

Nguyễn Trần Tiến 20521011

Thesis evaluation

1. About the report:

Number of pages: ...    Number of chapters: ...

Number of data tables: ...    Number of figures: ...

Number of references: ...    Products: ...

Some comments on the format of the report:

3. About the application program:

Reviewer

(Signature and full name)


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY    SOCIALIST REPUBLIC OF VIETNAM

UNIVERSITY OF INFORMATION TECHNOLOGY

Ho Chi Minh City, date ... month ... year ...

THESIS EVALUATION

(BY THE THESIS REVIEWER)

Thesis title:

EXPLOITING LANGUAGE MODELS FOR FEW-SHOT OBJECT DETECTION

Students:    Reviewer:

Phạm Nhật Hoàng 20520052

Nguyễn Trần Tiến 20521011

Thesis evaluation

1. About the report:

Number of pages: ...    Number of chapters: ...

Number of data tables: ...    Number of figures: ...

Number of references: ...    Products: ...

Some comments on the format of the report:

3. About the application program:

Reviewer

(Signature and full name)


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY    SOCIALIST REPUBLIC OF VIETNAM

UNIVERSITY OF INFORMATION TECHNOLOGY

DETAILED PROPOSAL

Title (Vietnamese): KHAI THÁC MÔ HÌNH NGÔN NGỮ TRONG BÀI TOÁN PHÁT HIỆN ĐỐI TƯỢNG VỚI ÍT MẪU DỮ LIỆU

Title (English): EXPLOITING LANGUAGE MODELS FOR FEW-SHOT OBJECT DETECTION

Language: English

Advisor: Nguyễn Vinh Tiệp

Duration: from 15/09/2023 to 15/12/2023

Students:

Phạm Nhật Hoàng - 20520052    Class: KHTN2020

Email: 20520052@gm.uit.edu.vn    Phone: 0837890278

Nguyễn Trần Tiến - 20521011    Class: KHTN2020

Email: 20521011@gm.uit.edu.vn    Phone: 0963690428

Description: (Describe in detail the objectives, scope, subjects, methods, and expected results of the project)

Few-shot object detection aims to determine the location and class of objects in an image when only a few training samples are available. Most existing few-shot object detection methods focus on extracting image features while ignoring the relationships between classes and the support of additional information such as semantic information. In addition, few-shot object detection techniques suffer from a notable problem: because the training data is insufficient, their ability to fully explore semantic information is limited. To address this, we propose using knowledge distillation in few-shot object detection, combining additional semantic information and the relationships between classes to guide the model to detect novel instances well. Extensive experiments on the PASCAL VOC few-shot object detection benchmark show that our method achieves competitive results compared with state-of-the-art few-shot object detection methods.

Figure 1: The few-shot object detection problem. (Stage I: base training on abundant base images. Stage II: few-shot fine-tuning on a few novel shots with a fixed feature extractor.)

Modern deep learning methods have achieved impressive results on computer vision tasks. However, these models depend mainly on large-scale labeled training data, which is costly and not always available in practice. Such methods suffer a sharp drop in performance when the training data is reduced. In contrast, humans can learn new concepts from a few examples thanks to their prior knowledge. This desirable ability is modeled in machines as few-shot learning to address the problem above. Few-shot learning was introduced to train deep learning models effectively when data is limited, and it has been applied to computer vision tasks including classification, object detection, and segmentation.

Recently, there has been much research on few-shot object detection (FSOD). FSOD methods train a detector to identify candidate regions that contain novel objects and to classify them in images with scarce data. Typically, there are two datasets: base data with many labels per class, and novel data containing a few labeled samples (fewer than 10). Accordingly, few-shot detectors are trained on the base classes to learn the concepts of the vision task and are then fine-tuned on the novel dataset so that they can detect novel objects in the test set.

Some FSOD methods apply Siamese networks to re-weight the features for each class in order to capture novel objects. Other FSOD methods use data transformation or data augmentation to supplement the training data for novel classes. Recently, DeFRCN adjusted the gradients between modules in Faster R-CNN to decouple multi-task learning. Other methods explore attention mechanisms in transformer architectures to produce discriminative features.

However, the methods above pay little attention to exploiting additional signals that are extracted from a large-scale dataset or are explicitly defined, such as semantic information. Moreover, the relationships among classes, or between base classes and novel classes, are also explicitly represented in such data, which is important for improving the performance of object detection systems.

In this work, we present a new method to extract additional signals and exploit the relationships within them to improve few-shot detectors. Specifically, we propose a model that takes multi-modal input, which supports effective classification of novel objects thanks to the explicit definitions in their additional data. The model is then distilled to transfer this knowledge and to remove the need for the additional inputs.
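The distillation step mentioned above can be illustrated with a generic knowledge distillation loss. This is a minimal sketch, assuming the common formulation in which a student matches the temperature-softened class distribution of a teacher; the function names, the temperature value, and the example logits are hypothetical and are not taken from the thesis, whose actual losses may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared.

    Assumption: this is the generic distillation objective, used here only
    to illustrate the idea of transferring knowledge from a teacher.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0) if a probability underflows
    kl = sum(pi * (math.log(max(pi, eps)) - math.log(max(qi, eps)))
             for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# Hypothetical logits: the teacher sees image and text, the student only the image.
teacher = [2.0, 1.0, 0.1]
student = [0.5, 1.5, 0.2]
loss = distillation_loss(teacher, student)
```

In this sketch the loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, which is the property that lets the student drop the extra inputs at inference time.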

Our contributions are summarized as follows:

• We propose a new method that applies knowledge distillation to effectively exploit additional training data to improve few-shot detectors.

• We present an FSOD method that extracts the semantic relationships between base classes and novel classes to exploit their shared features.

• We provide empirical evidence that our model significantly outperforms the baseline methods.

Content and methodology:

- Content 1: Survey and design

* Survey the research related to the detection problem, few-shot object detection methods in general, and methods that combine semantic information between images and language (vision-language models).

* Based on the survey results, select the most suitable neural network architecture, design the deep learning model, design several ways to combine semantic information between images and language, and select or design a training method suited to the few-shot object detection model.

- Content 2: Understanding the data.

The data used for few-shot object detection usually consists of two main parts: base data and novel data.

• Base data: a dataset of images whose objects are annotated in detail, with a sufficiently large number of samples for each class. The base data is used to train the base model, helping the model learn the concepts of the base objects.

• Novel data: a dataset of images containing novel objects that the model has never seen during training on the base data. The novel data usually has very few samples (fewer than 10) for each novel object. It is used to fine-tune the model so that the model can detect novel objects in the test set.

The base data serves as the foundation for the model to learn how to detect and classify base objects, while the novel data helps the model adapt to novel objects it has not seen before. Combining these two types of data enables the model to detect objects automatically in the few-shot setting.

Several datasets are commonly used for few-shot object detection. Some representative examples are:

• PASCAL VOC: one of the most popular and widely used datasets. It covers many object categories, and its images are annotated in detail with the locations of the objects.

• COCO (Common Objects in Context): a large dataset with many object categories and high-resolution images. It provides diverse data for object detection and object segmentation.

- Content 3: Model training

» Select the optimization algorithm for the model and search for the related hyperparameters.

» Build and train the model.

- Content 4: Experimental evaluation and comparison

The performance of the proposed method is evaluated using Mean Average Precision (mAP), a common evaluation metric for object detection. The model is evaluated on the accuracy of detecting novel objects on the test set. mAP computes the average precision for each class and then takes the mean of these values.

Three values are reported, mAP, bAP, and nAP, where:

• mAP: measures the overall detection ability of the model on both base data and novel data.

• bAP: measures the detection ability of the model on base data, to check whether the model has "forgotten old knowledge".

• nAP: measures the detection ability of the model on novel data, to check the model's ability to learn from few samples.
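The relation between the three metrics can be made concrete with a small sketch. It assumes that per-class AP values have already been computed by a detection evaluator; the function names and the AP numbers below are hypothetical and serve only to show how mAP, bAP, and nAP are averaged over different class subsets.

```python
def mean(values):
    """Arithmetic mean that fails loudly on an empty class subset."""
    values = list(values)
    if not values:
        raise ValueError("no classes to average")
    return sum(values) / len(values)

def summarize_ap(per_class_ap, novel_classes):
    """Average per-class AP over all, base, and novel classes.

    per_class_ap: dict mapping class name to its AP (assumed precomputed).
    novel_classes: iterable of class names treated as novel; the rest are base.
    """
    novel = set(novel_classes)
    base_scores = [ap for name, ap in per_class_ap.items() if name not in novel]
    novel_scores = [ap for name, ap in per_class_ap.items() if name in novel]
    return {
        "mAP": mean(per_class_ap.values()),  # all classes
        "bAP": mean(base_scores),            # base classes only
        "nAP": mean(novel_scores),           # novel classes only
    }

# Made-up AP values for illustration only.
scores = summarize_ap({"cow": 0.8, "person": 0.6, "horse": 0.4}, novel_classes=["horse"])
```

A large gap between bAP before and after fine-tuning would indicate forgetting of the base classes, while nAP isolates how well the few novel shots were learned.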

Expected results:

* A survey of the research directions in few-shot object detection.

* Documentation describing the existing datasets, PASCAL VOC and COCO.

* Documentation of the technical settings used during training, and the resulting model weights.

* Evaluation results comparing the proposed method with related methods.

Implementation plan: (Describe the work plan and the assignment of tasks to each participating student)

Work plan:

* First month: Carry out content 1 - survey the research related to the problem.

* Next half month: Carry out content 2 - get familiar with the Detectron2 framework in order to implement and model the proposed method.

* Remaining half month: Carry out content 3 - train the model.

* Final month: Carry out content 4 - evaluate the proposed method and compare it with other methods. Write a report presenting the proposed method and the results in detail.

Task assignment:

- Phạm Nhật Hoàng: survey related research, direct the experiments, write the report.

- Nguyễn Trần Tiến: survey related research, run the experiments, assist in writing the report.

We would like to express our deepest gratitude to all those who contributed to the successful completion of this thesis.

First and foremost, we would like to thank our supervisor, Dr. Vinh-Tiep Nguyen, for their invaluable guidance and support throughout the course of this work. Dr. Tiep has been an incredible mentor, providing insightful feedback and suggestions that helped us make this thesis more complete.

We would also like to give special thanks to Anh-Khoa Nguyen Vu for his guidance and encouragement throughout our research. Thank you so much for your time and effort in helping us tackle the difficulties we faced during the research process.

We would also like to thank our family and friends for their support and encouragement throughout the entire process.

Finally, we would like to express our gratitude for the computing resources provided by MMLAB UIT and the Faculty of Computer Science, which enabled us to develop the algorithms and experiments presented in this thesis.

All in all, this project would not have been possible without the assistance and help of the people mentioned above. Our sincere appreciation goes out to each of them.

3.3 Our baseline: Decoupled Faster R-CNN (DeFRCN)

3.4 Improve Localization Awareness

List of Figures

1.1 Flow chart of Generalized Few-shot Object Detection.

1.2 Semantic relation between base and novel classes is constant regardless of the scarcity of novel class.

1.3 Typical challenges in object detection [47].

2.1 R-CNN algorithm [21].

2.2 Architecture of Faster R-CNN [61].

2.3 YOLOv1 model [60].

2.4 ... YOLO [60].

2.5 Architecture of DETR [5].

2.6 Illustration of the two-stage fine-tuning approach (TFA) [77].

2.7 Image-text contrastive learning in the CLIP model [56].

2.8 The architecture of knowledge distillation [22].

3.1 Overview of our approach. Our approach has 3 components: DeFRCN, Spatial Distillation, and Semantic Distillation. The Spatial module acts on the features via the Lxp loss. The Semantic module directly impacts the classification distribution of the FSOD model.

3.2 The architecture of our baseline: Decoupled Faster R-CNN (DeFRCN). Compared to the standard Faster R-CNN, the framework incorporates two Gradient Decoupled Layers (sky-blue) and an offline Prototypical Calibration Block (red) to enable decoupling for multi-stage and multi-task functions. The A denotes the affine transformation layer in GDL, while ⊗ represents the score fusion operation in PCB. Additionally, yellow and dark blue signify whether the block is trainable or frozen during fine-tuning, respectively. The solid orange and dotted black lines illustrate the forward flow and gradient flow within the system.

3.3 Architecture of the Spatial Distillation Module. This module guides the model by creating an attention map from the word embeddings of the object classes and the bounding boxes.

3.4 Similarity score of classes in PASCAL VOC. For example, the similarity between the "bicycle" (a base class) and "motorbike" (a novel class) embeddings is 0.57, where the maximum consistency index between any two given "motorbike" or "bicycle" is observed. This learning mechanism equips the FSOD model to overcome challenges arising from the lack of information between classes, enabling it to generalize and make informed predictions even in scenarios with limited annotated data.

3.5 Overview of the Semantic Distillation Module. The visual features are treated as Query vectors, and the class-word embedding vectors are treated as Key and Value vectors. Then, we apply some techniques to reduce bias in the training and fine-tuning sessions.

3.6 Cosine similarity of the "Background" embedding with each class embedding. It is easy to see that the word "Background" in the latent space of the pre-trained language model is expressed differently from the meaning of the background area in the image domain.

4.1 Examples from the PASCAL VOC 2012 dataset.

4.2 Visualization of the Spatial Distillation module: (left) ground truth; (mid) the backbone output feature after adding the semantic feature, which the model needs to learn; (right) the backbone output feature the model learned.

4.3 Example results of our model. The left side is the result of our model and the right side is the ground truth. In the left column (a-d), there are cases in which our model works well. On the other hand, there are cases that our model cannot handle (e-h).

4.4 Similarity score of the CLIP class embedding vectors in PASCAL VOC. The range of this similarity score is (0.6, 0.8), which is narrower than the range of the GloVe similarity score (-0.05, 0.6).

List of Tables

4.1 G-FSOD results on PASCAL VOC Novel Set 1 of a single run with Spatial ...

4.2 G-FSOD results on PASCAL VOC with SemD only.

4.3 ...

4.4 ...

4.5 Experiments choosing the vector to represent the background meaning in the ...

GDL Gradient Decoupled Layer

KD Knowledge Distillation

PCB Prototypical Calibration Block

R-CNN Region-based Convolutional Neural Network

ROI Region Of Interest

RPN Region Proposal Network

SemD Semantic Distillation

SpaD Spatial Distillation

SRR-FSD Semantic Relation Reasoning Few-Shot Object Detection

SSD Single Shot (Multibox) Detector

SVM Support Vector Machine

TFA Two-stage Fine-tuning Approach

VLM Vision-Language Model

YOLO You Only Look Once

Few-shot object detection aims to determine the location and class of objects in an image given only a few training samples. Most existing few-shot object detection methods focus on extracting features from images while ignoring the relations between classes and the aid of extra information such as semantic information. Our motivation is that the relationships between classes help detect objects of novel classes more effectively.

In addition, semantic information such as label information and object attributes helps the detector capture generic information about classes.

Along with that, vision-language models such as CLIP and BLIP have appeared. They are a category of artificial intelligence models designed to understand and bridge the gap between visual and textual information. These models combine computer vision and natural language processing techniques to process and generate content that incorporates both text and images.

In this thesis, we present a new method for few-shot object detection that uses extra text data and the relations between classes to guide models to detect novel instances well. Extensive experiments on the PASCAL VOC object detection benchmark show that the proposed method slightly improves on the baseline.

Keywords:

Object Detection, Few-shot Learning, Few-shot Object Detection, Language Models, Text Embedding

Chapter 1

Introduction

1.1 Overview

1.1.1 Practical Context

Over the last decade, advances in computing technology and Deep Neural Networks (DNNs), together with the widespread availability of large-scale datasets, have enabled machines to make remarkable progress in mimicking human abilities. Computer vision in particular has advanced in tasks such as image classification, object detection, and semantic segmentation, which are the foundation for numerous applications in fields like autonomous driving [23, 18], remote sensing [8, 84], robotics [54, 19], etc.

The problem arises because labeling massive amounts of data is expensive and time-consuming. In certain scenarios, like medical applications or identifying rare species [26], obtaining a large number of images might even be impossible. Although deep learning systems are excellent at recognizing patterns in labeled data, they struggle when faced with new, unseen information. Every time there is new data, the entire system has to be re-trained to understand these new patterns. This becomes impractical because new data is constantly becoming available.


Object detection is a long-standing task in computer vision that involves teaching machines to identify and locate various objects within digital images or video frames. It plays a crucial role in diverse applications, such as autonomous vehicles, surveillance systems, and medical imaging. Traditionally, object detection systems rely on extensive labeled datasets for training, enabling them to recognize predefined objects effectively. However, these systems often struggle when faced with new, unseen objects for which limited labeled data is available. This limitation motivates the problem known as few-shot object detection (FSOD).

In few-shot object detection, the goal is to produce a machine-learning model capable of recognizing and localizing new object classes using only a small number of examples during training. Unlike conventional object detection, which typically demands abundant labeled data for each object category, few-shot object detection focuses on learning from minimal samples, mirroring human-like adaptability when encountering novel objects. Addressing this challenge requires innovative methodologies that enable machines to generalize knowledge effectively from limited examples, thereby enhancing their ability to detect and classify objects accurately even when encountering unfamiliar categories. The flow chart of few-shot object detection is depicted in Figure 1.1.


We believe that the sensitivity to limited data comes from relying solely on visual information. Novel objects are learned only through images and independently for each class. Because visual data is limited in few-shot scenarios, this leads to a lack of visual information. However, the semantic connection between base and novel classes remains constant. For example, as shown in Figure 1.2, prior knowledge that the novel object "horse" looks similar to "cow" and that "person" can interact with "horse" aids in learning the concept of "horse" beyond solely using a few images.

FIGURE 1.2: Semantic relation between base and novel classes is constant regardless of the scarcity of novel class. (Base classes with many samples: cow, person. Novel class with few samples: horse.)
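One common way to quantify such semantic relations is the cosine similarity between class-name embeddings. The sketch below assumes that each class name has already been mapped to a vector by some pre-trained text model; the three-dimensional vectors and the helper names are made up for illustration and do not come from the thesis.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)

# Toy "word embeddings", invented for this example only.
embeddings = {
    "cow": [0.9, 0.1, 0.2],
    "person": [0.1, 0.9, 0.4],
    "horse": [0.8, 0.2, 0.3],
}

def most_related(novel, base_classes, table):
    """Return the base class whose embedding is closest to the novel class."""
    return max(base_classes,
               key=lambda name: cosine_similarity(table[novel], table[name]))

closest = most_related("horse", ["cow", "person"], embeddings)
```

With these toy vectors the novel class "horse" is closest to the base class "cow", and this relation holds no matter how few "horse" images are available.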

This thesis explores the domain of few-shot object detection, aiming to investigate novel approaches and techniques that help machines learn from sparse data to detect and localize new objects efficiently. By advancing our understanding and methodologies in this field, we aim to contribute to the development of more adaptable and robust object detection systems.


1.1.2 Problem Definition

FSOD aims to train a detector on a dataset with few samples for each class. The trained model can then locate and identify new objects in unseen images.

The FSOD setting has two disjoint sets of classes, base classes and novel classes, denoted $C_{base}$ and $C_{novel}$ respectively, i.e. $C_{base} \cap C_{novel} = \emptyset$. The base classes $C_{base}$ have sufficient training samples to learn the concept of object detection during base training, and the base model is considered a $|C_{base}|$-class detector. Meanwhile, the novel classes, which contain only $K$-shot samples ($K \in [1, 30]$) for each class, are used to refine the detector from the previous stage so that it can detect novel objects while retaining its performance on base classes. Therefore, FSOD works often apply two-stage fine-tuning: base training and novel fine-tuning.

In the base training stage, a base detector is trained on the base dataset $D_{base}$ to learn basic concepts of object detection such as localization and identification. The output of this stage is a base detector on the base classes $C_{base}$. It is then fine-tuned on the unseen novel classes $C_{novel}$ in the fine-tuning stage.

In particular, to learn knowledge from the novel classes $C_{novel}$ and preserve the performance on the base classes $C_{base}$, the model is fine-tuned on a joint dataset that contains all classes, with only $K$ samples for each class. As a result, the model can detect objects of both base classes and novel classes. This setting is called generalized few-shot object detection (G-FSOD) (as depicted in Figure 1.1). Furthermore, there are some FSOD settings in which the model is trained on only the novel dataset $D_{novel}$ in the fine-tuning stage. The model is then a $|C_{novel}|$-class detector. This setting is few-shot object detection (FSOD). We use the G-FSOD setting in this work.
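The construction of the joint fine-tuning set in the G-FSOD setting can be sketched as follows. This is a minimal illustration, assuming annotations are stored as (image_id, class_name, box) tuples; that format, the function names, and the toy data are hypothetical. Real benchmarks typically fix the K-shot splits in advance and must also handle images that contain several objects.

```python
import random
from collections import defaultdict

def split_classes(all_classes, novel_classes):
    """Return disjoint (base, novel) class sets, so base and novel never overlap."""
    novel = set(novel_classes)
    unknown = novel - set(all_classes)
    if unknown:
        raise ValueError(f"unknown novel classes: {sorted(unknown)}")
    base = set(all_classes) - novel
    return base, novel

def sample_k_shot(annotations, classes, k, seed=0):
    """Pick at most k annotated instances for each requested class.

    annotations: list of (image_id, class_name, box) tuples (assumed format).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ann in annotations:
        if ann[1] in classes:
            by_class[ann[1]].append(ann)
    subset = []
    for name in sorted(by_class):
        items = by_class[name]
        rng.shuffle(items)
        subset.extend(items[:k])
    return subset

# Toy data: five annotated instances per class.
names = ["cow", "person", "horse"]
annotations = [(i, name, (0, 0, 10, 10)) for name in names for i in range(5)]
base, novel = split_classes(names, novel_classes=["horse"])

# G-FSOD fine-tuning set: K shots for every class, base and novel alike.
gfsod_set = sample_k_shot(annotations, base | novel, k=2)
# FSOD fine-tuning set: K shots for the novel classes only.
fsod_set = sample_k_shot(annotations, novel, k=2)
```

The only difference between the two settings in this sketch is the class set passed to the sampler, which mirrors the distinction between a detector over all classes and a detector over novel classes only.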


1.1.3 Challenges

Besides its practical benefits, FSOD poses numerous challenges and difficulties:

• Scale Variation: Objects can appear at different scales in images (as shown in Figure 1.3(c)). This is especially difficult for very small objects (Figure 1.3(h)), because it is challenging to extract their features to identify and distinguish between them effectively.

• Occlusions and Clutter: Objects may be partially obscured by other objects or background clutter (as depicted in Figure 1.3(e)), leading to difficulties in accurately detecting and localizing them.

• Viewpoint and Pose Variation: Objects can have different orientations and poses (for example in Figure 1.3(d)), making it challenging for detectors to generalize across variations.

• Background Complexity: Complex backgrounds or environments can make it difficult for detectors to distinguish between objects and their surroundings, especially with camouflaged objects.

• Real-Time Processing: Achieving real-time object detection in applications like autonomous vehicles or live video streams poses challenges due to the need for speed and efficiency.

• Class Imbalance: In FSOD, certain classes may be overrepresented or underrepresented, leading to an imbalance between these classes in the training dataset. This biases the model so that it performs well on frequent classes but poorly on infrequent ones.


• Limited Labeled Data: FSOD models need to learn to detect objects with only a few annotated examples per class, which makes it challenging to generalize to unseen classes or variations within classes.

• Feature Generalization: Generalizing visual features effectively from a few examples to new instances is complex, especially when dealing with diverse object categories and variations in poses, scales, and viewpoints.

• Domain Shift: Adapting the knowledge learned from a few examples to a new domain or unseen scenarios (e.g., different lighting conditions, backgrounds, or image qualities) is a significant challenge.

• Fine-grained Discrimination: Discriminating between visually similar classes with subtle differences becomes harder when only a few instances per class are available for training.

• Model Robustness: Achieving robustness in FSOD against occlusions, partial object views, and other challenging scenarios requires sophisticated model architectures and training strategies. These scenarios may degrade the model's performance.

• Evaluation Metrics: Assessing the performance of FSOD models accurately with limited labeled data for testing is a challenge, and designing appropriate evaluation metrics becomes crucial.

In general, few-shot object detection faces many challenges. These challenges also make the research area more active, driving the development of more accurate, powerful, and efficient models. In this thesis, we put effort into addressing some of these issues, namely limited labeled data, feature generalization, domain shift, and fine-grained discrimination.


FIGURE 1.3: Typical challenges in object detection [47]. Panels: (a) Illumination, (b) Deformation, (c) Scale, Viewpoint, (e) Clutter, Occlusion, (f) Blur, (g) Motion.

1.2 Motivations

FSOD expands the application potential of object detection in everyday life. Some practical applications of FSOD include:

• Object Detection: Object detection is a fundamental task in computer vision, crucial for various applications such as autonomous driving, robotics, and surveillance.

• Limitations of Traditional Object Detection: Traditional object detection methods rely heavily on large-scale annotated datasets, posing challenges in scenarios where labeled data is scarce or costly to obtain. FSOD addresses these limitations by enabling models to generalize to new object classes with minimal training data.

• Resource Scarcity in Data Annotation: Collecting annotated data for object detection is resource-intensive and time-consuming. Few-shot learning mitigates this challenge by reducing the dependency on vast labeled datasets, making object detection more accessible and cost-effective.

• Adaptability to Novel Environments: In dynamic environments where new object categories emerge frequently, few-shot object detection models have the potential to adapt quickly, enabling seamless integration into evolving scenarios without extensive retraining.

• Human-Like Learning: The essence of few-shot learning parallels human cognition, where humans can recognize and classify new objects with minimal exposure. This characteristic drives the aspiration to develop AI systems that mimic human-like learning capabilities.

• Continual Learning and Adaptation: Few-shot learning models have the potential for continual learning, where they can continuously acquire knowledge about new objects and adapt to changing environments without forgetting previously learned information.

Research on FSOD is expanding and achieving significant results [77]. However, most of these methods rely on exploiting a few available training samples to fine-tune the model. This approach is simple and yields quick positive outcomes, but it is not truly effective because performance is low when data is extremely scarce. Additionally, some recent works have increased entity diversity by augmenting data. However, these approaches rely on external datasets to enhance model performance, which might not be feasible because real data collection can be very limited. Conversely, Zhang et al. [81] utilized features from new classes with a few samples to generate synthetic examples.


Specifically, they used these new samples to train the model, generating synthetic examples across multiple stages. While this method proves effective, it is prone to overfitting due to multiple rounds of fine-tuning on a restricted dataset.

The practical applications and current limitations discussed above have driven us to undertake this project. In summary, there are three main reasons for conducting this study:

• Few-shot object detection has practical applications and aids in expanding computer capabilities. Therefore, gathering and synthesizing knowledge about FSOD is essential for future advancements.

• The performance of current FSOD methods is still insufficient for practical deployment, especially in extremely scarce data conditions like one-shot scenarios.

• The research community has not done much work on leveraging the semantic information between classes to improve the performance of FSOD models.

1.3 Objectives

For the aforementioned reasons, we outline three main objectives in this study:

• Understand the basic concepts and survey recent approaches in FSOD and some methods related to leveraging semantic information.

• Research and propose techniques leveraging semantic information for FSOD.

• Conduct experiments on widely used benchmark datasets in FSOD to assess the effectiveness of the proposed methods.


1.4 Contributions

Our main contributions in this work are as follows:

• We conduct an overview of the FSOD problem and its standard datasets, and we explore and analyze recent advanced methods grouped by model design approach. The comprehensive review of related works in few-shot object detection is detailed in the next chapter.

• We propose a novel method that applies knowledge distillation to effectively utilize semantic information to improve few-shot detectors.

• We provide empirical evidence that our model outperforms the baseline methods.

1.5 Structure of the Thesis

This thesis is divided into five key sections, each of which is designed to provide the

reader with a comprehensive overview of the topic:

• Chapter 1 - Introduction: This section provides an overview of the problem, scope, challenges, motivations, objectives, and structure of this thesis.

• Chapter 2 - Literature review: This section reviews the existing works in the fields of Few-shot Learning, Object Detection, Few-shot Object Detection, Vision-Language Models, and Knowledge Distillation.

• Chapter 3 - Methodology: This section gives an overview of the proposed approach, explains it in detail, and outlines the objectives of this thesis.

• Chapter 4 - Experiments: This section presents the experimental results of the proposed approach and evaluates its effectiveness, followed by a discussion of the results and their implications.

• Chapter 5 - Conclusion and Future Work: This section summarizes the findings of this thesis and outlines possible future work in this area, as well as potential applications and avenues for further exploration.


Chapter 2

Literature Review

In this section, we review the literature on Few-shot Object Detection methods. We then discuss recent attempts to apply Language Models to Few-shot Object Detection. Finally, we discuss the related datasets used for training and evaluation.

2.1 Few-shot Learning

Few-shot learning refers to the ability to learn from a small number of labeled samples, inspired by the human ability to grasp new objects with minimal exposure. Recent deep learning models for computer vision tasks such as classification, image segmentation, object detection, and recognition rely primarily on supervised learning techniques. These models are trained on extensive datasets such as ImageNet [12], Microsoft COCO [46], and Open Images, which contain millions of labeled samples.

2.1.1 Concepts and Preliminaries

In 2020, Wang et al. provided a clear and widely accepted explanation of FSL by linking it to how machines learn. They stated that a computer program is considered to learn a task T when its performance P on that task improves through experience E. In image classification, for example, the program improves its accuracy P by training on a vast collection of experiences E consisting of numerous images annotated by human experts. Few-shot Learning is a type of machine learning problem specified by E, T, and P, where E contains only a limited number of examples with supervised information for the target T.

In few-shot learning, the terms support set (S) and query set (Q) describe the data used for training and testing, respectively. A handful of examples are chosen from each category within the dataset D and assigned to the support set. Similarly, corresponding data points from each category are selected and designated as the query set. In this setup, the model is trained using the support set and evaluated using the query set. Additionally, few-shot learning models are trained episodically: in each episode, new data points are sampled from the dataset D and allocated to the support set and the corresponding query set. This means that during each episode, the model is trained and evaluated using different support and query sets. Over successive episodes, the model gradually learns how to extract insights effectively from a small dataset [71].

To get a better understanding of FSL, it is important to know two key concepts: the N-way-K-shot problem and cross-domain FSL. The N-way-K-shot problem helps describe the specific challenges faced by FSL. Here, the support set is a small set of data used for training, which provides a reference for the actual testing done in the second phase; the query set is where the model needs to make predictions. Notably, the query samples come from the same N classes as the support set but never overlap with the support samples. When we say N-way-K-shot, it means the support set has N categories and K samples per category, so the entire task has only N × K labeled samples. For instance, N-way-1-shot means one-shot learning, and N-way-0-shot means zero-shot learning. As a concrete example, if the objective is to classify between cats and dogs, the task is a 2-way setup, given that only 2 classes (cats and dogs) are being learned. Furthermore, if the model is fed only 5 samples each of cats and dogs as the support set in the training phase, it is a 2-way 5-shot learning setup with 2 × 5 = 10 labeled samples.
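The episodic N-way-K-shot setup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the sampling code used in this thesis: the function name `sample_episode`, the parameter `n_query` (query samples per class), and the toy label list are all hypothetical.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng=None):
    """Sample one N-way K-shot episode from a list of per-example class labels.

    Returns (support, query): two lists of example indices. Both sets cover
    the same N classes but share no example.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Only classes with enough examples for both sets are eligible.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")

    support, query = [], []
    for c in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[c], k_shot + n_query)
        support.extend(picked[:k_shot])   # K labeled examples per class
        query.extend(picked[k_shot:])     # held-out examples of the same class
    return support, query


# Toy dataset (hypothetical): 4 classes with 10 examples each.
labels = [c for c in ["cat", "dog", "bird", "fish"] for _ in range(10)]
support, query = sample_episode(labels, n_way=2, k_shot=5, n_query=3,
                                rng=random.Random(0))
```

With `n_way=2` and `k_shot=5`, the support set holds 2 × 5 = 10 indices, matching the 2-way 5-shot example in the text. Calling the function repeatedly yields a different episode each time.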


The fundamental aim of few-shot learning is to effectively categorize new image data after exposure to only a limited number of training examples. In practice, few-shot learning proves valuable when adequate training examples are hard to obtain or too rare for standard deep learning models. Similarly, it holds significance in situations where labeling data is costly, such as in diagnosing uncommon diseases. When the available data are insufficient to define the problem, a potential solution involves acquiring knowledge from other similar problems within the same domain.

2.1.2 Approaches

Few-shot Learning, based on how knowledge is integrated, is broadly categorized into single-modal learning and multimodal learning. Single-modal learning is further divided into data augmentation, transfer learning, and meta-learning. These approaches focus on abstracting or transferring limited information into more advanced feature vectors or meta-knowledge. On the other hand, multimodal learning resembles human intelligence more closely by exploring different types of information to support FSL, moving beyond limited samples. Following this classification, we examine and discuss each method in our review.

Data Augmentation. In practical FSL scenarios, the support and query sets often have few samples due to privacy concerns and the high costs of data collection and labeling. To address this challenge, data augmentation is seen as a direct way to enrich the available samples in FSL. However, a crucial concern in FSL data augmentation is how well the expanded dataset represents the actual distribution of real data. FSL data augmentation methods are categorized into two types based on whether the techniques can be applied to different tasks: hand-crafted rules and automatically learned data processing.
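As an illustration of a hand-crafted augmentation rule, the sketch below flips an image horizontally and mirrors its bounding boxes so that the annotations stay valid for detection. It assumes boxes in `[x1, y1, x2, y2]` pixel coordinates; the function name `hflip_with_boxes` and the toy inputs are hypothetical and not taken from this thesis.

```python
import numpy as np


def hflip_with_boxes(image, boxes):
    """Flip an H x W x C image left-right and mirror its [x1, y1, x2, y2] boxes."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    mirrored = boxes.copy()
    mirrored[:, 0] = width - boxes[:, 2]  # new x1 comes from the old x2
    mirrored[:, 2] = width - boxes[:, 0]  # new x2 comes from the old x1
    return flipped, mirrored


# Toy 4 x 6 image with one box (hypothetical values).
image = np.arange(4 * 6 * 3).reshape(4, 6, 3)
boxes = [[1, 0, 3, 2]]
flipped, mirrored = hflip_with_boxes(image, boxes)
```

A rule like this is task-agnostic and cheap, but it only produces views that stay close to the original samples, which is one reason the representativeness concern raised above matters.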


Transfer Learning. Transfer learning, a classic learning paradigm [85], addresses the challenge of limited or no labeled samples in Few-Shot Learning (FSL) [3]. It revolves around reusing features, aiming to resolve the data scarcity in FSL scenarios. The fundamental approach involves pretraining the model on extensive data and subsequently fine-tuning it on a limited support set. However, when the source and target domains differ substantially, the efficacy of knowledge transfer diminishes significantly. This cross-domain context introduces a new hurdle for FSL. Within FSL, transfer learning is typically divided into pre-training and fine-tuning stages and is often considered the baseline approach. Figure depicts this general process.
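The pretrain-then-fine-tune recipe can be sketched in a few lines. The example below is a toy illustration under strong assumptions: the "pretrained backbone" is only a fixed random projection standing in for a real network, the support data are synthetic, and only a linear softmax head is fine-tuned. None of the names (`extract_features`, `finetune_head`) come from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a backbone pretrained on abundant base data. Assumption: it is
# just a fixed random projection with ReLU, and it stays frozen throughout.
W_backbone = rng.normal(size=(8, 16))


def extract_features(x):
    """Frozen feature extractor; rows are L2-normalised."""
    feats = np.maximum(x @ W_backbone, 0.0)
    return feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(features, labels, W):
    probs = softmax(features @ W)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()


def finetune_head(features, labels, n_classes, lr=0.5, steps=300):
    """Fit only a linear softmax head on the support features by gradient descent."""
    W = np.zeros((features.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        grad = features.T @ (softmax(features @ W) - onehot) / len(labels)
        W -= lr * grad
    return W


# Tiny support set: 2 novel classes, 5 shots each (synthetic data).
x_support = np.concatenate([rng.normal(-2.0, 1.0, size=(5, 8)),
                            rng.normal(2.0, 1.0, size=(5, 8))])
y_support = np.array([0] * 5 + [1] * 5)

feats = extract_features(x_support)
loss_before = cross_entropy(feats, y_support, np.zeros((16, 2)))
W_head = finetune_head(feats, y_support, n_classes=2)
loss_after = cross_entropy(feats, y_support, W_head)
```

Freezing the backbone keeps the number of trainable parameters small relative to the 10 support samples, which is one common way to limit overfitting during fine-tuning. Whether to unfreeze more layers is a design choice that depends on how far the target domain is from the source domain.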

Meta-learning. Meta-learning acquires historical knowledge from both data and tasks, extracting meta-knowledge for application to future tasks. It operates irrespective of specific problems, focusing on finding the best initialization parameters in task space, unlike traditional supervised learning, which disregards task-independent feature representation. While many meta-learning models currently update parameters using gradient descent, alternative methods exist, such as reinforcement learning and metric-based approaches. In Few-Shot Learning (FSL), meta-learning automates the learning of model parameters, metric functions, and information transfer.
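To make the metric-based family concrete, the sketch below classifies each query by its nearest class prototype, where a prototype is the mean support feature of a class. It assumes the features have already been produced by some embedding network, which is the part a meta-learner would actually train; the function name and the toy features are hypothetical.

```python
import numpy as np


def prototype_classify(support_feats, support_labels, query_feats):
    """Assign each query to the class whose mean support feature is nearest."""
    classes = np.unique(support_labels)
    prototypes = np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance between every query and every prototype.
    diffs = query_feats[:, None, :] - prototypes[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]


# Toy 2-way 2-shot episode with 2-dimensional features (hypothetical values).
support_feats = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
support_labels = np.array([0, 0, 1, 1])
query_feats = np.array([[1.0, 1.0], [9.0, 11.0]])
pred = prototype_classify(support_feats, support_labels, query_feats)
```

Here the prototypes are (0, 1) and (10, 11), so the first query is assigned to class 0 and the second to class 1. No parameters are updated at test time, which is what distinguishes this family from gradient-based meta-learning.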

Multimodal Complementary Learning. So far, Few-Shot Learning (FSL) has excelled mainly in the unimodal domain, where models represent information as feature vectors or higher-level semantic vectors. Multimodal learning in FSL aims to enhance feature representations by leveraging complementarities between modalities and eliminating redundancies among them. Just as parents teach babies using both general and specific information, incorporating multiple modalities becomes crucial in FSL, which inherently lacks sufficient valid information to assess the data or feature distribution accurately. Inspired by this, many studies have introduced information from additional modalities to address FSL challenges. By integrating multimodal information, models can better comprehend small-sample data. Figure xx illustrates the primary pathways of FSL in multimodal contexts.

Recent studies have highlighted the limitations of visual features in specific Few-Shot Learning (FSL) tasks. Leveraging the semantic space as supplementary information has shown promise in providing crucial context for visual features. Experiments demonstrated that adaptive combinations of multiple modalities significantly outperform unimodal FSL approaches. Wang et al. introduced weak semantic supervision for categories by integrating diverse visual features. Meanwhile, Schonfeld et al. [63] utilized variational autoencoders (VAEs) to model semantic features based on latent visual features. Schwartz et al. and Peng et al. expanded semantic information by incorporating class labels, attributes, natural language descriptions, and knowledge inference. Embedding loss functions aligned the additional semantic information with visual features, substantially reducing knowledge transfer costs. Aoxue et al. [42] took this further by utilizing semantic information to model classes hierarchically.
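One simple way to picture an adaptive combination of modalities is a per-class convex combination of a visual prototype and a semantic embedding. The sketch below is only an illustration under stated assumptions: the semantic embeddings (for example, word vectors of class names) are assumed to be already projected into the visual feature space, and the weight `lam` is supplied by hand, whereas an adaptive method would predict it. It does not reproduce any of the cited methods.

```python
import numpy as np


def fuse_prototypes(visual, semantic, lam):
    """Convex combination of per-class visual and semantic prototypes.

    visual, semantic: (n_classes, dim) arrays in the same embedding space.
    lam: per-class weight in [0, 1]; 1 trusts vision only, 0 trusts semantics only.
    """
    lam = np.clip(np.asarray(lam, dtype=float), 0.0, 1.0).reshape(-1, 1)
    return lam * visual + (1.0 - lam) * semantic


# Toy prototypes for two classes (hypothetical values).
visual = np.array([[1.0, 0.0], [0.0, 1.0]])
semantic = np.array([[0.0, 2.0], [2.0, 0.0]])
fused = fuse_prototypes(visual, semantic, lam=[1.0, 0.25])
```

Intuitively, a class with very few or unrepresentative shots would receive a smaller `lam`, so its prototype leans more on the semantic prior.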

2.2 Object Detection

General object detection has developed significantly through two paradigms in deep neural networks: the two-stage proposal-based approach and the one-stage proposal-free approach. Notably, both paradigms have shown impressive progress across various large-scale benchmarks. The two-stage approach, such as the R-CNN series, performs detection by first generating a set of potential objects through a region proposal network (RPN). Following this, it carries out category classification and box localization, achieving end-to-end detection. Conversely, one-stage detectors aim to produce final predictions directly from the feature map without the RPN module, leading to faster inference.


[Figure: R-CNN: Regions with CNN features. 1. Input image; 2. Extract region proposals (~2k); 3. Compute CNN features; 4. Classify regions.]

FIGURE 2.1: R-CNN Algorithm [21].

2.2.1 Two-Stage Models

The R-CNN series pioneered the two-stage proposal-based approach, marking a significant shift in the landscape of object detection. The original R-CNN (Region-based Convolutional Neural Network) introduced the concept of generating region proposals using selective search and then classifying these proposals individually. Subsequent refinements improved computational efficiency: Fast R-CNN shares the convolutional computation across all proposals of an image, and Faster R-CNN integrates the region proposal generation step into the network itself by introducing the Region Proposal Network (RPN).

R-CNN. The R-CNN model generates about 2000 region proposals for each image using selective search [69]. Each proposal is cropped from the image and warped to a fixed size (e.g., 227×227). After resizing, each region is passed through a CNN that transforms it into a 4096-dimensional vector, which is then classified using an SVM (shown in Figure 2.1). While this approach significantly improves performance over previous methods, running the CNN about 2000 times per image to generate the feature vectors is time-consuming, which limits the algorithm's practical deployment in real-life applications.
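The crop-and-warp step described above can be sketched as follows. This is a simplified illustration, assuming integer `(x1, y1, x2, y2)` boxes and nearest-neighbour resizing; it omits details of the original pipeline such as context padding, and the proposal boxes and random image are hypothetical.

```python
import numpy as np


def warp_region(image, box, size=227):
    """Crop box = (x1, y1, x2, y2) from an H x W x C image and resize it to
    size x size with nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    # Map each output row/column back to a source row/column of the crop.
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]


image = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
proposals = [(10, 20, 110, 220), (300, 100, 500, 400)]  # hypothetical boxes
batch = np.stack([warp_region(image, box) for box in proposals])
```

Each warped region would then be fed to the CNN separately. With roughly 2000 proposals per image, that means roughly 2000 forward passes per image, which is the bottleneck noted in the text.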

Fast R-CNN. Unlike R-CNN, Fast R-CNN utilizes feature map-based region synthesis for efficient proposal extraction. Operating on an image and proposed


Title	Khai thác mô hình ngôn ngữ trong bài toán phát hiện đối tượng với ít mẫu dữ liệu (Exploiting Language Models for Few-Shot Object Detection)
Authors	Pham Nhat Hoang, Nguyen Tran Tien
Advisor	Dr. Nguyen Vinh Tiep
Institution	University of Information Technology
Major	Computer Science
Type	Thesis
Year of publication	2024
City	Ho Chi Minh City
