Code: 8.48.01.01
MASTER'S THESIS
The master's thesis evaluation committee consists of:
1. Dr. Nguyễn Đức Dũng - Chair
2. Dr. Nguyễn Tiến Thịnh - Secretary
3. Dr. Võ Thị Ngọc Châu - Reviewer
4. Assoc. Prof. Dr. Nguyễn Tuấn Đăng - Reviewer
5. Assoc. Prof. Dr. Huỳnh Trung Hiếu - Member
Confirmation by the Chair of the thesis evaluation committee and the Dean of the faculty managing the major after the thesis has been revised (if applicable).
COMPUTER SCIENCE AND ENGINEERING
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
MASTER'S THESIS ASSIGNMENT
Student name: Trần Công Hậu
Student ID: 1970121
Date of birth:
Place of birth: Long An
Major: Computer Science
Code: 8.48.01.01
I. THESIS TITLE: Visual Question Answering System (Hệ thống trả lời câu hỏi trực quan)
II. TASKS AND CONTENT:
Answer questions about images in Vietnamese. Develop a question answering system based on existing models. Compare and evaluate the model on a Vietnamese dataset and an English dataset. Produce suitable answers to text questions asked in Vietnamese.
III. DATE OF ASSIGNMENT: 22/02/2021
IV. DATE OF COMPLETION: 13/06/2021
V. SUPERVISOR: Assoc. Prof. Dr. Quản Thành Thơ
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to Assoc. Prof. Dr. Quản Thành Thơ, who directly supervised me and guided me wholeheartedly throughout this thesis. He also gave me invaluable advice on both professional knowledge and career development. I thank him for the knowledge he has passed on to me. I also sincerely thank all the lecturers in the faculty for their dedicated help in completing this thesis. Finally, I am grateful to my seniors and fellow students who studied with me, helped me complete this Master's thesis, and gave me feedback along the way.
Ho Chi Minh City
Trần Công Hậu
THESIS SUMMARY
Visual question answering is one of the relatively new topics in the field of artificial intelligence. Given an input consisting of an image and a text question about that image, the system produces a meaningful answer that is relevant to the question. This means a visual question answering system must be able to process images, for example through entity recognition, object detection, and activity recognition.
In this thesis, I aim to use the model to answer text-based questions in Vietnamese. The evaluation on the existing Vietnamese dataset used in this thesis achieved an overall accuracy of 64.77%. Through this thesis, I hope to make a small contribution to the research community working on the Vietnamese language.
ABSTRACT
Visual question answering is a relatively new topic in the field of artificial intelligence. The system takes an image and a question related to that image as input and gives a meaningful, relevant answer to the question. A visual question answering system therefore needs image processing capabilities such as entity recognition, object detection, and activity recognition.
In this thesis, I aim to use the model to answer text-based questions in Vietnamese. The evaluation results on the existing Vietnamese dataset in this thesis show an overall accuracy of 64.77%. Through this thesis, I wish to make a small contribution to the Vietnamese language research community.
DECLARATION
I declare that the work presented in this thesis was carried out by me and that no part of this thesis has been submitted to obtain a degree at this or any other institution. If this statement is found to be untrue, I take full responsibility for my thesis.
Declarant
Trần Công Hậu
Chapter 2 RELATED WORK
2.1 Bottom-Up and Top-Down Attention architecture
LIST OF ABBREVIATIONS
R-CNN Region-based Convolutional Neural Networks
Figure 15: Multi-head attention [15]
Figure 16: Model for the problem
Figure 17: Model for processing the initial question and image
Figure 18: Encoder-Decoder model [4]
Figure 19: Image used as input
Chapter 1 INTRODUCTION
1.1 Overview
In everyday life, people can easily look at an image and answer any question about it by drawing on common knowledge and their own life experience. There are exceptions, however: visually impaired users or intelligence analysts may want to actively gather visual information from an image but cannot work out the answer themselves. Building a question answering system that addresses this problem is therefore a necessary task.
The purpose of this research is to study and build a visual question answering (Visual Question Answering) [1] system based on artificial intelligence (Artificial Intelligence), which takes an image and a question as input and produces an answer in Vietnamese.
The system will be able to answer entirely different questions about a single image. For any image, the visual question answering system should locate and detect the object referred to in the question, and it must have some common knowledge to answer the question. In addition, a VQA system needs to answer a question in a human-like way in the following respects:
- Learn visual and textual knowledge from the input (the image and its corresponding question)
- Combine the two data streams
- Use this enriched knowledge to generate the answer
Figure 1: Functions of a VQA system
1.2 Challenges of the thesis
In this thesis, I aim to build a question answering system that works together with images. Carrying it out requires a good deal of new knowledge in computer vision, image processing, and natural language processing. A further challenge is that the problem must be solved in Vietnamese.
1.3 Research objectives of the thesis
The objective of this thesis is to solve the problem of answering Vietnamese questions based on a given image. Given a question and an image, the system returns an answer in Vietnamese whose content is relevant to both the question and the image.
1.4 Scope and subject of the research
In this research, I integrated state-of-the-art results from applying deep learning to question answering, together with preprocessing of the initial data (the question and the image), so that the model can reach better performance. I carried out the following research tasks:
- Study models for visual question answering and survey the existing datasets
- Convert the language of the problem from English to Vietnamese
- Develop a Vietnamese VQA system based on existing models
- Evaluate the results obtained
1.5 Research output
After completing this research project, I hope to obtain a system that can answer Vietnamese questions, given input data consisting of an image and a question related to that image.
Chapter 2 RELATED WORK
While researching this topic, I studied a number of works related to the visual question answering problem. They include papers that achieved top results in the annual Visual Question Answering competitions.
2.1 Bottom-Up and Top-Down Attention architecture
Top-down and bottom-up attention mechanisms have been widely used in image captioning and visual question answering. In the paper "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" [2], the authors proposed a mechanism that combines bottom-up and top-down attention. The bottom-up attention mechanism proposes a set of salient image regions, each represented by a pooled feature vector. The authors implemented bottom-up attention using Faster R-CNN [8]. The top-down attention mechanism uses task-specific context to predict an attention distribution over the image regions. The attended feature vector is then computed as the weighted average of the image features over all regions. To evaluate the model, the authors proceeded in two steps. First, they used an image captioning model to quickly obtain information about the salient image regions. They then ran experiments and evaluated the results.
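The weighted average used by top-down attention can be illustrated with a minimal sketch. It assumes that a relevance score per region is already available; in the paper these scores come from a learned network, which is omitted here. The function names, the toy two-dimensional features, and the scores are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, scores):
    """Weighted average of region feature vectors under softmax attention weights."""
    weights = softmax(scores)
    dim = len(region_features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended

# Three hypothetical regions with 2-dimensional features (real features are far larger).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical task-specific relevance scores, one per region.
scores = [0.0, 0.0, 0.0]
weights, attended = attend(regions, scores)
```

With equal scores every region receives the same weight, so the attended vector is the plain mean of the region features; raising one score shifts the result toward that region.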
This work took first place in the 2017 Visual Question Answering Challenge, reaching an overall accuracy of 70.3% on the VQA v2.0 test-std dataset.
2.2 Pythia architecture
In the paper "Pythia v0.1: The winning entry to the VQA challenge 2018" [3], the authors introduced a deep learning model for the Visual Question Answering problem. The model builds on the Bottom-Up and Top-Down Attention model [2] described above, with several additions intended to improve prediction accuracy.
The model, illustrated in the figure below, extracts features from the image and uses element-wise multiplication to combine the image features with the question features. The result is an attended tensor that carries the information linking the content of the question to the relevant objects in the image.
Figure 2: Pythia architecture [3]
The additions in the Pythia architecture include:
- Model Architecture: uses element-wise multiplication to combine the features from the text and image modalities
- Learning Schedule: changes the learning rate during training
- Fine-Tuning Bottom-Up Features: sets the learning rate to 0.1 times the overall learning rate
- Data Augmentation: adds more training data
- Model Ensembling: selects models trained with different settings, using the up-down model pretrained on the VQA dataset
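The element-wise multiplication named in the Model Architecture item can be sketched as follows. This is only an illustration under the assumption that the image and question features have already been projected to the same length; the function name and the toy values are hypothetical and do not come from the Pythia code base.

```python
def elementwise_fuse(image_vec, question_vec):
    """Combine two equal-length feature vectors by multiplying matching entries."""
    if len(image_vec) != len(question_vec):
        raise ValueError("both feature vectors must have the same length")
    return [i * q for i, q in zip(image_vec, question_vec)]

# Hypothetical 4-dimensional features (real models use hundreds or thousands of dimensions).
image_vec = [0.5, 2.0, 0.0, 1.0]
question_vec = [2.0, 0.5, 3.0, 1.0]
fused = elementwise_fuse(image_vec, question_vec)
```

A dimension survives in the fused vector only when both modalities activate it: the third entry is zero because the image feature is zero there, regardless of the question feature.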
The authors reported their accuracy on the test-std split of the VQA v2 dataset.
2.3 Modular Co-Attention Networks architecture
Built on the Transformer model, Modular Co-Attention Networks (MCAN) [4] was introduced in 2019 and achieved the best result in the Visual Question Answering competition. In the paper, the authors proposed a modular co-attention network (MCAN) made up of modular co-attention layers.
... the regions that may correspond to objects in the image, and normalizes them to a uniform vector form; here the authors map each region to a vector with a fixed number of dimensions.
In the Modular Co-Attention block, the authors combine the features extracted from the image through the bottom-up attention mechanism with the given question, which is encoded using GloVe (Global Vectors for Word Representation) [9] and an LSTM [10], and produce the result through a classification sub-problem. The use of two mechanisms, Self-Attention and Guided-Attention, is what makes the model effective: they are linked together so that attention operates on both inputs, the question and the image, which helps solve the problem well.
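The difference between Self-Attention and Guided-Attention can be shown with a minimal sketch built on scaled dot-product attention. It is a simplification under stated assumptions: the learned projections, multiple heads, residual connections, and layer normalization of the real MCAN layers are all omitted, and the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query receives a weighted mean of the values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

def self_attention(x):
    """Queries, keys, and values all come from the same input."""
    return attention(x, x, x)

def guided_attention(x, y):
    """Queries come from x; keys and values come from the guiding input y."""
    return attention(x, y, y)

question = [[1.0, 0.0], [0.0, 1.0]]            # two hypothetical word vectors
image = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]   # three hypothetical region vectors

question_sa = self_attention(question)           # words attend to each other
image_ga = guided_attention(image, question_sa)  # regions attend to the question
```

In this sketch the image regions act as queries and the question supplies the keys and values, so each region is re-expressed in terms of the words most related to it.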
The MCAN model is shown in the diagram below and has three stages: processing the input question and image, applying Deep Co-Attention Learning to obtain the question and image features, and then fusing those features to produce the answer.
Figure 3: MCAN architecture [4]
The model's accuracy is reported on the test-std split of VQA-v2.
2.4 ImageBERT architecture
Inspired by BERT [11], the well-known natural language processing architecture released by Google in 2018, a research group at Microsoft proposed the ImageBERT model [12], which achieved impressive results on multimodal problems. ImageBERT encodes both the image and the text at the feature extraction layer. The encodings are then passed to multi-head self-attention blocks and trained on four unsupervised training tasks, including:
- Masked Language Modeling (MLM): as in the original BERT, this task predicts the words that have been masked out
- Masked Object Classification: a task developed as an extension of MLM
2.5 Evaluation
The papers reviewed above were all proposed to solve the Visual Question Answering problem. They were published in different years, and accuracy has increased considerably from year to year. The table below summarizes the characteristics of the models and some results from experiments on the VQA-v2 dataset.
Model: Bottom-up and top-down attention
Characteristics: Proposes a mechanism that combines bottom-up attention and top-down attention:
- Uses Faster R-CNN to extract the image features and top-down attention to extract the information from the question
- Uses a captioning model to quickly obtain information about the salient image regions

Model: Pythia
Characteristics:
- Fine-Tuning Bottom-Up Features: sets the learning rate to 0.1 times the overall learning rate
- Data Augmentation: adds more training data
- Model Ensembling: selects models trained with different settings, using the up-down model pretrained on the VQA dataset

Model: MCAN
Characteristics: Combines the features extracted from the image through the bottom-up attention mechanism with the given question, encoded using GloVe, to produce the result through a classification sub-problem.
Result on the VQA-v2 dataset: 75.23% and 75.26% on test-std and test-challenge

Model: ImageBERT
Characteristics: A Transformer-based model that takes different modalities as input and models the relationship between them. ImageBERT encodes both the image and the text at the feature extraction layer. The encodings are then passed to multi-head self-attention blocks and trained on four training tasks:
- Masked Language Modeling: predicts the words that have been masked out
- Masked Object Classification: the model randomly masks a number of object tokens
- Masked Region Feature Regression: randomly masks object regions in the image
- Image-Text Matching: links the image regions with the words extracted from the text

Table 1: Evaluation of the models
Chapter 3 BACKGROUND KNOWLEDGE
3.1 Theoretical background
3.1.1 Artificial neural networks
An artificial neural network (ANN) [13] is an information processing model inspired by the way biological nervous systems work. It consists of a large number of neurons connected together to process information.
3.1.1.1 Perceptron model
The Perceptron is the simplest neural network model, with only an input layer and an output layer. It is also called a linear separator and is used to solve linear classification problems. The figure below shows a Perceptron that uses the sigmoid function. As an example, a perceptron of the form $\hat{y} = \sigma(w_0 + w_1 x_1 + w_2 x_2)$ works in two steps:
- Compute the linear sum: $z = 1 \cdot w_0 + w_1 x_1 + w_2 x_2$, where $w_0$ is called the bias
- Compute the output value returned by the activation function: $\hat{y} = \sigma(z)$
Figure 4: Perceptron model
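The two steps of the perceptron can be written out as a minimal sketch. The weights and inputs below are hypothetical values picked so that the linear sum is exactly zero; they do not come from the thesis.

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x1, x2, w0, w1, w2):
    """Forward pass of a two-input perceptron with bias w0."""
    z = 1 * w0 + w1 * x1 + w2 * x2   # step 1: linear sum
    return sigmoid(z)                # step 2: activation

# Hypothetical weights and inputs, chosen so that z = -1 + 0.5 + 0.5 = 0.
y_hat = perceptron(x1=1.0, x2=2.0, w0=-1.0, w1=0.5, w2=0.25)
```

Because the sigmoid of 0 is 0.5, this input lies exactly on the decision boundary; inputs giving a positive z produce an output above 0.5 and inputs giving a negative z produce an output below it.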
3.1.1.2 Multilayer Perceptron model
The Multilayer Perceptron is a model with a more general structure than the Perceptron and can solve non-linearly separable problems. It is widely used in object classification and in discovering complex relationships in data, and it serves as the foundation for researching and devising complex deep learning architectures in computer vision and natural language processing.
A Multilayer Perceptron consists of the following components:
- An input layer
- An output layer
- The layers between the two layers above, called hidden layers
Figure 5: Multilayer Perceptron model
Each layer in this network can contain one or more units, called nodes. Each node in a layer is connected to all nodes in the preceding layer (except for the input layer, which has no preceding layer).
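The fully connected structure described above can be sketched as a forward pass in which every node of a layer reads all outputs of the preceding layer. This is a minimal illustration that assumes sigmoid activations throughout; the layer sizes and weights are hypothetical.

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """Fully connected layer: each node combines every output of the previous layer."""
    return [
        sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
        for row, b in zip(weights, biases)
    ]

def mlp_forward(x, layers):
    """Pass the input through each (weights, biases) layer in order."""
    activation = x
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation

# Hypothetical network: 2 input nodes -> 3 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),  # hidden layer
    ([[1.0, -1.0, 0.0]], [0.0]),                               # output layer
]
output = mlp_forward([1.0, 1.0], layers)
```

Each row of a weight matrix holds the incoming connections of one node, so the number of rows is the number of nodes in the layer and the row length is the number of nodes in the preceding layer.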