COMPUTER SCIENCE AND ENGINEERING
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to Assoc. Prof. Dr. Quản Thành Thơ, who directly supervised me and guided me with great dedication throughout this thesis. He also consistently gave me invaluable advice, both on specialized knowledge and on career direction. I thank him for the knowledge he has passed on to me. I also sincerely thank all the lecturers of the faculty for their dedicated help in completing this thesis. Finally, I thank my seniors and fellow students for all their help and for their feedback while I was writing this Master's thesis.
Ho Chi Minh City, day … month … year …
Trần Công Hậu
THESIS SUMMARY
Visual question answering is a relatively new topic in the field of artificial intelligence. Given an input consisting of an image and a text question about that image, the system produces a meaningful answer that is relevant to the question. A visual question answering system therefore needs image-processing capabilities such as entity recognition, object detection, and activity recognition.
In this thesis, I aim to use a model that can answer text questions in Vietnamese. Evaluated on the Vietnamese dataset available for this thesis, the model reaches an overall accuracy of 64.77%. Through this thesis, I hope to make a small contribution to the Vietnamese-language research community.
ABSTRACT
Visual question answering is a relatively new topic in the field of artificial intelligence. A visual question answering system takes an image and a question about that image as input and returns a meaningful, relevant answer. Such a system needs image-processing capabilities such as entity recognition, object detection, and activity recognition.
In this thesis, I aim to use a model that can answer text-based questions in Vietnamese. Evaluated on the existing Vietnamese dataset, the model achieves an overall accuracy of 64.77%. Through this thesis, I wish to make a small contribution to the Vietnamese language research community.
DECLARATION
I declare that the work presented in this thesis is my own and that no part of this thesis has been submitted for a degree at this or any other institution. If this is not the case, I take full responsibility for my thesis.
Declarant
Trần Công Hậu
TABLE OF CONTENTS
MASTER'S THESIS ASSIGNMENT
ACKNOWLEDGMENTS
THESIS SUMMARY
ABSTRACT
DECLARATION
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
Chapter 1: INTRODUCTION
1.1 Overview
1.2 Challenges of the topic
1.3 Research objectives
1.4 Scope and subject of the research
1.5 Research outputs
Chapter 2: RELATED WORK
2.1 Bottom-Up and Top-Down Attention architecture
2.2 Pythia architecture
2.3 Modular Co-Attention Networks architecture
2.4 ImageBERT architecture
2.5 Evaluation
Chapter 3: BACKGROUND KNOWLEDGE
3.1 Theoretical foundations
3.1.1 Artificial neural networks
3.1.2 Activation functions
3.1.3 Loss functions
3.2 Background in natural language processing
3.2.1 Word Embedding
3.2.2 Continuous Bag-of-Words model
3.2.3 Skip-gram model
3.2.4 GloVe model
3.3 Long Short-Term Memory
3.3.1 Recurrent Neural Network
3.3.2 Long Short-Term Memory network
3.4 Regional-based Convolutional Neural Networks architecture
3.4.1 R-CNN architecture
3.4.2 Fast R-CNN architecture
3.4.3 Faster R-CNN architecture
3.5 Modular Co-Attention
3.5.1 Self-Attention mechanism
3.5.2 Multi-Head Attention mechanism
3.5.3 Modular Co-Attention Layer
Chapter 4: RESEARCH METHODOLOGY
4.1 Model
4.1.1 Question and image processing
4.1.2 Deep Co-Attention learning
4.1.3 Output classification
4.2 Data collection method
4.2.1 VQA-v2 dataset
4.2.2 VQA-v dataset in Vietnamese
4.3 Processing method for the Vietnamese dataset
Chapter 5: EXPERIMENTS AND EVALUATION OF RESULTS
5.2.2 Evaluation of yes/no questions
5.2.3 Evaluation of other question types
5.2.4 Overall evaluation
Chapter 6: CONCLUSION
6.1 Summary of the thesis
6.2 Proposed extensions of the thesis
REFERENCES
CURRICULUM VITAE
LIST OF ABBREVIATIONS
R-CNN Regional-based Convolutional Neural Networks
LIST OF TABLES
Table: Evaluation of the models
Table: Summary of the VQA-v dataset in Vietnamese
Table: Evaluation of counting questions
Table: Evaluation of yes/no questions
Table: Evaluation of other question types
Table: Comparison of results between English and Vietnamese
LIST OF FIGURES
Figure 1: Functions of a VQA system
Figure 2: Pythia architecture []
Figure 3: MCAN architecture [4]
Figure 4: Perceptron model
Figure 5: Multilayer Perceptron model
Figure 6: Illustration of the CBOW model
Figure 7: Skip-gram model
Figure 8: RNN model
Figure 9: LSTM network model []
Figure 10: R-CNN architecture [6]
Figure 11: Fast R-CNN architecture [7]
Figure 12: Faster R-CNN architecture [8]
Figure 13: Example of self-attention
Figure 14: Self-Attention
Figure 15: Multi-head attention [15]
Figure 16: Model for the problem
Figure 17: Initial question and image processing model
Figure 18: Encoder-Decoder model [4]
Figure 19: Image used as input
Chapter 1: INTRODUCTION
1.1 Overview
In everyday life, people can easily look at an image and answer almost any question about it by drawing on common knowledge and life experience. There are exceptions, however: a visually impaired user, or an intelligence analyst who wants to actively gather visual information from an image, may be unable to find the answer themselves. Building a question-answering system that addresses this problem is therefore a necessary task.
The goal of this research is to study and build a Visual Question Answering (VQA) system [1] based on artificial intelligence that takes an image and a question as input and produces an answer in Vietnamese.
The system should be able to answer entirely different questions about the same image. For any image, it must locate and detect the object that the question refers to, and it needs some common-sense knowledge to answer. A VQA system must also answer a question much as a human would, in the following respects:
- Learn visual and textual knowledge from the input (the image and its corresponding question)
- Combine the two data streams
- Use this higher-level knowledge to produce the answer
Figure 1: Functions of a VQA system
1.2 Challenges of the topic
In this thesis, I aim to build a question-answering system that works together with images. Carrying this out requires a good deal of new knowledge in computer vision, image processing, and natural language processing. A further challenge is that the problem must be solved in Vietnamese.
1.3 Research objectives
The objective of this thesis is to solve the problem of answering Vietnamese questions about a given image. Given a question and an image, the system returns an answer in Vietnamese whose content is relevant to both the question and the image.
1.4 Scope and subject of the research
In this research, I integrated state-of-the-art results from applying deep learning to question answering, together with preprocessing of the input questions and images, so that the model can achieve better performance. I carried out the following research tasks:
- Study visual question answering models and the existing datasets
Chapter 2: RELATED WORK
While researching this topic, I studied several works related to the visual question answering problem. They include papers that achieved top results in the annual Visual Question Answering challenges.
2.1 Bottom-Up and Top-Down Attention architecture
Top-down and bottom-up attention mechanisms have been widely used in image captioning and visual question answering. In the paper "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" [2], the authors propose a mechanism that combines the two. The bottom-up attention mechanism proposes a set of salient image regions, each represented by a pooled feature vector. The authors implement bottom-up attention using Faster R-CNN [8]. The top-down attention mechanism uses task-specific context to predict an attention distribution over the image regions. The attended feature vector is then computed as a weighted average of the image features over all regions. The authors evaluate the model in two steps: they first use an image captioning model to quickly obtain information about the salient image regions, and then run experiments and evaluate the results.
This work took first place in the Visual Question Answering Challenge, reaching an overall accuracy of 70.3% on the VQA v2.0 test-std set.
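The weighted-average step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `softmax` and `attend`, the toy region features, and the relevance scores are hypothetical, and in the paper the scores come from a learned network conditioned on the task context.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, scores):
    """Return the attention-weighted average of the region features.

    region_features: one feature vector (list of floats) per image region.
    scores: one unnormalized relevance score per region (assumed given).
    """
    weights = softmax(scores)
    dim = len(region_features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return attended, weights


# Three hypothetical regions with 2-dimensional toy features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal scores give a plain average; a trained scorer would differ per region.
scores = [0.0, 0.0, 0.0]
attended, weights = attend(regions, scores)
```

With equal scores every region receives the same weight, so the attended vector reduces to the mean of the region features; unequal scores shift it toward the regions judged more relevant.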
2.2 Pythia architecture
In the paper "Pythia v0.1: The winning entry to the VQA challenge 2018" [], the authors introduce a deep learning model for the Visual Question Answering problem. The model builds on the Bottom-Up and Top-Down Attention model [2] described above, with several additions intended to improve prediction accuracy.
The model, illustrated in the figure below, extracts features from the image and uses element-wise multiplication to combine the image and question features into an attended tensor that carries the information linking the question content to the relevant objects in the image.
Figure 2: Pythia architecture []
The additions in the Pythia architecture include:
- Model Architecture: uses element-wise multiplication to combine the features from the text and image modalities
- Learning Schedule: changes the learning rate during training
- Fine-Tuning Bottom-Up Features: scales the learning rate for these features relative to the overall learning rate
- Data Augmentation: adds more training data
- Model Ensembling: selects models trained with different settings, using the up-down model pretrained on the VQA dataset
The authors report their result on the VQA v2 test-std set.
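The element-wise multiplication used to combine the two modalities can be illustrated with a small sketch. The function name `hadamard_fuse` and the toy vectors are assumptions made for illustration; the sketch also assumes that the image and question features have already been projected to the same dimension, which is a precondition of any element-wise product.

```python
def hadamard_fuse(image_feature, question_feature):
    """Combine two equal-length feature vectors by element-wise multiplication."""
    if len(image_feature) != len(question_feature):
        raise ValueError("both feature vectors must have the same dimension")
    return [i * q for i, q in zip(image_feature, question_feature)]


# Hypothetical, already-projected features of dimension 3.
image_feature = [0.5, 2.0, -1.0]
question_feature = [2.0, 0.5, 3.0]
fused = hadamard_fuse(image_feature, question_feature)
```

Compared with concatenation, the product keeps the dimension unchanged and lets each question dimension scale the matching image dimension.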
2.3 Modular Co-Attention Networks architecture
Building on the Transformer model, Modular Co-Attention Networks (MCAN) [4] achieved the best result in the Visual Question Answering challenge. In the paper, the authors propose a modular co-attention network (MCAN) composed of modular co-attention layers
… the regions that may be objects in the image are obtained and normalized to a uniform vector representation; here the authors map each region to a vector of fixed dimensionality.
In the Modular Co-Attention block, the authors combine the features extracted from the image through the bottom-up attention mechanism with the given question, which is encoded through GloVe (Global Vectors for Word Representation) [9] and an LSTM [10], and produce the result by solving a classification subproblem. The two mechanisms, Self-Attention and Guided-Attention, are what make the model effective: linked together, they apply attention to both inputs, the question and the image, and so solve the problem well.
The MCAN model, shown in the figure below, has three stages: processing the input question and image, applying Deep Co-Attention Learning to obtain the question and image features, and then fusing the features to produce the answer.
Figure 3: MCAN architecture [4]
The model's accuracy is reported on the VQA-v2 test-std set.
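The difference between Self-Attention and Guided-Attention can be sketched with plain scaled dot-product attention. This is a simplified illustration under stated assumptions: the function names and toy inputs are hypothetical, and the learned projections, multiple heads, feed-forward sublayers, residual connections, and layer normalization of the actual MCAN layers are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def self_attention(x):
    """Each element of x attends to all elements of x."""
    return attention(x, x, x)


def guided_attention(x, y):
    """Elements of x attend to y: queries from x, keys and values from y."""
    return attention(x, y, y)


# Hypothetical toy inputs: 2 question tokens and 3 image regions, dimension 2.
question = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

question_sa = self_attention(question)          # question attends to itself
image_ga = guided_attention(image, question)    # image guided by the question
```

In this sketch the only difference between the two units is where the keys and values come from, which is what allows them to be stacked and combined as modules.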
2.4 ImageBERT architecture
Inspired by BERT [11], Google's well-known architecture in natural language processing, a research group at Microsoft proposed the ImageBERT model [12], which achieved impressive results on multimodal tasks. ImageBERT encodes both the image and the text at the feature-vector extraction layer. The encoded inputs are then passed to multi-head self-attention blocks and trained with several unsupervised training tasks:
- Masked Language Modeling (MLM): the same task as in the original BERT, which predicts the words that have been masked
- Masked Object Classification: a task developed as an extension of MLM
… of previous VQA models
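The input preparation for the Masked Language Modeling task listed above can be sketched as follows. This is a simplified illustration, not ImageBERT's pipeline: the helper `mask_tokens`, the `[MASK]` placeholder string, and the default rate of 0.15 are assumptions (0.15 is the rate reported for the original BERT, whose additional random-replacement and keep-unchanged rules are omitted here).

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens and record the originals as prediction targets.

    Returns (masked, labels): masked has MASK at hidden positions, and
    labels holds the original token there and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)   # the model is trained to recover this token
        else:
            masked.append(token)
            labels.append(None)    # position is not scored
    return masked, labels


tokens = ["what", "color", "is", "the", "cat"]
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

Only the positions whose label is not `None` contribute to the training loss.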
2.5 Evaluation
The papers reviewed above all address the Visual Question Answering problem. They were published in different years, and accuracy has increased considerably from year to year. The table below evaluates the characteristics of the models and gives some results from experiments on the VQA-v2 dataset.
- Fine-Tuning Bottom-Up Features:
Chapter 3: BACKGROUND KNOWLEDGE
3.1 Theoretical foundations
3.1.1 Artificial neural networks
An artificial neural network (ANN) [13] is an information-processing model inspired by the operation of biological nervous systems; it consists of a large number of interconnected neurons that process information.
3.1.1.1 Perceptron model
The Perceptron is the simplest neural network model, with only an input layer and an output layer. It is also called a linear separator and is used to solve linear classification problems. The figure below shows the Perceptron model.
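As a concrete illustration of a linear separator, the sketch below trains a Perceptron on the logical AND function, which is linearly separable. The function names, the step threshold at zero, the learning rate, and the number of epochs are assumptions chosen for this example and are not taken from the thesis.

```python
def perceptron_predict(weights, bias, x):
    """Step activation applied to the weighted sum of the inputs."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total >= 0 else 0


def perceptron_train(samples, labels, lr=1.0, epochs=20):
    """Classic Perceptron rule: adjust the weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - perceptron_predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Logical AND: only the input (1, 1) belongs to the positive class.
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]
weights, bias = perceptron_train(samples, labels)
predictions = [perceptron_predict(weights, bias, x) for x in samples]
```

The same procedure would not converge on a problem such as XOR, which is not linearly separable and needs at least one hidden layer.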
The Multilayer Perceptron is a more general model than the Perceptron. It can solve non-linearly separable problems. It is widely used for object classification and for discovering complex relationships in data, and it serves as the foundation for research on more complex deep learning architectures in computer vision and natural language processing.
The Multilayer Perceptron consists of the following components:
… and applies the activation function
3.1.2 Activation functions
Activation functions are nonlinear functions applied to the outputs of the units (nodes) in the hidden layers of a neural network. Commonly used activation functions include:
- Sigmoid: a nonlinear function that takes real numbers as input and returns a value in the interval (0, 1), which is interpreted as a probability in some problems. The sigmoid function is commonly used to predict the probability of a binary outcome.
Formula: $f_{\mathrm{Softmax}}(x_i) = \dfrac{\exp(x_i)}{\sum_{j} \exp(x_j)}$
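The sigmoid and Softmax functions can be written in a few lines. This is a minimal sketch using only the standard library; the branch in `sigmoid` and the subtraction of the maximum in `softmax` are common numerical-stability measures added here as assumptions, and they do not change the mathematical result.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # safe: x is negative, so exp(x) <= 1
    return z / (1.0 + z)


def softmax(xs):
    """Softmax exp(x_i) / sum_j exp(x_j) over a list of real numbers."""
    peak = max(xs)           # shifting by the maximum leaves the result unchanged
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probability = sigmoid(0.0)
distribution = softmax([1.0, 2.0, 3.0])
```

The sigmoid maps a single score to a probability for a binary outcome, while Softmax turns a vector of scores into a distribution over several classes.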
3.1.3 Loss functions
The loss function, denoted L, is a core component of evaluation. Specifically, a commonly encountered formula is
3.2 Background in natural language processing
3.2.1 Word Embedding
A word embedding is a vector space used to represent the semantic and contextual similarity of data. The inputs to current natural language processing problems usually consist of elements such as words and phrases. Because the length and frequency of the words in a sentence are not uniform, computation becomes difficult; a method is therefore needed to convert all of these elements into a uniform form that a computer can process and that retains as much information as possible.
The simplest method proposed is to use one-hot vectors to bring words into a uniform form in the vector space.
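A one-hot encoding can be sketched as follows. The helper names `build_vocab` and `one_hot` and the example sentence are assumptions made for illustration.

```python
def build_vocab(tokens):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a vector of vocabulary size with a single 1 at the token's index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


tokens = "what color is the cat".split()
vocab = build_vocab(tokens)
encoded = one_hot("is", vocab)
```

Every word gets a vector of the same length, but any two distinct words are equally far apart, so one-hot vectors carry no information about semantic similarity.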
3.2.2 Continuous Bag-of-Words model
The idea of the CBOW model is to predict a target word from the context words that surround it within a given window. Given the target word $w_t$ at position t in the sentence
consisting of C context words, where V is the size of the vocabulary and N is the size of the hidden layer
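Using the symbols C, V, and N from the text, a CBOW forward pass can be sketched as below. This follows the standard word2vec formulation as an assumption, since the thesis's own equations are not available here: the hidden layer is the average of the C context-word embeddings, and a Softmax over V scores gives the probability of each candidate target word. The weights are random toy values and the function name is hypothetical.

```python
import math
import random


def cbow_forward(context_ids, w_in, w_out):
    """Return a probability for each vocabulary word being the target.

    context_ids: indices of the C context words.
    w_in: V x N input embedding matrix; w_out: N x V output matrix.
    """
    n = len(w_in[0])
    v = len(w_out[0])
    # Hidden layer: average of the context-word embeddings (size N).
    hidden = [
        sum(w_in[i][k] for i in context_ids) / len(context_ids)
        for k in range(n)
    ]
    # One score per vocabulary word (size V), followed by a stable softmax.
    scores = [sum(hidden[k] * w_out[k][j] for k in range(n)) for j in range(v)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


V, N, C = 6, 3, 4                      # toy sizes, chosen for illustration
rng = random.Random(0)
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]
context_ids = [0, 1, 3, 4]             # C context words around target index 2
probs = cbow_forward(context_ids, w_in, w_out)
```

Training would adjust `w_in` and `w_out` so that the probability assigned to the true target word increases; the rows of `w_in` are then used as the word embeddings.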
3.2.3 Skip-gram model
Another commonly used model is skip-gram, which takes the target word as input and uses a context word as the expected output to train the neural network. Each training sample is therefore a (target word, context word) pair.
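The construction of skip-gram training samples can be sketched as follows. The function name `skipgram_pairs`, the window size, and the example sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Build (target word, context word) pairs within a symmetric window."""
    pairs = []
    for t, target in enumerate(tokens):
        start = max(0, t - window)
        end = min(len(tokens), t + window + 1)
        for c in range(start, end):
            if c != t:                 # a word is not its own context
                pairs.append((target, tokens[c]))
    return pairs


sentence = ["the", "cat", "sat", "down"]
pairs = skipgram_pairs(sentence, window=1)
```

Each target word contributes one training sample per neighbor, so a larger window yields more pairs and a broader notion of context.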