
Master's thesis in Computer Science: Recognizing modern Japanese magazines by combining deep learning and language models


DOCUMENT INFORMATION

Basic information

Title: Recognizing modern Japanese magazines by combining deep learning and language models
Author: Nguyễn Thiện Nhân
School: Trường Đại học Bách Khoa (University of Technology)
Major: Computer Science
Type: Master's thesis
Year of publication: 2021
City: Ho Chi Minh City
Pages: 74
File size: 1.47 MB

Content


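TRƯỜNG ĐẠI HỌC BÁCH KHOA (UNIVERSITY OF TECHNOLOGY)

NGUYỄN THIỆN NHÂN

RECOGNIZING MODERN JAPANESE MAGAZINES BY COMBINING DEEP LEARNING AND LANGUAGE MODELS
(Nhận diện các tạp chí hiện đại của Nhật Bản bằng cách kết hợp học sâu và mô hình ngôn ngữ)

Major: Computer Science. Code: 8.48.01.01

MASTER'S THESIS

Ho Chi Minh City, January 2021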


THIS WORK WAS COMPLETED AT TRƯỜNG ĐẠI HỌC BÁCH KHOA, ĐHQG-HCM (VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY)


II. TASKS AND CONTENT:

Develop a language model based on deep learning techniques for modern Japanese magazines to improve the accuracy of the current OCR, and combine the outputs of the language model and the OCR system to assess the accuracy of the OCR results.


ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to Assoc. Prof. Dr. Quản Thành Thơ, who directly guided, advised, and encouraged me throughout this thesis. His lectures on Artificial Intelligence and Natural Language Processing broadened my knowledge considerably. He also gave me invaluable advice on both technical matters and career direction. Once again, I thank him wholeheartedly for the knowledge and experience he has passed on, which are a priceless gift from a teacher.

I also sincerely thank all the lecturers in the faculty for their dedicated help in completing this thesis. I thank my family for encouraging me during the work, and my parents and relatives for raising me and for the love that gave me the opportunity I have today. Finally, I thank my colleagues and fellow students for their help and for their feedback during the writing of this Master's thesis.


THESIS SUMMARY

As one of the most culturally rich countries in the world, Japan also has a rich history of magazines. In modern Japanese magazines published during the 19th and 20th centuries, the usage of Japanese is similar to the current style of the language. However, most of these documents have not been digitized and are stored only as images. Because of their importance to Japanese culture, history, and other socio-scientific topics, the problem of using computers to recognize these image-based modern magazines has been studied and widely disseminated through various methods in Deep Learning and Computer Vision. However, these methods and models are still limited in their ability to achieve strong performance in recognizing images of written text, especially uncommon Kanji characters.

The purpose of this research is to develop a deep learning-based language model and integrate it into the character recognition system for modern Japanese magazine documents. The goal is for the system to recognize and automatically extract text from images of modern Japanese magazines accurately, and I aim at the following contributions:

- Develop a language model based on deep learning techniques for modern Japanese magazines to improve the accuracy of the current OCR.

- Propose a strategy for combining the current OCR with my language model. The strategy learns when the system should rely on the OCR to determine the most correct Japanese sentence from a modern magazine (for example, Hiragana and common Kanji characters are recognized correctly by the OCR) and when it should rely on the language model (uncommon Kanji characters are frequently misrecognized by the OCR, so the system should rely on the language model in that case).


ABSTRACT OF THE THESIS

As one of the most culturally rich countries in the world, Japan also has a rich history of magazines. In modern Japanese magazines published during the 19th and 20th centuries, the usage of Japanese is similar to the current style of the language. However, most of those documents have not been digitized and are stored only as images. Because of their importance to Japanese culture, history, and other socio-scientific topics, the problem of using computers to recognize these image-based modern magazines has been studied and widely disseminated through various methods in Deep Learning and Computer Vision. However, these methods and models still fall short of strong performance in recognizing handwriting images, especially uncommon Kanji characters.

The purpose of this research is to develop a deep learning-based language model and integrate it into the current OCR system for modern Japanese magazine documents. The goal is to extract text from those images automatically and accurately, and I envision the following contributions:

- I develop a language model based on deep learning techniques for modern Japanese magazines to improve the accuracy of the current OCR;

- I propose a strategy for combining the current OCR with my language model. The strategy learns when the system should rely on the OCR (e.g., Hiragana and common Kanji characters, which the OCR recognizes correctly) and when it should rely on the language model (e.g., uncommon Kanji characters, which the OCR frequently misrecognizes).
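The idea behind the combination strategy can be illustrated with a minimal sketch. This is not the thesis's method: the abstract says the strategy is learned, whereas the sketch below uses a fixed confidence threshold and a weighted sum of log-scores, both of which are assumptions made for illustration. The function name `pick_character` and the parameters `ocr_trust` and `lm_weight` are hypothetical.

```python
import math


def pick_character(ocr_candidates, lm_probs, ocr_trust=0.9, lm_weight=0.5):
    """Pick one character, consulting a language model only when OCR is unsure.

    ocr_candidates: dict of candidate character -> OCR confidence in [0, 1]
    lm_probs:       dict of candidate character -> language-model probability
                    of that character in its sentence context
    ocr_trust and lm_weight are illustrative values, not taken from the thesis.
    """
    best_char, best_conf = max(ocr_candidates.items(), key=lambda kv: kv[1])

    # Confident OCR output (typical for Hiragana and common Kanji): keep it.
    if best_conf >= ocr_trust:
        return best_char

    # Uncertain OCR output (typical for uncommon Kanji): re-rank the OCR
    # candidates by a weighted sum of OCR and language-model log-scores.
    def combined(char):
        ocr_score = math.log(max(ocr_candidates[char], 1e-9))
        lm_score = math.log(max(lm_probs.get(char, 0.0), 1e-9))
        return (1 - lm_weight) * ocr_score + lm_weight * lm_score

    return max(ocr_candidates, key=combined)


# OCR is confident, so the language model is ignored.
print(pick_character({"の": 0.98, "め": 0.02}, {"の": 0.1, "め": 0.9}))
# OCR is unsure between two rare Kanji, so the language model decides.
print(pick_character({"鬱": 0.40, "轡": 0.45}, {"鬱": 0.8, "轡": 0.01}))
```

Working in log space keeps the two scores on a comparable scale and prevents a near-zero probability from either source being hidden by plain averaging.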


DECLARATION

I declare that, except for results cited from other works as clearly noted in the thesis, the work presented in this thesis was carried out by me, and that no part of this thesis has been submitted for a degree at this or any other institution. If this statement is untrue, I take full responsibility for my thesis.

Declarant

Nguyễn Thiện Nhân


TABLE OF CONTENTS

MASTER'S THESIS ASSIGNMENT i

ACKNOWLEDGEMENTS i

THESIS SUMMARY iii

ABSTRACT OF THE THESIS iv

DECLARATION v

TABLE OF CONTENTS vi

LIST OF FIGURES ix

LIST OF TABLES xi

I INTRODUCTION 1

1 Overview 1

2 Challenges of the topic 2

3 Research objectives 3

4 Scope and subjects of the research 4

5 Research outputs 5

II RELATED WORK 5

III THEORETICAL BACKGROUND 10

1 OCR - Optical Character Recognition 10

1.1 Introduction to the Tesseract OCR Engine 12

1.2 Architecture of Tesseract 13

1.3 How Tesseract works 13

1.4 Some experiments 15

2 Language models 17

2.1 Overview of recurrent networks: RNN and LSTM 17

2.2 Overview of the Word2Vec technique in language models 18


3 BERT (Bidirectional Encoder Representations from Transformers) 19

3.1 Overview 20

3.2 Why BERT is needed 20

3.3 The core idea of BERT 21

3.4 How BERT works 22

3.5 Using BERT through fine-tuning 27

4 The Hugging Face Transformers and Datasets libraries 29

4.1 Datasets 29

4.2 Transformers 30

IV RESEARCH METHODOLOGY 31

1 Data collection method 31

1.1 Extracting Japanese data from the Aozora Bunko corpus in plain-text format 31

1.2 Extracting Japanese data from the Aozora Bunko corpus in Extensible Markup Language (XML) format 33

1.3 Merging the data extracted from the text and XML Aozora Bunko corpora into one large dataset 34

1.4 Extracting result data from the OCR system 36

1.5 Building the vocabulary for the deep learning model from the text data above 38

2 Method for building the deep learning model 39

2.1 Building a tokenizer for Japanese 40

2.2 Building a BERT model for the masked-word prediction task (Masked LM) 42

2.3 Training BERT on the Japanese dataset 44

3 Combining the results of the deep learning model with the results of the current OCR system 44

3.1 Scoring a sentence with the trained Japanese BERT model 44


3.2 Combining the sentence score computed by the trained model with the results of the current OCR system 48

V EVALUATION OF RESEARCH RESULTS 51

1 Evaluating the results with the Word Error Rate metric 51

1.1 Introduction to Character Error Rate (CER) and Character Accuracy (CAcc) 51

1.2 Evaluating the research results with CER and CAcc 52

1.3 Comparing CER and CAcc between the BERT model and the LSTM language model 55

VI CONCLUSION 57

1 Summary of the thesis 57

2 Limitations of the research 58

REFERENCES 59

BRIEF CURRICULUM VITAE 61


LIST OF FIGURES

Figure I- Examples of Japanese magazines from the 19th and 20th centuries 1

Figure III- Architecture of Tesseract 13

Figure III- Example of a curved baseline 14

Figure III- Example of splitting joined characters 14

Figure III- The word recognition process 15

Figure III- Example of a typewritten-character image 15

Figure III- Example of a typewritten-text image 16

Figure III- Example of an RNN with "Hello" 18

Figure III- Input representation for BERT 23

Figure III- Architecture of Masked LM 24

Figure III- Example of the Next Sentence Prediction task in BERT 25

Figure III- Architecture of the Next Sentence Prediction task 26

Figure III- Example of the Question-Answering application in BERT 28

Figure IV- Example of an original Aozora Bunko dataset in text format 32

Figure IV- The extracted Aozora Bunko dataset in text format 32

Figure IV- Example of an original Aozora Bunko dataset in XML format 33

Figure IV- The extracted Aozora Bunko dataset in XML format 34

Figure IV- The extracted dataset 35

Figure IV- The dataset after line splitting (lines of …-… characters) 35

Figure IV- Example of an OCR input dataset (hex format…

Uploaded: 03/08/2024, 13:17
