VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
TRAN TIEN HUNG
GRADUATION THESIS
UNSUPERVISED DOMAIN ADAPTATION IN
SCENE TEXT RECOGNITION USING
ENTROPY MINIMIZATION
BACHELOR’S DEGREE OF COMPUTER SCIENCE
HO CHI MINH CITY, 2023
VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
TRAN TIEN HUNG - 19521587
GRADUATION THESIS
UNSUPERVISED DOMAIN ADAPTATION IN
SCENE TEXT RECOGNITION USING
ENTROPY MINIMIZATION
BACHELOR’S DEGREE OF COMPUTER SCIENCE
SUPERVISOR
PhD NGO DUC THANH
HO CHI MINH CITY, 2023
THESIS COMMITTEE MEMBERS

The graduation thesis committee was established under Decision No. ... dated ... by the principal of the University of Information Technology — VNU-HCM.

1. .................................................. — Chairperson
2. .................................................. — Secretary
3. .................................................. — Commissioner
4. .................................................. — Commissioner
We are also thankful for our research group's support and encouragement. Especially, we would like to thank Khanh, Phuong, and Tan for their outstanding assistance and technical advice during the project's development.

Finally, we would like to express our gratitude to our family and friends, as well as Tu, for their support and spiritual assistance during the difficult phase.

All things considered, this project could not have been achieved without the support of everyone listed and involved. Thank you for your encouragement and support.
Contents

Abstract

1 Introduction
  1.1 Scene Text Recognition
    1.1.1 Definition
      1.1.1.1 STR in Regular benchmarks
      1.1.1.2 STR in Irregular benchmarks
    1.1.2 Applications
      1.1.2.1 Intelligent transportation
      1.1.2.2 Diverse translation
      1.1.2.3 Self-driving system
      1.1.2.4 Replacement for the disabled community
      1.1.2.5 Information Extraction
    1.1.3 Challenges and recent research
      1.1.3.1 Challenges in Scene Text Recognition
  1.2 Unsupervised Domain Adaptation
    1.2.2 Unsupervised Domain Adaptation in Scene Text Recognition
  1.3 Requirement and Contribution
    1.3.1 Requirement
    1.3.2 Contribution
  1.4 Thesis Structure

2 Prior Research
  2.1 Related works
    2.1.1 Scene text recognition
      2.1.1.1 Transformation stage
      2.1.1.2 Feature extraction stage
      2.1.1.3 Sequence modeling stage
      2.1.1.4 Prediction stage
    2.1.2 Unsupervised domain adaptation in Scene text recognition
  2.2 SMILE
    2.2.1 Scene Text Recognition with Attention-based Model
    2.2.2 Sequence-to-sequence UDA with Minimizing Latent Entropy
    2.2.3 Class-balanced Self-paced Learning
  2.3 SU-FOCALID
    2.3.1 Methodology
      2.3.1.3 Adapting Stage
    2.3.2 Baseline
  2.4 Implementation details

3 Experiment implementation and comparison results

4 Thesis conclusion
List of Figures

1.1 Scene Text Recognition visualization
1.2 Optical Character Recognition visualization
1.3 Scene Text Recognition examples
1.4 Scene Text Recognition input
1.5 Scene Text Recognition output
1.6 Regular dataset in [19]
1.7 Example of irregular dataset. a) Perspective text, samples from SVTP [33]. b) Curved text, taken from CUTE80 [35]
1.8 Traffic sign recognition
1.10 Semantic recognition in STR
1.11 Voice assisted text reading system for visually impaired person
1.12 Document information extraction
1.13 Imperfect imaging conditions
1.14 Blurry, distortion and geometric transformation
1.15 Perspective, shear and small text images
2.1 Visualization of [2] STR frameworks
2.2 SU-FOCALID overall architecture. F is the decoder with the focal mechanism, with green and red arrows indicating weight control. The brown dotted line indicates each block shares weights with the other. c is the predicted character in a sequence. L_pred is the classification loss and L_guide is the entropy minimization loss
2.3 Class distributions in synthetic dataset including MJ and ST
2.4 Class distributions in real-world dataset including SVT, IC03, IC13, IC15, CUTE, IIIT5k
2.5 MJSynth samples generation process
2.6 SynthText samples
3.1 Bar plot for accuracy comparison
3.2 Flowchart of hyper-parameters optimization. Pink boxes are α values, blue boxes are γ values, green boxes are official benchmarks
3.3 Model loss comparisons with synthetic as source dataset and real-world as target dataset
3.4 Entropy minimization loss with focal entropy and cross-entropy
3.5 Classification loss comparisons with synthetic as source dataset and real-world as target dataset with multiple trials
3.6 Model loss for optimized (α, γ) = (0.592, 4) as focal entropy and γ = 2 for focal loss, compared to cross-entropy loss and entropy loss as decoder loss (L_pred) and entropy minimization loss (L_guide) respectively
3.7 Entropy minimization loss L_guide for optimized focal-based and cross-entropy based models
3.8 Classification loss L_pred for focal-based and cross-entropy based models
3.9 Comparison between SU-FOCALID and SMILE
List of Tables

3.1 α and γ values
3.2 Comparison with UDA methods on regular benchmarks. Bold is the highest value and underline is the second-highest value
List of Abbreviations
1 CNN Convolutional Neural Networks
2 OCR Optical Character Recognition
3 STR Scene Text Recognition
4 EM Expectation Maximization
5 SOTA State Of The Art
6 STN Spatial Transformer Networks
7 CTC Connectionist Temporal Classification
8 UDA Unsupervised Domain Adaptation
9 SU-FOCALID Sequence-to-sequence Unsupervised domain
adaptation with FOCAL on Imbalance Distribution
10 LSTM Long Short Term Memory
11 BiLSTM Bi-directional Long Short Term Memory
12 RNN Recurrent Neural Network
Abstract

Scene Text Recognition is a subproblem of Optical Character Recognition that will be the focus of this thesis. In recent years, numerous Scene Text Recognition approaches have been developed. Since the amount of real data is not significant and the labeling process can be time-consuming for Scene Text Recognition, the most common training method involves training a model on synthetic data and then predicting on real data. Yet, this can result in domain shifts between synthetic and real images. In addition, each real-world benchmark has its own unique characteristics, including perspective, curved text, contrast, brightness, etc., resulting in poor performance and inaccurate predictions.

To address these limitations, Unsupervised Domain Adaptation (UDA) has been proposed, which can reduce the disparity between source domain datasets and target domain datasets. With pseudo labels generated by a source-labeled dataset, we may utilize Expectation Maximization methods in conjunction with Entropy Minimization to produce predictions with high confidence, hence reducing the discrepancy of an unlabeled target domain dataset. In addition, by employing class-wise self-paced balance, Unsupervised Domain Adaptation can pick a high-confidence portion of pseudo labels for the training set, thereby improving the adaptation process.

In order to implement the described concept, we have modified and validated multiple models in Unsupervised Domain Adaptation and proposed a sequence-to-sequence unsupervised domain adaptation with focal against imbalance distribution (SU-FOCALID), based on a Scene Text Recognition framework that applies adaptation by minimizing latent entropy in pseudo labels generated by its decoder in order to strengthen predictions from the unlabeled target dataset. SU-FOCALID will be evaluated on official scene text recognition benchmarks against prior UDA methods. Our main contributions include:

• Presenting the main concept of Unsupervised Domain Adaptation for Scene Text Recognition using Entropy Minimization.

• Presenting official Scene Text Recognition benchmarks and training datasets.

• Proposing a sequence-to-sequence unsupervised domain adaptation against imbalance distribution (SU-FOCALID).

Keywords: scene text recognition, unsupervised domain adaptation, entropy minimization, deep learning, domain shift
Chapter 1
Introduction
Text has always been an essential aspect of human life, and its application has benefited humanity throughout human evolution. Text is a system of symbols used for recording and intercultural communication. The rich and precise semantic information carried by text is used in many applications such as image search [47], intelligent inspection [6], industrial automation [52], robot navigation [9] and instant translation [24]. Consequently, recognizing text for applications in the real world is a crucial task for computer vision in order to enhance technology. Scene Text Recognition, often known as text recognition in natural scenes, is a major branch of Optical Character Recognition in the Computer Vision field. Despite the fact that text recognition in scanned documents is extensively developed, Scene Text Recognition remains difficult due to numerous real-world factors, such as complicated backgrounds, diverse typefaces, diverse text positions, and bad image conditions. Early studies [52], mainly based on hand-crafted features, had low efficiency and resulted in poor performance. Recently, deep learning has demonstrated promising results on numerous benchmarks, and proposals have been introduced alongside competitive State Of The Art results over the years. Since then, numerous approaches are based on neural networks with additional techniques that represent their benefits.
Also, training strategies and learning methods have contributed to the enhancement of model performance.

However, Scene Text Recognition models require a large amount of data in order to perform well, while the labeling process is time-consuming, hence the lack of human-labeled real-world data. To counteract the lack of data, synthetic data [18] have been presented, creating the main training pattern for future STR research. This method involves training the model on synthetic data, followed by validation or tuning on real-world data. Prior research focused mostly on modifying architecture rather than training on diverse datasets [2]. [2] proposed a main STR model with four stages: transformation, feature extraction, sequence modeling, and prediction. The majority of modern scene text recognition models adhere to this scheme. Another method is data-centric, and suggests that a model trained on one domain may perform badly when presented with data from a different domain, a phenomenon known as domain shift. A model trained on all domains can perform well across domains and can reduce the need for human-labeled data. Recent interesting domains are handwritten, real-world, printed document, and synthetic; based on these domains, we can use one single domain to validate across domains, or use a union of cross-domain data. This created a learning strategy called Unsupervised Domain Adaptation.

In this thesis, we will focus on training from a source domain to a target domain and validating across domains using Unsupervised Domain Adaptation for Scene Text Recognition. Using Expectation Maximization techniques and clustering methods to obtain the entropy distribution, [13] recommended semi-supervised learning to deliver high-confidence predictions in deep learning models, with low entropy indicating the model is confident with one sample, and vice versa. Consequently, it can be utilized as a pseudo-label in the training process. We reproduce results from a method called sequence-to-sequence on minimizing latent entropy (SMILE) to observe the adaptation process, along with proposing a new method called Sequence-to-sequence unsupervised domain adaptation against imbalance distribution (SU-FOCALID).
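To make the entropy-as-confidence idea concrete, the sketch below computes the Shannon entropy of a decoder's per-character softmax outputs. The distributions are made-up toy values, not outputs of the thesis model:

```python
import math

def shannon_entropy(probs):
    """Entropy of one softmax distribution; lower entropy = higher confidence."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sequence_entropy(char_distributions):
    """Average per-character entropy over a predicted character sequence."""
    return sum(shannon_entropy(d) for d in char_distributions) / len(char_distributions)

# Toy per-character softmax outputs (illustrative values only):
confident = [[0.97, 0.01, 0.01, 0.01], [0.95, 0.02, 0.02, 0.01]]  # peaked
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]  # spread out

# The confident prediction has lower entropy, so it makes a safer pseudo-label.
assert sequence_entropy(confident) < sequence_entropy(uncertain)
```

A pseudo-labeling scheme in this spirit would keep only target-domain predictions whose sequence entropy falls below a threshold.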
1.1 Scene Text Recognition
1.1.1 Definition
Text has been used by humanity as a way to communicate and to document culture, knowledge, history, and accomplishments. The imaging technologies of the twenty-first century have progressed steadily, with more sophisticated equipment (cameras, smartphones) for capturing high-quality photographs. As a result, text in images has grown increasingly popular in the field of Computer Vision, as the precise and rich information conveyed by text is crucial in many vision-based application scenarios. Recognizing text in natural scenes has become an active study subject in computer vision and pattern recognition. Yet, extracting text from natural scenes and using it in another process is a hard task with several fundamental concerns and problems.
FIGURE 1.1: Scene Text Recognition visualization.
FIGURE 1.2: Optical Character Recognition visualization.
Text recognition is divided into two scenarios: Optical Character Recognition on scanned documents, and Scene Text Recognition for the latter. Both of these can be distinguished by different aspects suggested by [7], such as background, font, form, noise and access. Specifically, we will be focusing on Scene Text Recognition scenarios later on; the differences can be seen below:
FIGURE 1.3: Scene Text Recognition examples.
• Background: OCR in scanned documents has a white backdrop and is less noisy; the presence of a mark depends on the document's content. Text in natural scenes, on the other hand, may contain many objects and noise in the backdrop, such as signs, boards, people, animals, and vehicles, which can make the image more complex and difficult to detect. In addition, the background may visually resemble text, which might make recognition much more difficult.

• Font: Documents that have been scanned typically have a single font for all of their information, along with a uniform font size, making them easy to recognize. Unlike scanned documents, the fonts in natural scenes vary depending on the images used. Generally, scene text fonts might be difficult to identify due to their artistic nature, making the recognition process more difficult.

• Form: Text in scanned documents is printed in a uniform arrangement, whereas scene text appears in diverse orientations. The diversity of text makes STR more difficult and challenging than OCR in scanned documents.

• Noise: Text in natural scenes can be impacted by noise interferences such as nonuniform illumination, low resolution, motion blurring, and coloring, causing Scene Text Recognition to fail under imperfect imaging conditions.

• Access: Documents that have been scanned are captured front-facing and appear complete as images. Unfortunately, text in natural scenes is collected at random, resulting in various geometric distortions such as perspective, shear, and scale. Diverse text shapes increase the difficulty of character recognition and have an effect on the final output.
In this chapter, we will focus on recognizing text in natural scenes and describe its fundamental problems and issues, along with present solutions and future research. We will recognize an image which contains text in natural scenes; the result will be a string of characters containing the content of that text in the image.

• Input: An image containing text.

• Output: Recognized text in the image.
FIGURE 1.5: Scene Text Recognition output.
Several competitions have been organized in the research field to validate alternative approaches and methods, thereby establishing a measurement for the State Of The Art on those real-world standards. Researchers have evaluated and distinguished benchmarks based on their properties. In conclusion, seven main real-world benchmarks in STR are divided into two categories:

• Regular benchmarks

• Irregular benchmarks

On the foundation of these two categories, a model's performance and inference time can be validated. Due to these categories, numerous models have been characterized by their benefits and advantages as well as their limitations and drawbacks. Each of these categories is defined by its level of difficulty and its text's geometric layout. We will go through these categories more specifically later on.
specif-1.1.1.1 STR in Regular benchmarks
Regular benchmarks contain text images with horizontally laid out characters that have even spacing between them. The images are usually captured frontally and in decent imaging conditions, and are considered relatively easy cases for STR, as described below:
• IIIT5K-Words (IIIT): is a dataset crawled from Google image searches, with query words that are likely to return text images, such as "billboards", "signboard", "house numbers", "house name plates", and "movie posters". IIIT consists of 2,000 images for training and 3,000 images for evaluation.

• Street View Text (SVT): contains outdoor street images collected from Google Street View. Some of these images are noisy, blurry, or of low resolution. SVT consists of 257 images for training and 647 images for evaluation.

• ICDAR2003 (IC03): was created for the ICDAR 2003 Robust Reading competition for reading camera-captured scene texts. It contains 1,156 images for training and 1,110 images for evaluation. Ignoring all words that are either too short (less than 3 characters) or ones that contain non-alphanumeric characters reduces 1,110 images to 867. However, researchers have used two different versions of the dataset for evaluation: versions with 860 and 867 images. The 860-image dataset is missing 7 word boxes compared to the 867 dataset. The omitted word boxes can be found in the supplementary materials.

• ICDAR2013 (IC13): inherits most of IC03's images and was also created for the ICDAR 2013 Robust Reading competition. It contains 848 images for training and 1,095 images for evaluation, where pruning words with non-alphanumeric characters results in 1,015 images. Again, researchers have used two different versions for evaluation: 857 and 1,015 images. The 857-image set is a subset of the 1,015 set where words shorter than 3 characters are pruned.
1.1.1.2 STR in Irregular benchmarks

• ICDAR2015 (IC15): was created for the ICDAR 2015 Robust Reading competition and contains 4,468 images for training and 2,077 images for evaluation. The images are captured by Google Glasses while under the natural movements of the wearer. Thus, many are noisy, blurry, and rotated, and some are also of low resolution. Again, researchers have used two different versions for evaluation: 1,811 and 2,077 images, discarding non-alphanumeric character images and some extremely rotated, perspective-shifted, and curved images for evaluation. Some of the discarded word boxes can be found in the supplementary materials.

• SVT Perspective (SVTP): is collected from Google Street View and contains 645 images for evaluation. Many of the images contain perspective projections due to the prevalence of non-frontal viewpoints.

• CUTE80: is collected from natural scenes and contains 288 cropped images for evaluation. Many of these are curved text images.
FIGURE 1.7: Example of irregular dataset. a) Perspective text, samples from SVTP [33]. b) Curved text, taken from CUTE80 [35].
1.1.2 Applications
Scene Text Recognition plays a significant part in the development of autonomous systems, the improvement of life for the disabled population, and the medical profession. Hence, researchers and businesses may have an interest in improving and developing OCR systems.
1.1.2.1 Intelligent transportation
Building an automatic traffic system is not only advantageous for traffic, but it also allows drivers to overcome language barriers by, for example, automatically detecting road signs and traffic instructions.
FIGURE 1.8: Traffic sign recognition.
Traveling can be difficult for foreigners when a language barrier occurs; traffic road signs, product names, or even menus in restaurants can make travelers feel uncomfortable. Therefore, STR can help overcome the language problem and improve the experience of travelers without the local language [60].
1.1.2.2 Diverse translation
Translation has traditionally been the diplomatic option, but both the translator and the subject must be fluent in the target language in order to communicate effectively. In spite of this, human effort is still required to solve this particular challenge. In contrast, text recognition can circumvent this issue and eliminate the requirement for human-based translation. Moreover, text recognition may be applied not only to text-to-text, but also to image-to-text and audio-to-text, hence the name diverse translation. Text recognition can produce a description based on multiple inputs, including images and audio recordings. More importantly, while humans can perform poorly on imperfect visual images, recognizing text can support enhancing the recognition capability [43].
1.1.2.3 Self-driving system
Recent developments in technology 4.0 have led to the introduction of automatic systems, including self-driving systems. Using a camera-based recognition system can significantly enhance development. The ability to recognize road signs and traffic instructions can aid in navigation and potentially avoid fatal accidents. Text recognition can serve as the 'eyes' in 'eyes and ears' for self-driving vehicles, which is a critical component for many sectors.
FIGURE 1.10: Semantic recognition in STR.
Besides the self-driving system, other automatic systems can inherit this as well, as long as a camera-based approach can be applied. Whether in a production chain or in security surveillance, text recognition provides visual information and therefore improves the overall performance [16].
1.1.2.4 Replacement for the disabled community
Another application where text recognition can be of use is to assist a disabled person, specifically a blind person. They say "The eyes are the window into the soul" (Perspectives on Psychological Science, a journal of the Association for Psychological Science); text recognition can be a replacement for a person's 'eyes'. This interest has attracted many researchers in the medical field and seems to be a potential solution [4].
FIGURE 1.11: Voice assisted text reading system for visually impaired person.
1.1.2.5 Information Extraction
In some instances, document censoring is essential for the development of security-related automatic systems that play a vital role in numerous businesses. Bad imaging conditions might be difficult on the eyes. Despite the complexity of the imaging environment, the censorship process can run easily with the aid of text recognition. Regardless, text recognition can replace human efforts in document censorship and will become a potential solution for many industries [54].
1.1.3 Challenges and recent research
1.1.3.1 Challenges in Scene Text Recognition
The reason why STR has become an attractive interest for researchers and industries is because of its challenges and issues. Ever since the OCR system was first introduced, many sub-problems in STR still remain; some of them include:
FIGURE 1.12: Document information extraction.
• Text enhancement: Text enhancement can increase the quality of the image, making the text easier to recognize. This technique includes recovering blurry images, improving image resolution [54], removing distortion of text, or removing the background [28]. Many algorithms have been introduced and given positive results, such as deconvolution [56] or sparse reconstruction [57].

• Text tracking: Text tracking is employed to preserve the integrity of text localization and to track text in adjacent video frames. Whereas image text has a static background, video text may be merged with the background and noise. This makes recognition considerably more difficult.

• Natural language processing: Apart from the first two challenges, which relate to imaging conditions and tracking, natural language processing is an interesting topic in computer vision. The purpose is to make the computer understand and manipulate natural language text or speech. There is a wide range of text-based applications of NLP, including machine translation [3], automatic summarization [39], question answering [1] and relationship extraction [55].
Many problems still need to be worked on in Scene Text Recognition, with each proposal having related works on a specific problem, contributing to overcoming these challenges. However, with the development of STR progress, new challenges arise over time.
1.2 Unsupervised Domain Adaptation

Supervised learning helped the model achieve generalization and perform well on validation data. In another approach, semi-supervised learning has been introduced to the learning scheme. By using labeled data and unlabeled data, semi-supervised learning can increase the overall performance across multiple domains by reducing the discrepancy distance between the two domains. The idea to create this uni-domain model is called Unsupervised Domain Adaptation.
1.2.2 Unsupervised Domain Adaptation in Scene Text Recognition
Semi-supervised domain adaptation (SSDA) is an important task [10], [58]. However, it has not been fully explored with regard to deep learning methods. The main challenge in domain adaptation is the gap in feature distribution between domains, which degrades the source data performance if the gap is big. This creates a phenomenon called domain shift, which can damage the adapting process. One approach measures the discrepancy distance between the two distributions in the source and target datasets respectively, then trains a model to minimize that distance.

In this thesis, we will apply Unsupervised Domain Adaptation to Scene Text Recognition by minimizing latent entropy as the discrepancy measurement, called Sequence-to-sequence unsupervised domain adaptation with focal against imbalance distribution (SU-FOCALID), and we will compare results to another method called Sequence-to-sequence on Minimizing Latent Entropy (SMILE).
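As a rough illustration of the contrast between a plain entropy objective and a focal-weighted one, the sketch below applies a (1 − p)^γ factor in the spirit of focal loss to down-weight already-confident terms. The exact objective used by SU-FOCALID is defined in Chapter 2, so treat this only as the general idea, with made-up probabilities:

```python
import math

def entropy_loss(probs):
    """Plain entropy minimization term for one softmax distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def focal_entropy_loss(probs, gamma=2.0):
    """Entropy with a focal-style (1 - p)^gamma modulating factor that
    shrinks the contribution of high-probability terms (illustrative only)."""
    return -sum(((1.0 - p) ** gamma) * p * math.log(p) for p in probs if p > 0)

confident = [0.90, 0.05, 0.05]  # made-up distributions
uncertain = [0.40, 0.30, 0.30]

# The modulating factor is at most 1, so the focal variant never exceeds
# the plain entropy term.
assert focal_entropy_loss(confident) < entropy_loss(confident)
assert focal_entropy_loss(uncertain) < entropy_loss(uncertain)
```

With γ = 0 the modulating factor becomes 1 and the two objectives coincide, which is one way to sanity-check an implementation.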
1.3 Requirement and Contribution
1.3.1 Requirement
In this thesis, we have created certain goals in order to validate our method, including:

• Research academic papers related to UDA, STR, Entropy Minimization, and Expectation Maximization algorithms.

• Collect official scene text recognition benchmarks for validation.

• Compare method results to SMILE.

• Propose future research on methods or subjects related to UDA for STR.
1.3.2 Contribution
Our main contributions include:

• Defining Unsupervised Domain Adaptation in Scene Text Recognition using Entropy Minimization.

• Proposing a sequence-to-sequence unsupervised domain adaptation with focal against imbalance distribution (SU-FOCALID).

• Performing a results comparison with the proposed method.
1.4 Thesis Structure
Chapter 1: Introduction
Chapter 2: Prior research and SU-FOCALID definition
Chapter 3: Experiment implementation and comparison results
Chapter 4: Thesis conclusion
Chapter 2
Prior Research
2.1 Related works
2.1.1 Scene text recognition
Scene text recognition is a popular interest that has developed from basic CNNs to large-scale deep learning structures. While more and more methods are proposed, challenges arise along the way, keeping the topic interesting.

The most straightforward pipeline includes 3 stages: feature extraction, sequence modeling, and prediction.
• Feature extraction: Using traditional or sophisticated CNN structures to extract image features. CNNs have proven promising results over hand-crafted features such as Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP). Some of them have even gained reputation, such as ResNet [15], EfficientNet [45], and DenseNet [16].

• Sequence modeling: The extracted features from the feature extraction stage are reshaped into a sequence of features. The sequence features can be created by using LSTM, BiLSTM, or RNN layers [40] in the structure.

• Prediction: The final stage translates sequence features into characters or a probability distribution over characters; this is the final output of the pipeline. CTC [7] is the most straightforward technique; nowadays researchers tend to use the attention mechanism [49] for its efficiency and fast computation.
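To make the CTC prediction stage concrete, here is a minimal greedy (best-path) CTC decoder: take the most likely character at each time step, collapse repeats, then drop blanks. This is a generic sketch of the standard algorithm, not code from the thesis implementation:

```python
BLANK = "-"  # stand-in for the CTC blank symbol

def greedy_ctc_decode(frame_chars):
    """Best-path CTC decoding: collapse repeated characters, then remove blanks."""
    collapsed = []
    prev = None
    for c in frame_chars:
        if c != prev:          # drop consecutive duplicates
            collapsed.append(c)
        prev = c
    return "".join(c for c in collapsed if c != BLANK)

# Per-frame argmax characters from a hypothetical recognizer:
assert greedy_ctc_decode(list("hh-e-ll-ll-oo")) == "hello"
```

The blank symbol is what lets CTC emit genuinely repeated letters, as in the double "l" above, since a blank separates the two runs.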
Though the mentioned framework is common, most methods follow it with their own distinguishing techniques. One line of work used the mentioned framework combined with a 2D attention map, comprised of a feature map and sequence features, to guide the prediction; another suggested a 2D feature map with positional encoding, allowing the model to recognize text in arbitrary shapes; another proposed using multiple BiLSTM layers with selective decoders to enhance prediction. On the other hand, other proposals suggested their own frameworks: adapting BERT's Mask Language Module and Visual Reasoning Module to provide predictions, providing dictionary guidance to sharpen language predictions, or proposing an iterative bi-directional language model to correct wrong predictions.
In this thesis, we will be using the STR framework provided by [2] for comparison and for model weights. With their universal four-stage STR framework, we can compare our method with most of the STR proposals. As mentioned, [2] divided a STR model into four stages:

• Transformation (Trans.)

• Feature Extraction (Feat.)

• Sequence modeling (Seq.)

• Prediction (Pred.)
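This four-stage decomposition can be viewed as a simple composition of stage functions. The stages below are trivial stand-ins (a real system would use a TPS transformation, a CNN backbone, BiLSTM layers, and a CTC or attention decoder); they are assumed purely for illustration:

```python
def run_str_pipeline(image, trans, feat, seq, pred):
    """Four-stage STR framework of [2]: Trans. -> Feat. -> Seq. -> Pred."""
    return pred(seq(feat(trans(image))))

# Placeholder stages standing in for TPS / CNN / BiLSTM / decoder:
trans = lambda img: img                         # normalize geometry
feat  = lambda img: [ord(c) for c in img]       # extract per-position "features"
seq   = lambda fs: fs                           # contextualize the sequence
pred  = lambda fs: "".join(chr(f) for f in fs)  # decode features to characters

assert run_str_pipeline("text", trans, feat, seq, pred) == "text"
```

Because each stage is an independent function, individual modules (for example, the prediction head) can be swapped without touching the rest of the pipeline, which is what makes the comparisons in [2] possible.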