VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
TRAN TIEN HUNG
GRADUATION THESIS
UNSUPERVISED DOMAIN ADAPTATION IN
SCENE TEXT RECOGNITION USING
ENTROPY MINIMIZATION
BACHELOR’S DEGREE OF COMPUTER SCIENCE
HO CHI MINH CITY, 2023
VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
TRAN TIEN HUNG - 19521587
GRADUATION THESIS
UNSUPERVISED DOMAIN ADAPTATION IN
SCENE TEXT RECOGNITION USING
ENTROPY MINIMIZATION
BACHELOR’S DEGREE OF COMPUTER SCIENCE
SUPERVISOR
PhD NGO DUC THANH
HO CHI MINH CITY, 2023
THESIS COMMITTEE MEMBERS

The graduation thesis committee was established under Decision No. ... dated ... by the principal of the University of Information Technology — VNU-HCM.

1. .................................................. — Chairperson
2. .................................................. — Secretary
3. .................................................. — Commissioner
4. .................................................. — Commissioner
We are also thankful for our research group's support and encouragement. Especially, we would like to thank Khanh, Phuong, and Tan for their outstanding assistance and technical advice during the project's development.

Finally, we would like to express our gratitude to our family and friends, as well as Tu, for their support and spiritual assistance during the difficult phase.

All things considered, this project could not have been achieved without the support of everyone listed and involved. Thank you for your encouragement and support.
Contents

Abstract

1 Introduction
  1.1 Scene Text Recognition
    1.1.1 Definition
      1.1.1.1 STR in Regular benchmarks
      1.1.1.2 STR in Irregular benchmarks
    1.1.2 Applications
      1.1.2.1 Intelligent transportation
      1.1.2.2 Diverse translation
      1.1.2.3 Self-driving system
      1.1.2.4 Replacement for the disabled community
      1.1.2.5 Information Extraction
    1.1.3 Challenges and recent research
      1.1.3.1 Challenges in Scene Text Recognition
  1.2 Unsupervised Domain Adaptation
    1.2.2 Unsupervised Domain Adaptation in Scene Text Recognition
  1.3 Requirement and Contribution
    1.3.1 Requirement
    1.3.2 Contribution
  1.4 Thesis Structure

2 Prior Research
  2.1 Related works
    2.1.1 Scene text recognition
      2.1.1.1 Transformation stage
      2.1.1.2 Feature extraction stage
      2.1.1.3 Sequence modeling stage
      2.1.1.4 Prediction stage
    2.1.2 Unsupervised domain adaptation in Scene text recognition
  2.2 SMILE
    2.2.1 Scene Text Recognition with Attention-based Model
    2.2.2 Sequence-to-sequence UDA with Minimizing Latent Entropy
    2.2.3 Class-balanced Self-paced Learning
  2.3 SU-FOCALID
    2.3.1 Methodology
      2.3.1.3 Adapting Stage
    2.3.2 Baseline
  2.4 Implementation details

3 Experiment implementation and comparison results

4 Thesis conclusion
List of Figures

1.1 Scene Text Recognition visualization
1.2 Optical Character Recognition visualization
1.3 Scene Text Recognition examples
1.4 Scene Text Recognition input
1.5 Scene Text Recognition output
1.6 Regular dataset in [19]
1.7 Example of irregular dataset. a) Perspective text, samples from SVTP [33]. b) Curved text, taken from CUTE80 [35]
1.8 Traffic sign recognition
1.10 Semantic recognition in STR
1.11 Voice assisted text reading system for visually impaired person
1.12 Document information extraction
1.13 Imperfect imaging conditions
1.14 Blurry, distortion and geometric transformation
1.15 Perspective, shear and small text images
2.1 Visualization of [2] STR frameworks
2.2 SU-FOCALID overall architecture. F is the decoder with the focal mechanism, with green and red arrows indicating weight control. The brown dotted line indicates each block shares weights with the other. c is the predicted character in a sequence. L_pred is the classification loss and L_guide is the entropy minimization loss
2.3 Class distributions in synthetic dataset including MJ and ST
2.4 Class distributions in real-world dataset including SVT, IC03, IC13, IC15, CUTE, IIIT5k
2.5 MJSynth samples generation process
2.6 SynthText samples
3.1 Bar plot for accuracy comparison
3.2 Flowchart of hyper-parameters optimization. Pink boxes are α values, blue boxes are γ values, green boxes are official benchmarks
3.3 Model loss comparisons with synthetic as source dataset and real-world as target dataset
3.4 Entropy minimization loss with focal entropy and cross-entropy
3.5 Classification loss comparisons with synthetic as source dataset and real-world as target dataset with multiple trials
3.6 Model loss for optimized (α, γ) = (0.592, 4) as focal entropy and γ = 2 for focal loss, compared to cross-entropy loss and entropy loss as decoder loss (L_pred) and entropy minimization loss (L_guide) respectively
3.7 Entropy minimization loss L_guide for optimized focal-based and cross-entropy based models
3.8 Classification loss L_pred for focal-based and cross-entropy based models
3.9 Comparison between SU-FOCALID and SMILE
List of Tables

3.1 α and γ values
3.2 Comparison with UDA methods on regular benchmarks. Bold is the highest value and underline is the second-highest value
List of Abbreviations
1 CNN Convolutional Neural Networks
2 OCR Optical Character Recognition
3 STR Scene Text Recognition
4 EM Expectation Maximization
5 SOTA State Of The Art
6 STN Spatial Transformer Networks
7 CTC Connectionist Temporal Classification
8 UDA Unsupervised Domain Adaptation
9 SU-FOCALID Sequence-to-sequence Unsupervised domain
adaptation with FOCAL on Imbalance Distribution
10 LSTM Long Short Term Memory
11 BiLSTM Bi-directional Long Short Term Memory
12 RNN Recurrent Neural Network
Abstract

Scene Text Recognition is a subproblem of Optical Character Recognition that will be the focus of this thesis. In recent years, numerous Scene Text Recognition approaches have been developed. Since the amount of real data is not significant and the labeling process can be time-consuming for Scene Text Recognition, the most common training method involves training a model on synthetic data and then predicting on real data. Yet, this can result in domain shifts between synthetic and real images. In addition, each real-world benchmark has its own unique characteristics, including perspective, curved text, contrast, brightness, etc., resulting in poor performance and inaccurate predictions.

To address these limitations, Unsupervised Domain Adaptation (UDA) has been proposed, which can reduce the disparity between source domain datasets and target domain datasets. With pseudo labels generated by a source-labeled dataset, we may utilize Expectation Maximization methods in conjunction with Entropy Minimization to produce predictions with high confidence, hence reducing the discrepancy of an unlabeled target domain dataset. In addition, by employing class-wise self-paced balance, Unsupervised Domain Adaptation can pick a high-confidence portion of pseudo labels for the training set, thereby improving the adaptation process.

In order to implement the described concept, we have modified and validated multiple models in Unsupervised Domain Adaptation and proposed a sequence-to-sequence unsupervised domain adaptation with focal against imbalance distribution (SU-FOCALID), based on a Scene Text Recognition framework that applies adaptation by minimizing latent entropy in pseudo labels generated by its decoder in order to strengthen predictions from the unlabeled target dataset. SU-FOCALID will be evaluated on official scene text recognition benchmarks against prior UDA methods. Our main contributions include:

• Presenting the main concept of Unsupervised Domain Adaptation for Scene Text Recognition using Entropy Minimization.

• Presenting official Scene Text Recognition benchmarks and training datasets.

• Proposing a sequence-to-sequence unsupervised domain adaptation against imbalance distribution (SU-FOCALID).

Keywords: scene text recognition, unsupervised domain adaptation, entropy minimization, deep learning, domain shift
Chapter 1
Introduction
Text has always been an essential aspect of human life, and its application has benefited humanity throughout human evolution. Text is a system of symbols used for recording and intercultural communication. The rich and precise semantic information carried by text is used in many applications such as image search [47], intelligent inspection [6], industrial automation [52], robot navigation [9] and instant translation [24]. Consequently, recognizing text for applications in the real world is a crucial task for computer vision in order to enhance technology. Scene Text Recognition, often known as text recognition in natural scenes, is a major branch of Optical Character Recognition in the Computer Vision field. Despite the fact that text recognition in scanned documents is extensively developed, Scene Text Recognition remains difficult due to numerous real-world factors, such as complicated backgrounds, diverse typefaces, diverse text positions, and bad image conditions. Early studies [52], mainly based on hand-crafted features, had low efficiency and resulted in poor performance. Recently, deep learning has demonstrated promising results on numerous benchmarks, and proposals have been introduced alongside competitive State Of The Art results over the years. Since then, numerous approaches are based on neural networks with additional techniques that represent their benefits.
Also, training strategies and learning methods have contributed to the enhancement of model performance.

However, Scene Text Recognition models require a large amount of data in order to perform well, while the labeling process is time-consuming, hence the lack of human-labeled real-world data. To counteract the lack of data, synthetic data [18] have been presented, creating the main training pattern for future STR research. This method involves training the model on synthetic data, followed by validation or tuning on real-world data. Prior research focused mostly on modifying architecture rather than training on diverse datasets [2]. [2] proposed a main STR model with four stages: transformation, feature extraction, sequence modeling, and prediction. The majority of modern scene text recognition models adhere to this scheme. Another method is data-centric, and suggests that a model trained on one domain may perform badly when presented with data from a different domain, a phenomenon known as domain shift. A model trained on all domains can perform well across domains and can reduce the need for human-labeled data. Recent interesting domains are handwritten, real-world, printed document, and synthetic; based on these domains, we can use one single domain to validate across domains, or use a union of cross-domain data. This created a learning strategy called Unsupervised Domain Adaptation.

In this thesis, we will focus on training from a source domain to a target domain and validating across domains using Unsupervised Domain Adaptation for Scene Text Recognition. Using Expectation Maximization techniques and clustering methods to obtain the entropy distribution, [13] recommended semi-supervised learning to deliver high-confidence predictions in deep learning models, with low entropy indicating the model is confident with one sample, and vice versa. Consequently, it can be utilized as a pseudo-label in the training process. We reproduce results from a method called sequence-to-sequence on minimizing latent entropy (SMILE) to observe the adaptation process, along with proposing a new method called Sequence-to-sequence unsupervised domain adaptation against imbalance distribution (SU-FOCALID).
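To make the entropy-as-confidence idea concrete, the sketch below computes the Shannon entropy of a decoder's per-character softmax outputs. The distributions are made-up toy values, not outputs of the thesis model:

```python
import math

def shannon_entropy(probs):
    """Entropy of one softmax distribution; lower entropy = higher confidence."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sequence_entropy(char_distributions):
    """Average per-character entropy over a predicted character sequence."""
    return sum(shannon_entropy(d) for d in char_distributions) / len(char_distributions)

# Toy per-character softmax outputs (illustrative values only):
confident = [[0.97, 0.01, 0.01, 0.01], [0.95, 0.02, 0.02, 0.01]]  # peaked
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]  # spread out

# The confident prediction has lower entropy, so it makes a safer pseudo-label.
assert sequence_entropy(confident) < sequence_entropy(uncertain)
```

A pseudo-labeling scheme in this spirit would keep only target-domain predictions whose sequence entropy falls below a threshold.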
1.1 Scene Text Recognition
1.1.1 Definition
Text has been used by humanity as a way to communicate and to document culture, knowledge, history, and accomplishments. The imaging technologies of the twenty-first century have progressed steadily, with more sophisticated equipment (cameras, smartphones) for capturing high-quality photographs. As a result, text in images has grown increasingly popular in the field of Computer Vision, as the precise and rich information conveyed by text is crucial in many vision-based application scenarios. Recognizing text in natural scenes has become an active study subject in computer vision and pattern recognition. Yet, extracting text from natural scenes and using it in another process is a hard task with several fundamental concerns and problems.
FIGURE 1.1: Scene Text Recognition visualization.
FIGURE 1.2: Optical Character Recognition visualization.
Text recognition is divided into two scenarios: Optical Character Recognition on scanned documents, and Scene Text Recognition for the latter. Both of these can be distinguished by different aspects suggested by [7], such as background, font, form, noise and access. Specifically, we will be focusing on Scene Text Recognition scenarios later on; the differences can be seen below:
FIGURE 1.3: Scene Text Recognition examples.
• Background: OCR in scanned documents has a white backdrop and is less noisy; the presence of a mark depends on the document's content. Text in natural scenes, on the other hand, may contain many objects and noise in the backdrop, such as signs, boards, people, animals, and vehicles, which can make the image more complex and difficult to detect. In addition, the background may visually resemble text, which might make recognition much more difficult.

• Font: Documents that have been scanned typically have a single font for all of their information, along with a uniform font size, making them easy to recognize. Unlike scanned documents, the fonts in natural scenes vary depending on the images used. Generally, scene text fonts might be difficult to identify due to their artistic nature, making the recognition process more difficult.

• Form: Text in scanned documents is printed in a uniform arrangement, whereas scene text appears in diverse orientations. The diversity of text makes STR more difficult and challenging than OCR in scanned documents.

• Noise: Text in natural scenes can be impacted by noise interferences such as nonuniform illumination, low resolution, motion blurring, and coloring, causing Scene Text Recognition to fail under imperfect imaging conditions.

• Access: Documents that have been scanned are captured front-facing and appear complete as images. Unfortunately, text in natural scenes is collected at random, resulting in various geometric distortions such as perspective, shear, and scale. Diverse text shapes increase the difficulty of character recognition and have an effect on the final output.
In this chapter, we will focus on recognizing text in natural scenes and describe its fundamental problems and issues, along with present solutions and future research. We will recognize an image which contains text in natural scenes; the result will be a string of characters containing the content of that text in the image.

• Input: An image containing text.

• Output: Recognized text in the image.
FIGURE 1.5: Scene Text Recognition output.
Several competitions have been organized in the research field to validate alternative approaches and methods, thereby establishing a measurement for the State Of The Art on those real-world standards. Researchers have evaluated and distinguished benchmarks based on their properties. In conclusion, seven main real-world benchmarks in STR are divided into two categories:

• Regular benchmarks

• Irregular benchmarks

On the foundation of these two categories, a model's performance and inference time can be validated. Due to these categories, numerous models have been characterized by their benefits and advantages as well as their limitations and drawbacks. Each of these categories is defined by its level of difficulty and its text's geometric layout. We will go through these categories more specifically later on.
specif-1.1.1.1 STR in Regular benchmarks
Regular benchmarks contain text images with horizontally laid out characters that have even spacing between them. The images are usually captured frontally and in decent imaging conditions, and are considered relatively easy cases for STR, as described below:
• IIIT5K-Words (IIIT): is a dataset crawled from Google image searches, with query words that are likely to return text images, such as "billboards", "signboard", "house numbers", "house name plates", and "movie posters". IIIT consists of 2,000 images for training and 3,000 images for evaluation.

• Street View Text (SVT): contains outdoor street images collected from Google Street View. Some of these images are noisy, blurry, or of low resolution. SVT consists of 257 images for training and 647 images for evaluation.

• ICDAR2003 (IC03): was created for the ICDAR 2003 Robust Reading competition for reading camera-captured scene texts. It contains 1,156 images for training and 1,110 images for evaluation. Ignoring all words that are either too short (less than 3 characters) or ones that contain non-alphanumeric characters reduces 1,110 images to 867. However, researchers have used two different versions of the dataset for evaluation: versions with 860 and 867 images. The 860-image dataset is missing 7 word boxes compared to the 867 dataset. The omitted word boxes can be found in the supplementary materials.

• ICDAR2013 (IC13): inherits most of IC03's images and was also created for the ICDAR 2013 Robust Reading competition. It contains 848 images for training and 1,095 images for evaluation, where pruning words with non-alphanumeric characters results in 1,015 images. Again, researchers have used two different versions for evaluation: 857 and 1,015 images. The 857-image set is a subset of the 1,015 set where words shorter than 3 characters are pruned.
1.1.1.2 STR in Irregular benchmarks

• ICDAR2015 (IC15): was created for the ICDAR 2015 Robust Reading competition and contains 4,468 images for training and 2,077 images for evaluation. The images are captured by Google Glasses while under the natural movements of the wearer. Thus, many are noisy, blurry, and rotated, and some are also of low resolution. Again, researchers have used two different versions for evaluation: 1,811 and 2,077 images, discarding non-alphanumeric character images and some extremely rotated, perspective-shifted, and curved images for evaluation. Some of the discarded word boxes can be found in the supplementary materials.

• SVT Perspective (SVTP): is collected from Google Street View and contains 645 images for evaluation. Many of the images contain perspective projections due to the prevalence of non-frontal viewpoints.

• CUTE80: is collected from natural scenes and contains 288 cropped images for evaluation. Many of these are curved text images.
FIGURE 1.7: Example of irregular dataset. a) Perspective text, samples from SVTP [33]. b) Curved text, taken from CUTE80 [35].
1.1.2 Applications
Scene Text Recognition plays a significant part in the development of autonomous systems, the improvement of life for the disabled population, and the medical profession. Hence, researchers and businesses may have an interest in improving and developing OCR systems.
1.1.2.1 Intelligent transportation
Building an automatic traffic system is not only advantageous for traffic, but it also allows drivers to overcome language barriers by, for example, automatically detecting road signs and traffic instructions.
FIGURE 1.8: Traffic sign recognition.
Traveling can be difficult for foreigners when a language barrier occurs; traffic road signs, product names, or even menus in restaurants can make travelers feel uncomfortable. Therefore, STR can help overcome the language problem and improve the experience of travelers without the local language [60].
1.1.2.2 Diverse translation
Translation has traditionally been the diplomatic option, but both the translator and the subject must be fluent in the target language in order to communicate effectively. In spite of this, human effort is still required to solve this particular challenge. In contrast, text recognition can circumvent this issue and eliminate the requirement for human-based translation. Moreover, text recognition may be applied not only to text-to-text, but also to image-to-text and audio-to-text, hence the name diverse translation. Text recognition can produce a description based on multiple inputs, including images and audio recordings. More importantly, while humans can perform poorly on imperfect visual images, recognizing text can support enhancing the recognition capability [43].
1.1.2.3 Self-driving system
Recent developments in technology 4.0 have led to the introduction of automatic systems, including self-driving systems. Using a camera-based recognition system can significantly enhance development. The ability to recognize road signs and traffic instructions can aid in navigation and potentially avoid fatal accidents. Text recognition can serve as the 'eyes' in 'eyes and ears' for self-driving vehicles, which is a critical component for many sectors.
FIGURE 1.10: Semantic recognition in STR.
Besides the self-driving system, other automatic systems can inherit this as well, as long as a camera-based approach can be applied. Whether in a production chain or in security surveillance, text recognition provides visual information and therefore improves the overall performance [16].
1.1.2.4 Replacement for the disabled community
Another application where text recognition can be of use is to assist a disabled person, specifically a blind person. They say "The eyes are the window into the soul" (Perspectives on Psychological Science, a journal of the Association for Psychological Science); text recognition can be a replacement for a person's 'eyes'. This interest has attracted many researchers in the medical field and seems to be a potential solution [4].
FIGURE 1.11: Voice assisted text reading system for visually impaired person.
1.1.2.5 Information Extraction
In some instances, document censoring is essential for the development of security-related automatic systems that play a vital role in numerous businesses. Bad imaging conditions might be difficult on the eyes. Despite the complexity of the imaging environment, the censorship process can run easily with the aid of text recognition. Regardless, text recognition can replace human efforts in document censorship and will become a potential solution for many industries [54].
1.1.3 Challenges and recent research
1.1.3.1 Challenges in Scene Text Recognition
The reason why STR has become an attractive interest for researchers and industries is because of its challenges and issues. Ever since the OCR system was first introduced, many sub-problems in STR still remain; some of them include:
FIGURE 1.12: Document information extraction.
• Text enhancement: Text enhancement can increase the quality of the image, making the text easier to recognize. This technique includes recovering blurry images, improving image resolution [54], removing distortion of text, or removing the background [28]. Many algorithms have been introduced and given positive results, such as deconvolution [56] or sparse reconstruction [57].

• Text tracking: Text tracking is employed to preserve the integrity of text localization and to track text in adjacent video frames. Whereas image text has a static background, video text may be merged with the background and noise. This makes recognition considerably more difficult.

• Natural language processing: Apart from the first two challenges, which relate to imaging conditions and tracking, natural language processing is an interesting topic in computer vision. The purpose is to make the computer understand and manipulate natural language text or speech. There is a wide range of text-based applications of NLP, including machine translation [3], automatic summarization [39], question answering [1] and relationship extraction [55].
Many problems still need to be worked on in Scene Text Recognition, with each proposal having related works on a specific problem, contributing to overcoming these challenges. However, with the development of STR progress, new challenges arise over time.
1.2 Unsupervised Domain Adaptation

Supervised learning helped the model achieve generalization and perform well on validation data. In another approach, semi-supervised learning has been introduced to the learning scheme. By using labeled data and unlabeled data, semi-supervised learning can increase the overall performance across multiple domains by reducing the discrepancy distance between the two domains. The idea to create this uni-domain model is called Unsupervised Domain Adaptation.
1.2.2 Unsupervised Domain Adaptation in Scene Text Recognition
Semi-supervised domain adaptation (SSDA) is an important task [10], [58]. However, it has not been fully explored with regard to deep learning methods. The main challenge in domain adaptation is the gap in feature distribution between domains, which degrades the source data performance if the gap is big. This creates a phenomenon called domain shift, which can damage the adapting process. One approach measures the discrepancy distance between the two distributions in the source and target datasets respectively, then trains a model to minimize that distance.

In this thesis, we will apply Unsupervised Domain Adaptation to Scene Text Recognition by minimizing latent entropy as the discrepancy measurement, called Sequence-to-sequence unsupervised domain adaptation with focal against imbalance distribution (SU-FOCALID), and we will compare results to another method called Sequence-to-sequence on Minimizing Latent Entropy (SMILE).
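As a rough illustration of the contrast between a plain entropy objective and a focal-weighted one, the sketch below applies a (1 − p)^γ factor in the spirit of focal loss to down-weight already-confident terms. The exact objective used by SU-FOCALID is defined in Chapter 2, so treat this only as the general idea, with made-up probabilities:

```python
import math

def entropy_loss(probs):
    """Plain entropy minimization term for one softmax distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def focal_entropy_loss(probs, gamma=2.0):
    """Entropy with a focal-style (1 - p)^gamma modulating factor that
    shrinks the contribution of high-probability terms (illustrative only)."""
    return -sum(((1.0 - p) ** gamma) * p * math.log(p) for p in probs if p > 0)

confident = [0.90, 0.05, 0.05]  # made-up distributions
uncertain = [0.40, 0.30, 0.30]

# The modulating factor is at most 1, so the focal variant never exceeds
# the plain entropy term.
assert focal_entropy_loss(confident) < entropy_loss(confident)
assert focal_entropy_loss(uncertain) < entropy_loss(uncertain)
```

With γ = 0 the modulating factor becomes 1 and the two objectives coincide, which is one way to sanity-check an implementation.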
1.3 Requirement and Contribution
1.3.1 Requirement
In this thesis, we have created certain goals in order to validate our method, including:

• Research academic papers related to UDA, STR, Entropy Minimization, and Expectation Maximization algorithms.

• Collect official scene text recognition benchmarks for validation.

• Compare method results to SMILE.

• Propose future research on methods or subjects related to UDA for STR.
1.3.2 Contribution
Our main contributions include:

• Defining Unsupervised Domain Adaptation in Scene Text Recognition using Entropy Minimization.

• Proposing a sequence-to-sequence unsupervised domain adaptation with focal against imbalance distribution (SU-FOCALID).

• Performing a results comparison with the proposed method.
1.4 Thesis Structure
Chapter 1: Introduction
Chapter 2: Prior research and SU-FOCALID definition
Chapter 3: Experiment implementation and comparison results
Chapter 4: Thesis conclusion
Chapter 2
Prior Research
2.1 Related works
2.1.1 Scene text recognition
Scene text recognition is a popular interest that has developed from basic CNNs to large-scale deep learning structures. While more and more methods are proposed, challenges arise along the way, keeping the topic interesting.

The most straightforward pipeline includes 3 stages: feature extraction, sequence modeling, and prediction.
• Feature extraction: Using traditional or sophisticated CNN structures to extract image features. CNNs have proven promising results over hand-crafted features such as Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP). Some of them have even gained reputation, such as ResNet [15], EfficientNet [45], and DenseNet [16].

• Sequence modeling: The extracted features from the feature extraction stage are reshaped into a sequence of features. The sequence features can be created by using LSTM, BiLSTM, or RNN layers [40] in the structure.

• Prediction: The final stage translates sequence features into characters or a probability distribution over characters; this is the final output of the pipeline. CTC [7] is the most straightforward technique; nowadays researchers tend to use the attention mechanism [49] for its efficiency and fast computation.
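To make the CTC prediction stage concrete, here is a minimal greedy (best-path) CTC decoder: take the most likely character at each time step, collapse repeats, then drop blanks. This is a generic sketch of the standard algorithm, not code from the thesis implementation:

```python
BLANK = "-"  # stand-in for the CTC blank symbol

def greedy_ctc_decode(frame_chars):
    """Best-path CTC decoding: collapse repeated characters, then remove blanks."""
    collapsed = []
    prev = None
    for c in frame_chars:
        if c != prev:          # drop consecutive duplicates
            collapsed.append(c)
        prev = c
    return "".join(c for c in collapsed if c != BLANK)

# Per-frame argmax characters from a hypothetical recognizer:
assert greedy_ctc_decode(list("hh-e-ll-ll-oo")) == "hello"
```

The blank symbol is what lets CTC emit genuinely repeated letters, as in the double "l" above, since a blank separates the two runs.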
Though the mentioned framework is common, most methods follow it with their own distinguishing techniques. One line of work used the mentioned framework combined with a 2D attention map, comprised of a feature map and sequence features, to guide the prediction; another suggested a 2D feature map with positional encoding, allowing the model to recognize text in arbitrary shapes; another proposed using multiple BiLSTM layers with selective decoders to enhance prediction. On the other hand, other proposals suggested their own frameworks: adapting BERT's Mask Language Module and Visual Reasoning Module to provide predictions, providing dictionary guidance to sharpen language predictions, or proposing an iterative bi-directional language model to correct wrong predictions.
In this thesis, we will be using the STR framework provided by [2] for comparison and for model weights. With their universal four-stage STR framework, we can compare our method with most of the STR proposals. As mentioned, [2] divided a STR model into four stages:

• Transformation (Trans.)

• Feature Extraction (Feat.)

• Sequence modeling (Seq.)

• Prediction (Pred.)
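This four-stage decomposition can be viewed as a simple composition of stage functions. The stages below are trivial stand-ins (a real system would use a TPS transformation, a CNN backbone, BiLSTM layers, and a CTC or attention decoder); they are assumed purely for illustration:

```python
def run_str_pipeline(image, trans, feat, seq, pred):
    """Four-stage STR framework of [2]: Trans. -> Feat. -> Seq. -> Pred."""
    return pred(seq(feat(trans(image))))

# Placeholder stages standing in for TPS / CNN / BiLSTM / decoder:
trans = lambda img: img                         # normalize geometry
feat  = lambda img: [ord(c) for c in img]       # extract per-position "features"
seq   = lambda fs: fs                           # contextualize the sequence
pred  = lambda fs: "".join(chr(f) for f in fs)  # decode features to characters

assert run_str_pipeline("text", trans, feat, seq, pred) == "text"
```

Because each stage is an independent function, individual modules (for example, the prediction head) can be swapped without touching the rest of the pipeline, which is what makes the comparisons in [2] possible.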