VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER SCIENCE
BACHELOR THESIS
Gradual Domain Adaptation
in Scene Text Recognition
Bachelor of Computer Science (Honors degree)
HO CHUNG DUC KHANH - 19520624
NGUYEN THI MINH PHUONG - 19522065
Supervised by
PhD THANH DUC NGO
HO CHI MINH CITY, 2023
The Thesis Defense Committee has been established in accordance with Decision 155/QD-DHCNTT, issued on 01/03/2023 by the President of the University of Information Technology. The committee is comprised of individuals with extensive expertise and knowledge in the area of study relevant to the thesis defense. To ensure that all aspects of the thesis defense are properly addressed, the following personnel have been chosen to comprise the committee:
• Chairman: PhD Le Dinh Duy.
• Secretary: MS Nguyen Thanh Son.
• Member: PhD Le Minh Hung.
suggestions, and challenging us to think outside of the box. We are deeply grateful for their commitment and dedication to our project.
We are also thankful to all the members of our research group for their help and encouragement. We are especially grateful to Tan and Hung, whose technical guidance and knowledge of domain adaptation techniques have been instrumental in the successful completion of this thesis.
We are thankful to the reviewers for their valuable feedback and suggestions, which have been extremely helpful in improving the quality of this thesis.
We would also like to thank our family and friends for their support and encouragement throughout the entire process.
Finally, we would like to express our gratitude for the computing resources provided at MMLAB UIT and the Faculty of Computer Science, which enabled us to develop the algorithms and experiments presented in this thesis.
All in all, this project would not have been possible without the assistance and help of the people mentioned above. Our sincere appreciation goes out to each of them.
Contents
Abstract
1 Introduction
1.2 Scene Text Recognition
1.3 Domain Adaptation
1.4 Gradual Domain Adaptation
1.5 Scope
1.6 Structure of This Thesis
2 Related Works
2.1 Scene Text Recognition
2.1.1 Transformation
2.1.2 Feature Extraction
2.1.3 Sequence Modeling
2.1.3.1 Language-free methods
2.1.3.2 Language-based methods
2.1.4 Prediction
3 Gradual Domain Adaptation in Scene Text Recognition via Pseudo-labeling
3.1 Domain Adaptation in Scene Text Recognition via Pseudo-labeling
3.1.1 Baseline overview
3.1.2 The Scene Text Recognition Framework
3.2 Gradual Domain Adaptation and Domain Routing
4 Experiments
4.1 Unsupervised Domain Adaptation on synthetic-trained model
4.2 Gradual Domain Adaptation on synthetic-trained model
4.3 Domain routing approach for applying GDA on STR task
4.4 Comparing domain routing approach with state-of-the-art models
List of Figures
1.1 Scene Text Recognition task: output the text content in the image. The sample image is taken from ICDAR 2015 dataset
1.2 Common challenges in Scene Text Recognition
1.3 Domain gap in Scene Text Recognition
1.4 Gradual Domain Adaptation for Scene Text Recognition
2.1 Illustration of Scene Text Recognition framework
2.2 Transformation stage
2.3 Some samples of three unlabeled real-world datasets
3.1 Overview of Pseudo-labeling approach
3.2 Two model combinations according to the STR Framework
3.3 Structure of Spatial Transform Networks (Image from [52])
3.4 Structure of BiLSTM (Image from [58])
3.5 Illustration of Attention mechanism
3.6 The Portraits dataset of students from 1905 to 2005 [17]
4.1 Illustration of Experiment Results I
4.2 Illustration of Experiment Results II
4.3 Samples of three unlabeled real-world datasets
4.4 Illustration of Experiment Results III
List of Tables
4.1 Experimental results I
4.2 Experimental results II
4.3 Compare three unlabeled real-world datasets
4.4 Experiment results on different domain routings
4.5 Comparison between our domain routing approach and state-of-the-art methods
Abstract
Textual information is essential to virtually all aspects of our daily lives, and automating the process of bringing this information into the digital world is a key goal of research in the field of computer vision. Scene Text Recognition (STR) is a particularly important task in this domain, as it has numerous applications in areas such as automated number plate recognition for vehicles, access control systems, and much more.
However, the challenge of Scene Text Recognition is that it requires large amounts of annotated data, which is very expensive and time-consuming to collect. As such, a common approach is to leverage generated synthetic data during training and test the result on real-world data. Unfortunately, this approach can be ineffective, as there is often a large discrepancy between the synthetic data used for training and the data in the real world, referred to as a “domain gap”.
Recent approaches to Scene Text Recognition have attempted to address the domain gap by adopting Domain Adaptation techniques, which try to minimize the discrepancy between the two domains in a semi-supervised manner. However, when the gap between the two domains is too large, Domain Adaptation may not be effective.
This thesis proposes and evaluates a Gradual Domain Adaptation approach, which trains the model on multiple intermediate domains in order to minimize the gap before training on the final domain. This technique is evaluated in terms of its effectiveness in reducing the domain gap, as well as the impact of adding intermediate domains, changing domains, and finding the appropriate domain routing.
As a result, the experiments are able to confirm the following:
1. Gradual Domain Adaptation improves Scene Text Recognition baseline performance by up to 3.12%, compared to the Domain Adaptation approach.
2. Switching the order of domains improves performance by 1.56%.
3. Performance is consistently improved when adapting with increasingly stable domain routings. We observe a performance boost of up to 5.81% in our experiments.
Moreover, our method was able to outperform state-of-the-art approaches by leveraging the domain routing approach. This demonstrates the potential of this approach in scene text recognition.
Keywords:
Gradual Domain Adaptation, Domain Routing, Domain Adaptation, Scene Text Recognition, Unsupervised Domain Adaptation, Pseudo-labeling.
Chapter 1
Introduction
Textual information is essential to virtually all aspects of our daily lives, and automating the process of bringing this information into the digital world is a key goal of research in the field of computer vision. Scene Text Recognition (STR) is a particularly important task in this domain, as it has numerous applications in areas such as automated number plate recognition for vehicles, access control systems, and much more.
However, the challenge of Scene Text Recognition is that it requires large amounts of annotated data, which is very expensive and time-consuming to collect. As such, a common approach is to leverage generated synthetic data during training and test the result on real-world data. Unfortunately, this approach can be ineffective, as there is often a large discrepancy between the synthetic data used for training and the testing data in the real world, referred to as a “domain shift”.
Recent approaches to Scene Text Recognition have attempted to address the domain shift by adopting Domain Adaptation techniques, which try to minimize the discrepancy between the two domains in a semi-supervised manner. However, when the gap between the two domains is too large, Domain Adaptation may not be effective.
This thesis proposes and evaluates a Gradual Domain Adaptation approach, which trains the model on multiple intermediate domains in order to minimize the gap before training on the final domain. This technique is evaluated in terms of its effectiveness in reducing the domain gap, as well as the impact of adding intermediate domains, changing domains, and finding the appropriate domain routing. The main research questions of this thesis are:
1. What is the performance of Gradual Domain Adaptation in Scene Text Recognition?
2. Do intermediate domain routings affect the overall performance of a Scene Text Recognition model using Gradual Domain Adaptation?
3. How should a good domain routing be chosen when applying Gradual Domain Adaptation to Scene Text Recognition?
1.2 Scene Text Recognition
Scene Text Recognition (STR) is an important task in the fields of computer vision and natural language processing, and has been heavily studied due to its many useful applications. Scene Text Recognition is capable of recognizing text components in a wide range of settings, from street signs and license plates to newspaper headlines, advertisements, and digital images and videos. This technology supports tasks such as quickly and accurately searching, translating, and verifying documents. In addition to aiding law enforcement, Scene Text Recognition can also be used to help companies read customer reviews and detect text in security documents for automated verification. It is also useful in a variety of industries, from automotive to healthcare and retail, where it can be used to build more efficient and secure systems. In short, Scene Text Recognition is a versatile technology with applications in multiple fields that can improve the speed and accuracy of various tasks.
The input of this task is an image containing a text instance, and the output is the corresponding text sequence. As an illustration, Figure 1.1 shows an example of the input and output of the Scene Text Recognition task. Specifically, the input image contains the text "MOVING", and the corresponding output is the text sequence "MOVING", which can be seen in the output box. By recognizing the text components present in a scene, Scene Text Recognition can be used for a range of tasks, from automating document processing to providing assistance for the visually impaired.
INPUT OUTPUT
FIGURE 1.1: Scene Text Recognition task: output the text content in the image. The sample image is taken from the ICDAR 2015 dataset.
Despite the progress made on Scene Text Recognition, the task still faces several challenges. A key challenge is the diversity of the conditions of the input images. The text components can be in various fonts, sizes, colors, orientations, and even shapes. This means that a single recognition algorithm may not work optimally across different types of images. Additionally, the background can contain noise, complex patterns, and other distracting elements, making the task more difficult.
For instance, Figure 1.2 demonstrates some of the common challenges of Scene Text Recognition, such as irregular fonts (Fig. 1.2a), noisy background (Fig. 1.2b), irregular text orientation (Fig. 1.2c), and uneven lighting/obstructed texts (Fig. 1.2d). These conditions make it difficult to process the images correctly, as the system must take into account the context of the image in order to accurately recognize the text components. As a result, a robust and accurate Scene Text Recognition algorithm must be able to handle possible variations in the input images.
An additional challenge lies in the limited amount of annotated data available for training. To address this issue, researchers often resort to synthetic data to train their models, since it is possible to generate large amounts of data in this manner [29][20]. While this approach can be successful in some cases, models trained on synthetic data tend to perform poorly when applied to real-world data due to the domain gap. To reduce this domain gap and increase the accuracy of the models, recent works have adopted domain adaptation techniques to bridge the difference between synthetic and real-world data.
(C) Curved texts (D) Obstructed texts
FIGURE 1.2: Common challenges in Scene Text Recognition.
1.3 Domain Adaptation
Domain Adaptation (DA) is a machine learning technique for bridging the gap between two different datasets, particularly in cases where the training and testing data have different distributions (Fig. 1.3). It has become increasingly popular in the field of Scene Text Recognition (STR), as it can help to minimize the discrepancy between synthetic and real-world data, which is commonly referred to as the domain gap. There are two branches of Domain Adaptation, namely Supervised Domain Adaptation and Unsupervised Domain Adaptation. In this work, we mainly focus on the more widely adopted branch: Unsupervised Domain Adaptation.
Source domain (MJSynth) / Target domain
FIGURE 1.3: Domain gap in Scene Text Recognition.
The approaches used for Unsupervised Domain Adaptation in Scene Text Recognition can be broadly classified into two categories: Self-trained Domain Adaptation and Adversarial Domain Adaptation. In Self-trained Domain Adaptation, the model is trained on a labeled dataset in the source domain and an unlabeled dataset in the target domain, and the aim is to minimize the distributional discrepancy between the two domains [4][39][38][80][79][70]. Adversarial Domain Adaptation approaches, on the other hand, focus on learning a domain-invariant feature representation by training a generative adversarial network to discriminate between the source and target domains.
To bridge the gap between the source and target datasets, we focus our evaluation on Self-trained Domain Adaptation for the Scene Text Recognition task. This approach commonly uses Pseudo-labeling to adapt from a labeled source dataset, typically a synthetic one, to an unlabeled target dataset, usually real-world data. However, due to the large domain gap between the two datasets, Pseudo-labeling can be ineffective and lead to the model being trained on wrong labels [33][34]. As such, to improve the performance of the model, the domain adaptation process should be approached in a more gradual manner. To this end, we explore the idea of gradually training the model on intermediate domains, instead of jumping directly from the source to the target domain. This strategy introduces a series of intermediate domains that progressively bridge the gap between the source and target domains, allowing for a more effective domain adaptation. By doing this, the model is able to learn from the labeled source data more effectively, as well as from the unlabeled target data, and is thus able to bridge the gap between the two datasets and perform better on the Scene Text Recognition task.
1.4 Gradual Domain Adaptation
Gradual Domain Adaptation (GDA) is a relatively novel technique for dealing with large domain gaps between source and target datasets. This technique is based on the concept of gradually adapting the model by introducing multiple intermediate domains during the training process [57][9][34][60]. This gradual approach allows the model to benefit from both the labeled source data and the unlabeled target data, which can reduce the domain gap and improve the performance of the model. In particular, instead of directly training the model on the source and target domains, Gradual Domain Adaptation uses multiple intermediate domains as bridges between the two datasets.
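The idea of adapting through a sequence of intermediate domains can be sketched as a simple loop over a domain routing. This is a minimal sketch under stated assumptions: the `adapt` callable (one round of self-training on a single unlabeled domain), the dictionary used as a stand-in model, and the domain names are all hypothetical and are not taken from the thesis.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-ins: the "model" is a dict that records which domains it
# has seen, and `adapt` performs one adaptation round on one unlabeled domain.
Model = Dict[str, List[str]]
Domain = Tuple[str, Sequence[str]]  # (domain name, unlabeled image identifiers)
AdaptFn = Callable[[Model, str, Sequence[str]], Model]


def gradual_domain_adaptation(
    model: Model, routing: Sequence[Domain], adapt: AdaptFn
) -> Model:
    """Adapt the model on each domain of the routing, in order."""
    for domain_name, unlabeled_images in routing:
        model = adapt(model, domain_name, unlabeled_images)
    return model


def toy_adapt(model: Model, domain_name: str, unlabeled_images: Sequence[str]) -> Model:
    # Record the visit order instead of actually training (illustrative only).
    return {"history": model["history"] + [domain_name]}


source_model: Model = {"history": ["synthetic_source"]}
routing: List[Domain] = [
    ("intermediate_1", ["a.png"]),
    ("intermediate_2", ["b.png"]),
    ("target", ["c.png"]),
]
adapted = gradual_domain_adaptation(source_model, routing, toy_adapt)
print(adapted["history"])
```

Reordering the entries of `routing` changes the path taken from source to target, which is the aspect the later experiments on domain routing examine.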
The Gradual Domain Adaptation technique has mainly been studied in the context of image classification, and has been found to be effective in bridging the gap between source and target domains and improving the performance of the model [76]. In this thesis, we investigate the application of Gradual Domain Adaptation to Scene Text Recognition, and evaluate its effectiveness in reducing the domain gap and improving the performance of the model. We assess how adding intermediate domains, changing domains, and finding the appropriate domain routing affect the result of Domain Adaptation in Scene Text Recognition, and discuss the implications of the results. We also explore the benefits of using Gradual Domain Adaptation in this setting, such as the ability to better adapt to the target domain, and the potential for more accurate predictions due to the combination of labeled source data and unlabeled target data. By exploring these topics, we aim to gain a better understanding of the potential of Gradual Domain Adaptation in the field of Scene Text Recognition.
FIGURE 1.4: Gradual Domain Adaptation for Scene Text Recognition.
1.5 Scope
To further explore the effectiveness of the Gradual Domain Adaptation approach, this thesis will focus on the following objectives:
1. Evaluating how adding intermediate domains affects the result of Pseudo-labeling Domain Adaptation in Scene Text Recognition. We will investigate how adding new domains can provide additional training data to improve the performance of the model, and how it can help reduce the domain gap between source and target domains.
2. Evaluating how changing domains affects the result of Pseudo-labeling Domain Adaptation in Scene Text Recognition. We will look into how different domain combinations and sequences can lead to improved performance of the model. We will also assess how different domains can provide different types of information to the model, and how this can improve the overall result.
3. Evaluating how to find an appropriate domain routing for Pseudo-labeling Domain Adaptation in Scene Text Recognition. We will analyze how different domain routing strategies can be used to effectively transfer knowledge from source to target domains, and how this can be used in practice.
1.6 Structure of This Thesis
This thesis is divided into five chapters, each of which is designed to provide the reader with a comprehensive overview of the topic:
• Chapter 1 - Introduction: this chapter provides an overview of Scene Text Recognition (STR), Domain Adaptation, and Gradual Domain Adaptation. It seeks to provide a comprehensive overview of the current state of the field, as well as a review of the literature on this topic.
• Chapter 2 - Related works: this chapter discusses the existing works in the field of Scene Text Recognition and Domain Adaptation.
• Chapter 3 - Gradual Domain Adaptation in Scene Text Recognition via Pseudo-labeling: this chapter explains the proposed Gradual Domain Adaptation approach and outlines the objectives of this thesis. It analyzes the effectiveness of the proposed approach and provides a detailed description of the framework.
• Chapter 4 - Experiments: this chapter presents the experimental results of the proposed approach and evaluates how adding intermediate domains, changing domains, and finding the appropriate domain routing affect the result of Domain Adaptation in Scene Text Recognition. It includes a thorough analysis of the experiments conducted, as well as a discussion of the results and implications.
• Chapter 5 - Discussions: this chapter summarizes the findings of this thesis and outlines possible future work in this area. It provides an overview of the implications of the findings, as well as potential applications and avenues for further exploration.
Chapter 2
Related Works
In this chapter, we review the literature on Scene Text Recognition methods. We then discuss recent attempts at applying Domain Adaptation and Gradual Domain Adaptation techniques to Scene Text Recognition. Finally, we discuss the related datasets used for training, adapting, and evaluation.
2.1 Scene Text Recognition
Scene Text Recognition (STR) is a widely used tool for recognizing text in scene images. Deep learning methods have quickly become the go-to approach for reading text in images, and have achieved impressive results. Baek et al. (2019) [5] proposed a comprehensive STR model framework that combines existing related studies into a single framework. This framework is composed of four stages: Transformation, Feature Extraction, Sequence Modeling, and Prediction.
FIGURE 2.1: Illustration of Scene Text Recognition framework.
In the Transformation stage, images of scene text are converted into a suitable format for further processing. Feature Extraction follows, where the transformed images are processed to extract relevant features from the scene text such as font, color, size, and background. Next, the Sequence Modeling stage uses the extracted features to capture the contextual information within a sequence of characters. Finally, the Prediction stage uses the output of the sequence model to make predictions about the output character sequence. With the combined strength of these techniques, this STR model framework is able to achieve strong performance on a variety of STR tasks.
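To make the four-stage structure concrete, the sketch below enumerates model configurations by picking one module per stage. The module names are the ones mentioned in this chapter (STN, VGG, RCNN, ResNet, BiLSTM, CTC, Attn); treating Transformation as optional (`"None"`) is an assumption for illustration, and `STRConfig` is a hypothetical helper rather than part of any existing library.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class STRConfig:
    """One module choice per stage of the four-stage STR framework."""

    transformation: str
    feature_extraction: str
    sequence_modeling: str
    prediction: str

    def name(self) -> str:
        return "-".join(
            [
                self.transformation,
                self.feature_extraction,
                self.sequence_modeling,
                self.prediction,
            ]
        )


# Candidate modules per stage; "None" means the stage is skipped (assumption).
TRANSFORMATION = ("None", "STN")
FEATURE_EXTRACTION = ("VGG", "RCNN", "ResNet")
SEQUENCE_MODELING = ("None", "BiLSTM")
PREDICTION = ("CTC", "Attn")

all_configs = [
    STRConfig(*choice)
    for choice in product(
        TRANSFORMATION, FEATURE_EXTRACTION, SEQUENCE_MODELING, PREDICTION
    )
]
print(len(all_configs))
print(all_configs[0].name(), all_configs[-1].name())
```

Under these assumptions the four stages yield 2 x 3 x 2 x 2 = 24 combinations, which makes it straightforward to compare modules stage by stage.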
2.1.1 Transformation
The Transformation component of a Scene Text Recognition model is responsible for preparing the input images for further processing. It is a crucial stage that enables the model to extract useful features from the image. It involves converting the input images into a format that is suitable for feature extraction, such as transforming the image into a binary representation or converting it into a set of contours and shapes. This process is particularly important since it allows the model to identify features that would otherwise be unidentifiable. In essence, the Transformation component helps the model to interpret the input image in a way that is suitable for feature extraction. As a result, the model can more effectively process the image and extract meaningful features from it.
Additionally, text images in natural scenes come in diverse shapes, as shown by curved and tilted texts. These text images, when fed unaltered, can pose a challenge to the feature extraction stage, as it needs to learn an invariant representation with respect to the complex geometry of the input image. To overcome this problem, several methods have been proposed. For example, Shi et al. (2016) and Liu et al. (2016) [42] introduced a spatial transformer network (STN) [28] that rectifies the entire text before recognition. This method can be effective in addressing perspective distortion in scene text, although it is limited in its ability to handle more complex forms of distortion. To address this issue, CharNet was proposed, which introduces a character-level spatial transformer to rectify individual characters, making it more effective for complex distortions that cannot easily be modeled by a single global transformation.
Input Image Rectified Image
FIGURE 2.2: Transformation stage from
By performing these pre-processing steps, which are flexible with respect to the diverse aspect ratios of text lines, the model can extract useful features more effectively and make more accurate predictions.
2.1.2 Feature Extraction
The Feature Extraction component of a Scene Text Recognition model is an indispensable part of the model that allows it to effectively recognize text from images. It is responsible for extracting the important features from a given image. The extracted features are then used as input to the Sequence Modeling or Prediction stage and enable the model to make more accurate predictions about the text contained in the image. This is done by extracting the edges, shapes, and other structural elements of the image that are necessary for accurate recognition.
We study three architectures, VGG [53], RCNN [36], and ResNet [15], which have been previously used as feature extractors for STR.
The Visual Geometry Group (VGG) [53] architecture is a deep convolutional neural network (CNN) that has been widely used for image recognition tasks, particularly in scene text recognition. It consists of multiple convolutional layers followed by a few fully connected layers, enabling it to extract complex features from an image and accurately identify objects within it. This makes VGG a suitable choice for Scene Text Recognition tasks, as it can identify text components in a wide range of conditions, such as varying fonts, sizes, and backgrounds. Additionally, its use of a very deep convolutional neural network, with up to 19 layers, gives it the ability to accurately classify images, making it a powerful tool for STR.
Recurrent Convolutional Neural Networks (RCNNs) are a type of CNN that can adjust their receptive fields based on the character shapes in an image. This is done by recursively applying a convolutional layer and a non-linear function to the input image, allowing the size and shape of the receptive field to be adjusted. This makes RCNNs well suited for scene text recognition, as they can better recognize specific shapes and patterns. RCNNs are composed of multiple convolutional layers, followed by fully connected layers, which allow them to extract complex features from an image and accurately identify objects within it. Additionally, they are capable of learning and adapting to changes in the input, resulting in improved performance over time.
ResNet (Residual Network) is a type of deep convolutional neural network (CNN) that has been used in a variety of tasks, including scene text recognition. It consists of multiple convolutional layers connected by a series of residual connections, which allow the network to learn more efficiently by reusing features from earlier layers. Additionally, the use of residual connections helps to reduce the vanishing gradient problem and enables deeper networks to be trained more effectively. As a result, ResNet has been demonstrated to achieve state-of-the-art performance in image recognition tasks, making it a suitable choice for scene text recognition.
2.1.3 Sequence Modeling
The Sequence Modeling component of a Scene Text Recognition model is a critical component for providing the context necessary for accurately recognizing text. This component takes the features extracted by the Feature Extraction component and applies context-aware sequence modeling techniques such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs). It uses these algorithms to predict the probability of a sequence of characters based on their context and the co-occurrence of characters. This helps the model to better recognize words and long sequences of characters, even when the characters are in different fonts or have different orientations. This is an essential element for Scene Text Recognition models, as it allows the model to better understand the image and make more accurate predictions.
On the other hand, using the Sequence Modeling component comes at a cost for Scene Text Recognition models, as it increases computational complexity and memory consumption. As a result, many models choose not to use Sequence Modeling, even though this lowers accuracy, in order to obtain a simpler model. Baek’s framework [5] allows for the selection or de-selection of Sequence Modeling. Models using Sequence Modeling are referred to as “language-based”, whereas those that do not are referred to as “language-free”.
2.1.3.1 Language-free methods
Language-free methods typically employ convolutional features without regard for character dependency. They are divided into two categories: CTC-based and segmentation-based [57] methods.
CTC-based methods [26|(B1](23]|B0] first extract visual features through CNNs
and then train the CNN and RNN end-to-end using CTC loss to find the most
After attention-based methods became popular, language-based methods mainly employed the attention mechanism, which implicitly models language using more powerful RNNs or Transformers. The encoder-decoder architecture makes use of linguistic information and character dependency. To boost performance, some methods focus on learning a new feature representation. For example, some previous works use Bidirectional Long Short-Term Memory (BiLSTM) to make a better sequence after the feature extraction stage [50][52][11]. The work in [1] proposed a contrastive learning algorithm that first divides each feature map into a sequence of individual elements and applies the contrastive loss to them; the learned representation features are then fed to the recognizer. Yan et al. proposed a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
Others may focus on integrating a rectification module [51][44][71][67] to reconstruct normal images from irregular images. Then, the reconstructed images are fed to the encoder-decoder module for further recognition. However, scene text usually has a variety of shapes and sizes, making it difficult for the rectification module to transform all the irregular text instances into regular ones.
Since current attention-based methods suffer from the attention drift problem, directly decoding upon the convolutional features or linguistic features will degrade recognition performance. To mitigate this, Wang et al. (2020) [64] generate character center masks to help focus attention on the right position.
2.1.4 Prediction
The Prediction component, the final step in the Scene Text Recognition model, is responsible for making predictions about the text contained in an image. This component uses the features extracted in the Feature Extraction stage and the linguistic information of the Sequence Modeling stage to make a prediction about the sequence of characters in the text. This is done by taking the probabilities generated by the previous component and applying them to the characters in the image. The resulting probabilities are then used to rank the characters in the image, and the model can then output the most likely sequence of characters. Additionally, the Prediction component can also be used for post-processing, such as correcting errors or applying techniques such as spell-correction or text-normalization. By using the Prediction component, the model can make more accurate predictions and provide useful post-processing features, resulting in improved performance on the Scene Text Recognition task.
We have two options for prediction: Connectionist temporal classification (CTC) [18] and attention-based sequence prediction (Attn) [52][11].
CTC is a powerful tool for predicting a non-fixed number of sequences, even when a fixed number of features are given. This is done by predicting a character at each column (h_i ∈ H) and by modifying the full character sequence into a non-fixed stream of characters by deleting repeated characters and blanks [18], [50].
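The CTC decoding rule described above (one prediction per column, then deleting repeated characters and blanks) can be sketched with greedy decoding. This is a minimal sketch, assuming `-` as the blank symbol and toy per-column probabilities; both are illustrative choices, and real systems may use beam search instead of the greedy best path.

```python
from itertools import groupby
from typing import Dict, Sequence

BLANK = "-"  # assumed blank symbol for this sketch


def collapse_ctc_path(path: str, blank: str = BLANK) -> str:
    """Merge consecutive repeated characters, then drop blanks."""
    merged = [char for char, _ in groupby(path)]
    return "".join(char for char in merged if char != blank)


def ctc_greedy_decode(columns: Sequence[Dict[str, float]], blank: str = BLANK) -> str:
    """Pick the most probable symbol at each column, then collapse the path."""
    best_path = "".join(max(column, key=column.get) for column in columns)
    return collapse_ctc_path(best_path, blank)


# Toy per-column probabilities over the symbols {"-", "a", "b"}.
toy_columns = [
    {"-": 0.1, "a": 0.9, "b": 0.0},
    {"-": 0.2, "a": 0.8, "b": 0.0},
    {"-": 0.7, "a": 0.3, "b": 0.0},
    {"-": 0.0, "a": 0.6, "b": 0.4},
]

print(collapse_ctc_path("hh-e-ll-lo-"))
print(ctc_greedy_decode(toy_columns))
```

The blank between the two runs of `l` in the first example is what allows a genuine double letter to survive the merging step.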
Attn, on the other hand, automatically captures the information flow within the
input sequence to predict the output sequence [6]. It allows an STR model to learn
a character-level language model representing output class dependencies.
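To make the idea of attending over the input sequence more concrete, the sketch below computes a single decoding step of a simplified dot-product attention. This is an illustrative simplification and not the attention decoder of [52] or [11]; the function name `attention_step`, the query vector, and the toy feature columns are all assumptions.

```python
import math


def attention_step(query, columns):
    """One attention step: weight each feature column by its similarity to the query.

    query: decoder state vector (list of floats), hypothetical in this sketch.
    columns: list of feature vectors, one per input column.
    Returns (weights, context), where the weights sum to 1.
    """
    # Similarity score between the query and every column (dot product).
    scores = [sum(q * h for q, h in zip(query, column)) for column in columns]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    # Context vector: attention-weighted sum of the columns.
    dim = len(columns[0])
    context = [
        sum(w * column[d] for w, column in zip(weights, columns))
        for d in range(dim)
    ]
    return weights, context


# Toy example (hypothetical data): the query aligns best with the first column.
columns = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attention_step([4.0, 0.0], columns)
```

In a full decoder, the context vector would be combined with the previously predicted character to predict the next one, which is how the dependency between output classes can be learned.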
2.2 Domain Adaptation
Domain shift (or domain gap) is an important issue in machine learning: a
model's performance degrades when it is applied
to a new domain. Domain shift occurs when the model is trained on data from
one domain but is then applied to data from another domain that it is not familiar
with. This can lead to a decrease in accuracy and robustness, as the model is not
able to accurately recognize text from the new domain [61][19][33][47][24].
In order to address this issue, Domain Adaptation has been proposed to bridge
the gap between training and testing data by adapting trained models to the test
distribution with the help of data from the target domain [75][43][43][14]. This
is done by transferring the knowledge of the source domain, i.e., the training
distribution, to the target domain, allowing the model to adapt to the new
domain while maintaining its original performance. The goal is to minimize the
domain discrepancy so that the model trained on the source domain can perform
well on the target domain. Domain Adaptation techniques can be used to adapt
a model to different domains, such as different languages, different geographic
areas, or different types of data. In Scene Text Recognition especially, most
models are trained on synthetic data and evaluated on real-world data. Synthetic
data is generated to simulate real-world data, but it still does not sufficiently cover
the complexity of real-world data, such as fonts, backgrounds, and styles, leading
to a large gap between the two domains. As a result, we employ Domain Adaptation
to address the domain shift issue in Scene Text Recognition. By using Domain
Adaptation, models can remain generalizable and can be more easily deployed in
various contexts.
Domain Adaptation is classified into two main types: Supervised Domain
Adaptation (SDA) and Unsupervised Domain Adaptation (UDA). SDA relies on labeled
target-domain data and works well even with a limited number of labels. UDA, on the other
hand, uses unlabeled target-domain data but requires a large number of
target samples [13]. We use synthetic data for training and labeled real-world
data for evaluation in this thesis. However, due to the scarcity of labeled
real-world data and the abundance of unlabeled real-world data, we use unlabeled
real-world data as the target domain and attempt to align the distributions of
synthetic and real-world data using Unsupervised Domain Adaptation.
Unsupervised Domain Adaptation
In recent years, various strategies for UDA have been proposed. One of the most
widely used methods is invariant representation learning [56][76]. In this
method, adversarial training is typically used to learn feature representations that
are invariant across the source and target domains [14]. Recently, self-training
(also known as pseudo-labeling) has been adapted for UDA [39][38][80][79][70]. The basic
idea of pseudo-labeling is that a source-trained classifier produces pseudo-labels for
unlabeled target-domain data, which are then used to further improve the classifier.
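The self-training loop can be sketched on a toy one-dimensional problem as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited work: a nearest-centroid classifier stands in for the recognition model, and the function names (`fit_centroids`, `predict`, `self_train`), the margin-based confidence score, and the threshold of 0.8 are all hypothetical choices. The sketch assumes at least two classes.

```python
from statistics import mean


def fit_centroids(samples):
    """Toy stand-in for training: one centroid per class from (x, label) pairs."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict(centroids, x):
    """Return (label, confidence); confidence is a toy margin score in [0.5, 1]."""
    ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
    d_near = abs(x - centroids[ranked[0]])
    d_far = abs(x - centroids[ranked[1]])
    total = d_near + d_far
    confidence = d_far / total if total else 0.5
    return ranked[0], confidence


def self_train(source, target_unlabeled, threshold=0.8, rounds=3):
    """Train on labeled source data, then refine with confident pseudo-labels."""
    model = fit_centroids(source)
    for _ in range(rounds):
        pseudo = []
        for x in target_unlabeled:
            label, confidence = predict(model, x)
            # Keep only predictions the current model is confident about.
            if confidence >= threshold:
                pseudo.append((x, label))
        # Retrain on the labeled source data plus the pseudo-labeled target data.
        model = fit_centroids(source + pseudo)
    return model


# Toy example (hypothetical data): two classes on a line.
source = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
target_unlabeled = [2.0, 3.0, 7.0, 8.0, 5.1]
model = self_train(source, target_unlabeled)
```

The confidence threshold controls the trade-off between the amount of pseudo-labeled data and the noise in the pseudo-labels; samples near the decision boundary (such as 5.1 here) are never used.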
Unsupervised Domain Adaptation in Scene Text Recognition
Scene Text Recognition has utilized both self-training and adversarial training
for Unsupervised Domain Adaptation. Azadi et al. (2018) presented the
geometry-aware domain adaptation network (GA-DAN), which converts synthetic
text images into realistic scene text images and uses the converted images to
train the target recognition model. Baek et al. (2021) [4] employed pseudo-labeling as
a self-training method to improve STR performance while using only real data to train
the STR model. Zhang et al. (2021) developed a Sequence-to-Sequence Domain
Adaptation Network (SSDAN) for robust text image recognition [73]. Zheng et al.
(2022) proposed a domain adaptation framework for scene text recognition
that combines pseudo-labeling and adversarial learning.
2.3 Gradual Domain Adaptation
The most prominent issue of Unsupervised Domain Adaptation is the large
domain shift between the source and target domains, which occurs when the source
and target domains differ significantly. Previous theoretical studies showed that
as the gap between two domains widens, the generalization error of UDA also
increases [76][7]. It may be challenging to adapt to the target domain in a one-off
manner because of a possibly large shift between the two domains, and it has
been observed that existing UDA algorithms do not perform well under large shifts
[63][64].
One obvious solution to the large domain shift problem is to split a large shift into
multiple smaller shifts. Inspired by curriculum learning and the “divide-and-conquer”
idea, Gradual Domain Adaptation (GDA) [57][9][34][60] was developed to address the
issue of significant gaps between adapted domains. Curriculum learning
suggests beginning the training process with easier samples and progressively
making them more difficult. To the source and target domains of UDA, GDA
adds additional unlabeled data as intermediate domains that gradually shift
from the source to the target, which helps the model adapt to intermediate
distributions before being exposed to the target domain. According to [24], the
generalization gap (from source to target distribution) will be significantly smaller if
learning algorithms are exposed to incremental changes in the data distribution
throughout a self-training regime.
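The benefit of adapting through intermediate domains can be illustrated with a toy sketch of gradual self-training. It assumes a one-dimensional nearest-centroid classifier as a stand-in for the model and a sequence of unlabeled domains that drift step by step; the function names (`fit_centroids`, `predict`, `gradual_self_train`) and all data values are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean


def fit_centroids(samples):
    """Toy stand-in for training: one centroid per class from (x, label) pairs."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def gradual_self_train(source, unlabeled_domains):
    """Self-train through the domains in order, from nearest to the source to the target."""
    model = fit_centroids(source)
    for unlabeled in unlabeled_domains:
        # Pseudo-label the current domain with the model from the previous step,
        # then fit a new model on those pseudo-labels.
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        model = fit_centroids(pseudo)
    return model


# Toy example (hypothetical data): both classes drift to the right at every step.
source = [(-0.5, "a"), (0.5, "a"), (3.5, "b"), (4.5, "b")]
intermediate_1 = [0.7, 1.7, 4.7, 5.7]
intermediate_2 = [1.9, 2.9, 5.9, 6.9]
target = [3.1, 4.1, 7.1, 8.1]  # true classes: a, a, b, b

gradual = gradual_self_train(source, [intermediate_1, intermediate_2, target])
direct = gradual_self_train(source, [target])  # one-off adaptation, no intermediates

gradual_labels = [predict(gradual, x) for x in target]
direct_labels = [predict(direct, x) for x in target]
```

In this toy setting the gradual model labels the target samples correctly, while the one-off adaptation assigns every target sample to class "b" because the shift from source to target is larger than the distance between the two classes.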
Gradual self-training, a general machine learning technique recently proposed
by Kumar et al. (2020) [54], outperforms vanilla self-training on a number of
synthetic and real-world datasets. In parallel, an adversarial adaptation
technique for GDA was put forth by Wang et al. (2020) [60]. Furthermore,
Abnar et al. (2021) [2] provide a form of gradual self-training that does not
require intermediate domain data, since it can generate pseudo-data for the
intermediate domains. Zhou et al. (2022) proposed an algorithm based on the
teacher-student paradigm with an active query method. A slightly different
setting, where labeled data is also available in the intermediate domains, was
explored by Dong et al. (2022) [12].
However, Gradual Domain Adaptation has not yet been applied to the Scene Text
Recognition task. In this thesis, we investigate the efficacy of the Gradual Domain
Adaptation approach for Scene Text Recognition. We focus on applying the
gradual approach to self-training because, when there is a large distribution shift
between the source and target, the performance of self-training suffers markedly
[61]. Additionally, Kumar et al. empirically validated the effectiveness of gradual
self-training for image classification on both synthetic and real datasets. For
these reasons, we use Baek et al. (2021) [4] as the baseline when implementing
Gradual Domain Adaptation for self-training.
2.4 Datasets
In this section, we introduce the datasets used for the Scene Text Recognition task
in three stages: synthetic datasets for training, unlabeled real-world datasets for
adaptation, and labeled real-world datasets (i.e., benchmark datasets) for
evaluation.
2.4.1 Synthetic Datasets
MJSynth (MJ) is a synthetic dataset created for Scene Text Recognition (STR)
tasks. It contains 9 million synthetic images of scene text, each with a single line
of text. The dataset is generated by an algorithm which renders realistic text
images with varying fonts, sizes, orientations, and backgrounds. MJSynth has been
used extensively for training and evaluating Scene Text Recognition models, as it
provides a large amount of labeled data which is useful for model training and
evaluation. Additionally, the synthetic data allows for the use of data
augmentation techniques, which can improve the performance of Scene Text Recognition
models.
SynthText (ST) is a synthetic dataset that is generated by combining text
with a large collection of background images. It is composed of 800,000 synthetic
images and 7M word boxes of text with a variety of fonts, colors, sizes,
backgrounds, and shapes. The images are generated using a text-to-image rendering
algorithm, and the text is written in Latin-based languages. Additionally, the
images are augmented using various transformations such as rotation, scaling, and
perspective distortion. This allows the model to learn from a wide variety of data
and provides a more diverse set of training data. ST is a valuable resource for
creating models that can accurately detect text in different languages and domains.
For STR, we crop the texts in scene images and use them for training.
2.4.2 Unlabeled Real-world Datasets
Unlabeled real-world datasets are essential for training Scene Text Recognition
(STR) models because they provide a large amount of data that is not limited
to a single language or domain. This enables the model to learn from a diverse
set of data and contributes to its generalization ability. Furthermore, because
these datasets are unlabeled, no manual annotation is required, which can be
time-consuming and expensive. As a result, they are a low-cost and efficient means
of training STR models. In addition, using real-world data enables the model to
learn from data with more realistic variations, such as different fonts, sizes, and
orientations, which can significantly improve its performance on the STR task.
For that reason, we utilize three unlabeled datasets which contain scene images
but do not have any word region annotation: Book32, TextVQA, and ST-VQA.
Some samples of the three unlabeled real-world datasets are illustrated in Figure 2.3.
Book32 is composed of 208K book cover images in 32 categories, together with
title text, author text, and category membership, collected
from the Amazon Books dataset. It contains a large number of handwritten or
curved texts. This dataset is very challenging and can be used for a variety of
tasks, including STR. After being cropped, Book32 provides 3.9M word boxes for the
STR model.
TextVQA was created for text-based visual question answering and consists
of 28,408 OpenImage V3 [23] images from categories that tend to contain text, such
as “billboard” and “traffic sign”. TextVQA provides 551K word boxes for the STR
model after being cropped.
ST-VQA [8] was created for scene text-based visual question answering, but
includes datasets such as IC13, IC15, and COCO, so we chose to exclude them from
our consolidation. ST-VQA has 79K word boxes for the STR model after being
cropped.
2.4.3 Benchmark Datasets
Street View Text (SVT) is a dataset used to evaluate Scene Text Recognition
models. It consists of text images collected from Google Street View, with 257
images for training and 647 images for evaluation; each image contains a single
line of text. The text in the images is in Latin-based languages and comes in a
variety of fonts, orientations, and backgrounds.
IIIT5K-Words (IIIT) dataset contains 5,000 natural scene images of printed
English text. It is one of the most popular datasets for scene text recognition
and was created by the International Institute of Information Technology (IIIT) in
Hyderabad. The images in the dataset were retrieved from Google image searches
using terms like "billboards" and "movie posters." It has 2,000 training images and
3,000 evaluation images.
FIGURE 2.3: Some samples of three unlabeled real-world datasets: (a) Book32, (b) TextVQA, (c) ST-VQA.
ICDAR2013 (IC13) was created for the ICDAR 2013 Robust Reading
competition. It contains 848 images for training and 1,015 images for evaluation.
ICDAR2015 (IC15) was collected by people wearing Google Glass; thus,
many of the images contain perspective text and some are blurry. It contains
4,468 images for training and 2,077 images for evaluation.
SVT Perspective (SP) is collected from Google Street View, similar to SVT.
Unlike SVT, SP contains many perspective texts. It contains 645 images for
evaluation.
CUTE80 (CT) is collected for curved text. The images are captured by a
digital camera or collected from the Internet. It contains 288 cropped images for
evaluation.