Bachelor Thesis in Computer Science: Gradual Domain Adaptation for Scene Text Recognition


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

FACULTY OF COMPUTER SCIENCE

BACHELOR THESIS

Gradual Domain Adaptation in Scene Text Recognition

Bachelor of Computer Science (Honors degree)

HO CHUNG DUC KHANH - 19520624

NGUYEN THI MINH PHUONG - 19522065

Supervised by

PhD THANH DUC NGO

HO CHI MINH CITY, 2023


The Thesis Defense Committee has been established in accordance with Decision 155/QD-DHCNTT, issued on 01/03/2023 by the President of the University of Information Technology. The committee comprises individuals with expertise and knowledge in the specific area of study relevant to the thesis defense. To ensure that all aspects of the thesis defense are properly addressed, the following personnel have been chosen to form the committee:

• Chairman: PhD Le Dinh Duy.

• Secretary: MS Nguyen Thanh Son.

• Member: PhD Le Minh Hung.


suggestions, and challenging us to think outside of the box. We are deeply grateful for their commitment and dedication to our project.

We are also thankful to all the members of our research group for their help and encouragement. We are especially grateful to Tan and Hung, whose technical guidance and knowledge of domain adaptation techniques have been instrumental in the successful completion of this thesis.

We are thankful to the reviewers for their valuable feedback and suggestions, which have been extremely helpful in improving the quality of this thesis.

We would also like to thank our family and friends for their support and encouragement throughout the entire process.

Finally, we would like to express our gratitude for the computing resources provided at MMLAB UIT and the Faculty of Computer Science, which enabled us to develop the algorithms and experiments presented in this thesis.

All in all, this project would not have been possible without the assistance and help of the people mentioned above. Our sincere appreciation goes out to each of them.

Contents

Abstract
1 Introduction
1.2 Scene Text Recognition
1.3 Domain Adaptation
1.4 Gradual Domain Adaptation
1.5 Scope
1.6 Structure of This Thesis
2 Related Works
2.1 Scene Text Recognition
2.1.1 Transformation
2.1.2 Feature Extraction
2.1.3 Sequence Modeling
2.1.3.1 Language-free methods
2.1.3.2 Language-based methods
2.1.4 Prediction
3 Gradual Domain Adaptation in Scene Text Recognition via Pseudo-labeling
3.1 Domain Adaptation in Scene Text Recognition via Pseudo-labeling
3.1.1 Baseline overview
3.1.2 The Scene Text Recognition Framework
3.2 Gradual Domain Adaptation and Domain Routing
4 Experiments
4.1 Unsupervised Domain Adaptation on synthetic-trained model
4.2 Gradual Domain Adaptation on synthetic-trained model
4.3 Domain routing approach for applying GDA on STR task
4.4 Comparing domain routing approach with state-of-the-art models

List of Figures

1.1 Scene Text Recognition task: output the text content in the image. The sample image is taken from the ICDAR 2015 dataset
1.2 Common challenges in Scene Text Recognition
1.3 Domain gap in Scene Text Recognition
1.4 Gradual Domain Adaptation for Scene Text Recognition
2.1 Illustration of Scene Text Recognition framework
2.2 Transformation stage
2.3 Some samples of three unlabeled real-world datasets
3.1 Overview of Pseudo-labeling approach
3.2 Two model combinations according to the STR Framework
3.3 Structure of Spatial Transform Networks (Image from [52])
3.4 Structure of BiLSTM (Image from [58])
3.5 Illustration of Attention mechanism
3.6 The Portraits dataset of students from 1905 to 2005 [17]
4.1 Illustration of Experiment Results I
4.2 Illustration of Experiment Results II
4.3 Samples of three unlabeled real-world datasets
4.4 Illustration of Experiment Results III

List of Tables

4.1 Experimental results I
4.2 Experimental results II
4.3 Comparison of three unlabeled real-world datasets
4.4 Experiment results on different domain routings
4.5 Comparison between our domain routing approach and state-of-the-art methods

Abstract

Textual information is essential to virtually all aspects of our daily lives, and automating the process of bringing this information into the digital world is a key goal of research in the field of computer vision. Scene Text Recognition (STR) is a particularly important task in this domain, as it has numerous applications in areas such as automated number plate recognition for vehicles, access control systems, and much more.

However, the challenge of Scene Text Recognition is that it requires large amounts of annotated data, which is very expensive and time-consuming to collect. As such, a common approach is to leverage generated synthetic data during training and test the result on real-world data. Unfortunately, this approach can be ineffective, as there is often a large discrepancy between the synthetic data used for training and the data in the real world, referred to as a “domain gap”.

Recent approaches to Scene Text Recognition have attempted to address the domain gap by adopting Domain Adaptation techniques, which try to minimize the discrepancy between the two domains in a semi-supervised manner. However, when the gap between the two domains is too large, Domain Adaptation may not be effective.

This thesis proposes and evaluates a Gradual Domain Adaptation approach, which trains the model on multiple intermediate domains in order to minimize the gap before training on the final domain. This technique is evaluated in terms of its effectiveness in reducing the domain gap, as well as the impact of adding intermediate domains, changing domains, and finding the appropriate domain routing.

As a result, the experiments are able to confirm the following:

1. Gradual Domain Adaptation improves Scene Text Recognition baseline performance by up to 3.12%, compared to the Domain Adaptation approach.


2. Switching the order of domains improves performance by 1.56%.

3. Performance is consistently improved when adapting with increasingly stable domain routings. We observe a performance boost of up to 5.81% in our experiments.

Moreover, our method was able to outperform state-of-the-art approaches by leveraging the domain routing approach. This demonstrates the potential of this approach in scene text recognition.

Keywords:

Gradual Domain Adaptation, Domain Routing, Domain Adaptation, Scene Text Recognition, Unsupervised Domain Adaptation, Pseudo-labeling.

Chapter 1

Introduction

Scene Text Recognition (STR) is a particularly important task in this domain, as it has numerous applications in areas such as automated number plate recognition for vehicles, access control systems, and much more.

However, the challenge of Scene Text Recognition is that it requires large amounts of annotated data, which is very expensive and time-consuming to collect. As such, a common approach is to leverage generated synthetic data during training and test the result on real-world data. Unfortunately, this approach can be ineffective, as there is often a large discrepancy between the synthetic data used for training and the testing data in the real world, referred to as a “domain shift”.

Recent approaches to Scene Text Recognition have attempted to address the domain shift by adopting Domain Adaptation techniques, which try to minimize the discrepancy between the two domains in a semi-supervised manner. However, when the gap between the two domains is too large, Domain Adaptation may not be effective.

This thesis proposes and evaluates a Gradual Domain Adaptation approach, which trains the model on multiple intermediate domains in order to minimize the gap before training on the final domain. This technique is evaluated in terms of its effectiveness in reducing the domain gap, as well as the impact of adding intermediate domains, changing domains, and finding the appropriate domain routing. The main research questions of this thesis are:

1. What is the performance of Gradual Domain Adaptation in Scene Text Recognition?

2. Do intermediate domain routings affect the overall performance of a Scene Text Recognition model using Gradual Domain Adaptation?

3. How can a good domain routing be chosen when applying Gradual Domain Adaptation to Scene Text Recognition?

1.2 Scene Text Recognition

Scene Text Recognition (STR) is an important task in the fields of computer vision and natural language processing, and has been heavily studied due to its many useful applications. Scene Text Recognition is capable of recognizing text components in a wide range of settings, from street signs and license plates to newspaper headlines, advertisements, and digital images and videos. This technology enables tasks that would otherwise be impractical, such as quickly and accurately searching, translating, and verifying documents. In addition to aiding law enforcement, Scene Text Recognition can also be used to help companies read customer reviews and detect text in security documents for automated verification. It is also useful across a variety of industries, from automotive to healthcare and even retail, as it can be used to build more efficient and secure systems. In short, Scene Text Recognition is a versatile technology with applications in multiple fields, where it can improve the speed and accuracy of various tasks.

The input of this task is an image containing a text instance, and the output of the task is the corresponding text sequence. As an illustration, Figure 1.1 shows an example of the input and output of the Scene Text Recognition task. Specifically, the image contains the text "MOVING", which is the input of the Scene Text Recognition task. The corresponding output of the task is the text sequence "MOVING", which can be seen in the output box. By recognizing the text components present in a scene, the Scene Text Recognition task can be used to perform a range of tasks, from automating document processing to providing assistance for the visually impaired.

FIGURE 1.1: Scene Text Recognition task: output the text content in the image. The sample image is taken from the ICDAR 2015 dataset.

Despite the progress made on Scene Text Recognition, the task still faces several challenges. A key challenge is the diversity of the conditions of the input images. The text components can be in various fonts, sizes, colors, orientations, and even shapes. This means that a single recognition algorithm may not work optimally across different types of images. Additionally, the background can contain noise, complex patterns, and other distracting elements, making the task more difficult.

For instance, Figure 1.2 demonstrates some of the common challenges of Scene Text Recognition, such as irregular fonts (Fig. 1.2a), noisy background (Fig. 1.2b), irregular text orientation (Fig. 1.2c), and uneven lighting/obstructed texts (Fig. 1.2d). These conditions make it difficult to process the images correctly, as the system must take into account the context of the image in order to accurately recognize the text components. As a result, a robust and accurate Scene Text Recognition algorithm must be able to handle possible variations in the input images.

An additional challenge lies in the limited amount of annotated data available for training. To address this issue, researchers often resort to the use of synthetic data to train their models, since it is possible to generate large amounts of data in this manner [29][20]. While this approach can be successful in some cases, the models trained on synthetic data tend to have poor performance when applied to real-world data due to the domain gap. To reduce this domain gap and increase the accuracy of the models, recent works have adopted domain adaptation techniques to bridge the difference between synthetic and real-world data.

1.3 Domain Adaptation

Domain Adaptation (DA) is a machine learning technique for bridging the gap between two different datasets, particularly in cases where the training and testing data have different distributions (Fig. 1.3). It has become increasingly popular in the field of Scene Text Recognition (STR), as it can help to minimize the discrepancy between synthetic and real-world data, which is commonly referred to as the domain gap. There are two branches of Domain Adaptation, namely Supervised Domain Adaptation and Unsupervised Domain Adaptation. In this work, we mainly focus on the more widely adopted branch: Unsupervised Domain Adaptation.

FIGURE 1.2: Common challenges in Scene Text Recognition. Panels include (C) Curved texts and (D) Obstructed texts.

FIGURE 1.3: Domain gap in Scene Text Recognition. Panels: source domain (MJSynth [1]) and target domain.

The approaches used for Unsupervised Domain Adaptation in Scene Text Recognition can be broadly classified into two categories: Self-trained Domain Adaptation and Adversarial Domain Adaptation. In Self-trained Domain Adaptation, the model is trained on a labeled dataset in the source domain and an unlabeled dataset in the target domain, and the aim is to minimize the distributional discrepancy between the two domains [4][39][38][80][79][70]. Adversarial Domain Adaptation approaches, on the other hand, focus on learning a domain-invariant feature representation by training a generative adversarial network to discriminate between the source and target domains.

To bridge the gap between the source and target datasets, we focus our evaluation on Self-trained Domain Adaptation for the Scene Text Recognition task. This approach commonly uses Pseudo-labeling to adapt from a labeled source dataset, typically a synthetic one, to an unlabeled target dataset, usually real-world data. However, due to the large domain gap between the two datasets, Pseudo-labeling can be ineffective and lead to the model being trained on wrong labels [33][34]. As such, to improve the performance of the model, the domain adaptation process should be approached in a more gradual manner. To this end, we explore the idea of gradually training the model on intermediate domains, instead of jumping directly from the source to the target domain. This strategy introduces a series of intermediate domains that progressively bridge the gap between the source and target domains, allowing for a more effective domain adaptation. By doing this, the model is able to learn from the labeled source data more effectively, as well as from the unlabeled target data, thus being able to bridge the gap between the two datasets and perform better on the Scene Text Recognition task.
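To make the pseudo-labeling step concrete, the following is a minimal sketch of one common variant, in which only predictions above a confidence threshold are kept as training labels for the unlabeled target data. The function name `select_pseudo_labels`, the threshold value, and the sample predictions are illustrative assumptions, not the exact procedure used in this thesis.

```python
from typing import List, Tuple

# Each prediction is (image_id, predicted_text, confidence in [0, 1]).
Prediction = Tuple[str, str, float]


def select_pseudo_labels(
    predictions: List[Prediction],
    threshold: float = 0.9,  # assumed value, chosen only for illustration
) -> List[Tuple[str, str]]:
    """Keep (image_id, text) pairs whose confidence reaches the threshold."""
    return [(img, text) for img, text, conf in predictions if conf >= threshold]


# Hypothetical model outputs on unlabeled real-world images.
preds = [
    ("img_001", "MOVING", 0.97),
    ("img_002", "M0VlNG", 0.41),  # low confidence, likely a wrong label
    ("img_003", "GIFTS", 0.92),
]
pseudo_labeled = select_pseudo_labels(preds, threshold=0.9)
```

When the domain gap is large, few predictions pass the threshold or confident predictions are wrong, which is the failure mode that motivates a more gradual adaptation.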

1.4 Gradual Domain Adaptation

Gradual Domain Adaptation (GDA) is a relatively novel technique for dealing with large domain gaps between source and target datasets. This technique is based on the concept of gradually adapting the model by introducing multiple intermediate domains during the training process [57][9][34][60]. This gradual approach allows the model to benefit from both the labeled source data and the unlabeled target data, which can reduce the domain gap and improve the performance of the model. In particular, instead of directly training the model on the source and target domains, Gradual Domain Adaptation uses multiple intermediate domains as bridges between the two datasets.
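The benefit of intermediate domains can be illustrated with a deliberately small toy problem. The sketch below uses a two-class nearest-centroid classifier on 1-D points, which is an assumption made for brevity and is not an STR model; the shift values and helper names are hypothetical. Self-training directly on a far-shifted target mislabels one class, while self-training through two intermediate shifts tracks both classes.

```python
from typing import List, Sequence, Tuple

Centroids = Tuple[float, float]  # one centroid per class (0 and 1)


def predict(centroids: Centroids, x: float) -> int:
    """Nearest-centroid classifier for two classes."""
    return 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1


def self_train(centroids: Centroids, unlabeled: Sequence[float]) -> Centroids:
    """One adaptation step: pseudo-label the data, then refit the centroids."""
    groups: Tuple[List[float], List[float]] = ([], [])
    for x in unlabeled:
        groups[predict(centroids, x)].append(x)
    # Keep the old centroid if a class received no pseudo-labels.
    c0, c1 = (sum(g) / len(g) if g else c for g, c in zip(groups, centroids))
    return (c0, c1)


def make_domain(shift: float) -> Tuple[List[float], List[int]]:
    """Toy domain: class 0 around `shift`, class 1 around `shift + 4`."""
    offsets = (-0.25, 0.0, 0.25)
    xs = [shift + o for o in offsets] + [shift + 4.0 + o for o in offsets]
    return xs, [0, 0, 0, 1, 1, 1]


def accuracy(centroids: Centroids, xs: Sequence[float], ys: Sequence[int]) -> float:
    return sum(predict(centroids, x) == y for x, y in zip(xs, ys)) / len(ys)


source_model: Centroids = (0.0, 4.0)  # fitted on the labeled source domain
target_x, target_y = make_domain(4.5)  # labels used for evaluation only

# Direct adaptation: source -> target in one step.
direct = self_train(source_model, target_x)

# Gradual adaptation: source -> two intermediate domains -> target.
gradual = source_model
for shift in (1.5, 3.0, 4.5):
    xs, _ = make_domain(shift)
    gradual = self_train(gradual, xs)

direct_acc = accuracy(direct, target_x, target_y)
gradual_acc = accuracy(gradual, target_x, target_y)
```

In this toy setting the direct model assigns every target point to class 1, whereas the gradual model follows the shift step by step. The example only illustrates the mechanism and makes no claim about the size of the effect on real STR data.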

The Gradual Domain Adaptation technique has mainly been studied in the context of image classification, and has been found to be effective in bridging the gap between source and target domains and improving the performance of the model [76]. In this thesis, we investigate the application of Gradual Domain Adaptation to Scene Text Recognition, and evaluate its effectiveness in reducing the domain gap and improving the performance of the model. We assess how adding additional intermediate domains, changing domains, and finding the appropriate domain routing affect the result of Domain Adaptation in Scene Text Recognition, and discuss the implications of the results. We also explore the various benefits of using Gradual Domain Adaptation in this setting, such as the ability to better adapt to the target domain, and the potential for more accurate predictions due to the combination of labeled source data and unlabeled target data. By exploring these topics, we aim to gain a better understanding of the potential of Gradual Domain Adaptation in the field of Scene Text Recognition.

FIGURE 1.4: Gradual Domain Adaptation for Scene Text Recognition.

1.5 Scope

To further explore the effectiveness of the Gradual Domain Adaptation approach, this thesis will focus on the following objectives:

1. Evaluating how adding intermediate domains affects the result of Pseudo-labeling Domain Adaptation in Scene Text Recognition. We will investigate how adding new domains can help provide additional training data to improve the performance of the model and how it can also help reduce the domain gap between source and target domains.

2. Evaluating how changing domains affects the result of Pseudo-labeling Domain Adaptation in Scene Text Recognition. We will look into how different domain combinations and sequences can potentially lead to improved performance of the model. We will also assess how different domains can provide different types of information to the model, and how this can improve the overall result.

3. Evaluating how to find appropriate domain routing for Pseudo-labeling Domain Adaptation in Scene Text Recognition. We will analyze how different domain routing strategies can be used to effectively transfer knowledge from source to target domains, and how this can be used in practice.
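The objectives above concern which intermediate domains are used and in what order. As a rough sketch, and assuming that a domain routing can be modeled as an ordered sequence of intermediate domains between the source and the target (an assumption made here for illustration), the candidate routings can be enumerated as follows. The domain names "A" and "B" are placeholders, not the datasets used in this thesis.

```python
from itertools import permutations
from typing import List, Tuple

Routing = Tuple[str, ...]


def enumerate_routings(intermediate: List[str], target: str) -> List[Routing]:
    """List every routing: source, an ordered subset of intermediates, target."""
    routings: List[Routing] = []
    for k in range(len(intermediate) + 1):
        for order in permutations(intermediate, k):
            routings.append(("source",) + order + (target,))
    return routings


# Two hypothetical intermediate domains give five candidate routings:
# the direct route, two single-hop routes, and two orderings of both hops.
routings = enumerate_routings(["A", "B"], "target")
```

The number of routings grows quickly with the number of intermediate domains, which is one reason a principled way of choosing a routing is of interest.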

1.6 Structure of This Thesis

This thesis is divided into five chapters, each of which is designed to provide the reader with a comprehensive overview of the topic:

• Chapter 1 - Introduction: this section provides an overview of Scene Text Recognition (STR), Domain Adaptation, and Gradual Domain Adaptation. This section will seek to provide a comprehensive overview of the current state of the field, as well as a review of the literature on this topic.

• Chapter 2 - Related works: this section discusses the existing works in the field of Scene Text Recognition and Domain Adaptation. This section will include a review of the various works in the field of Scene Text Recognition and DA.

• Chapter 3 - Gradual Domain Adaptation in Scene Text Recognition via Pseudo-labeling: this section explains the proposed Gradual Domain Adaptation approach and outlines the objectives of this thesis. This section will analyze the effectiveness of the proposed approach and provide a detailed description of the framework.

• Chapter 4 - Experiments: this section presents the experimental results of the proposed approach and evaluates how adding intermediate domains, changing domains, and finding the appropriate domain routing affect the result of Domain Adaptation in Scene Text Recognition. This section will include a thorough analysis of the various experiments conducted, as well as a discussion of the results and implications.

• Chapter 5 - Discussions: this section summarizes the findings of this thesis and outlines possible future work in this area. This section will provide an overview of the implications of the findings, as well as potential applications and avenues for further exploration.


Chapter 2

Related Works

In this section, we review the literature on Scene Text Recognition methods. We then discuss the recent attempts to apply Domain Adaptation and Gradual Domain Adaptation techniques to Scene Text Recognition. Finally, we discuss the related datasets used for training, adaptation, and evaluation.

2.1 Scene Text Recognition

Scene Text Recognition (STR) is a widely used tool for recognizing text in scene images. Deep learning methods have quickly become the go-to approach for image text reading, and have achieved impressive results. Baek et al. (2019) [5] proposed a comprehensive STR model framework that combines existing related studies into a single framework. This framework is composed of four stages: Transformation, Feature Extraction, Sequence Modeling, and Prediction.

FIGURE 2.1: Illustration of Scene Text Recognition framework. The input image passes through the Trans., Feat., and Seq. stages in turn.


In the Transformation stage, images of scene text are converted into a suitable format for further processing. Feature Extraction follows, where the transformed images are processed to extract relevant features from the scene text such as font, color, size, and background. Next, the Sequence Modeling stage uses the extracted features to capture the contextual information within a sequence of characters. Finally, the Prediction stage uses the sequence model to make predictions about the output character sequence. With the combined strength of various techniques, this STR model framework is able to achieve strong performance on a variety of STR tasks.
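Because each of the four stages can be filled by a different module, a model in this framework can be described by one choice per stage. The sketch below enumerates such combinations, using the module names discussed in this chapter (STN, VGG, RCNN, ResNet, BiLSTM, CTC, Attn). The `STRConfig` class, the use of `None` for a skipped stage, and the assumption that the Transformation stage is optional are illustrative choices, not an existing API.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional


@dataclass(frozen=True)
class STRConfig:
    """One module choice per stage; None means the stage is skipped."""

    transformation: Optional[str]
    feature_extraction: str
    sequence_modeling: Optional[str]
    prediction: str

    def name(self) -> str:
        parts = [
            self.transformation,
            self.feature_extraction,
            self.sequence_modeling,
            self.prediction,
        ]
        return "-".join(p if p is not None else "None" for p in parts)


def all_configs() -> List[STRConfig]:
    """Enumerate every combination of the assumed module options."""
    return [
        STRConfig(t, f, s, p)
        for t, f, s, p in product(
            (None, "STN"),
            ("VGG", "RCNN", "ResNet"),
            (None, "BiLSTM"),
            ("CTC", "Attn"),
        )
    ]


configs = all_configs()
example = STRConfig("STN", "ResNet", "BiLSTM", "Attn")
```

With two, three, two, and two options for the respective stages, this yields 24 combinations, for example `STN-ResNet-BiLSTM-Attn`.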

2.1.1 Transformation

The Transformation component of a Scene Text Recognition model is responsible for preparing the input images for further processing. It is a crucial stage that enables the model to extract useful features from the image. It involves converting the input images into a format that is suitable for feature extraction, such as transforming the image into a binary representation or converting it into a set of contours and shapes. This process is particularly important since it allows the model to identify features from the image that would otherwise be unidentifiable. In essence, the Transformation component helps the model to interpret the input image in a way that is suitable for feature extraction. As a result, the model can more effectively process the image and extract meaningful features from it.

Additionally, text images in natural scenes come in diverse shapes, as shown by curved and tilted texts. These text images, when fed unaltered, can pose a challenge to the feature extraction stage, as it needs to learn an invariant representation with respect to the complex geometry of the input image. To overcome this problem, several methods have been proposed. For example, Shi et al. (2016) and Liu et al. (2016) [42] introduced a spatial transformer network (STN) [28] that rectifies the entire text before recognition. This method can be effective in addressing perspective distortion in scene text, although it is limited in its ability to handle more complex forms of distortion. To address this issue, CharNet was proposed, which introduces a character-level spatial transformer to rectify individual characters, allowing it to be more effective in addressing more complex distortions that cannot easily be modeled by a single global transformation.

Input Image Rectified Image

FIGURE 2.2: Transformation stage from

By performing these pre-processing steps, with flexibility to diverse aspect ratios of text lines, the model can extract useful features more effectively and make more accurate predictions.


2.1.2 Feature Extraction

The Feature Extraction component of a Scene Text Recognition model is an indispensable part of the model that allows it to effectively recognize text from images. It is responsible for extracting the important features from a given image. The extracted features are then used as input to the Sequence Modeling or Prediction stage and enable the model to make more accurate predictions about the text contained in the image. This is done by extracting the edges, shapes, and other structural elements of the image that are necessary for accurate recognition.

We study three architectures, VGG [53], RCNN [36], and ResNet [15], which have been previously used as feature extractors for STR.

The Visual Geometry Group (VGG) [53] architecture is a deep convolutional neural network (CNN) that has been widely used for image recognition tasks, particularly in scene text recognition. It consists of multiple convolutional layers followed by a few fully connected layers, enabling it to extract complex features from an image and accurately identify objects within it. This makes VGG a suitable choice for Scene Text Recognition tasks, as it can identify text components in a wide range of conditions, such as varying fonts, sizes, and backgrounds. Additionally, its use of a very deep convolutional neural network, with up to 19 layers, gives it the ability to accurately classify images, making it a powerful tool for STR.

Recurrent Convolutional Neural Networks (RCNNs) are a type of CNN that can adjust their receptive fields based on the character shapes in an image. This is done by recursively applying a convolutional layer and a non-linear function to the input image, allowing the size and shape of the receptive field to be adjusted. This makes RCNNs well suited to scene text recognition, as they can better recognize specific shapes and patterns. RCNNs are composed of multiple convolutional layers, followed by fully connected layers, which allow them to extract complex features from an image and accurately identify objects within it. Additionally, they are capable of learning and adapting to changes in the input, resulting in improved performance over time.

ResNet (Residual Network) is a type of deep convolutional neural network (CNN) that has been used in a variety of tasks, including scene text recognition. It consists of multiple convolutional layers connected by a series of residual connections, which allows the network to learn more efficiently by reusing features from earlier layers. Additionally, the use of residual connections helps to reduce the vanishing gradient problem and enables deeper networks to be trained more effectively. As a result, ResNet has been demonstrated to achieve state-of-the-art performance in image recognition tasks, making it a suitable choice for scene text recognition.

2.1.3 Sequence Modeling

The Sequence Modeling component of a Scene Text Recognition model is a critical component for providing the context necessary for accurately recognizing text. This component takes the features extracted by the Feature Extraction component and applies context-aware sequence modeling techniques such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs). It uses these algorithms to predict the probability of a sequence of characters based on their context and the co-occurrence of characters. This helps the model to better recognize words and long sequences of characters, even when the characters are in different fonts or have different orientations. This is an essential element for Scene Text Recognition models, as it allows the model to better understand the image and make more accurate predictions.


On the other hand, using the Sequence Modeling component may harm Scene Text Recognition models by increasing computational complexity and memory consumption. As a result, many models choose not to use Sequence Modeling, even though it lowers accuracy, in order to obtain a simpler model. Baek’s framework [5] allows for the selection or de-selection of Sequence Modeling. Models using Sequence Modeling are referred to as “language-based”, whereas those that do not are referred to as “language-free”.

2.1.3.1 Language-free methods

Language-free methods typically employ convolutional features without regard for character dependency. They are divided into two categories: CTC-based and segmentation-based [57] methods.

CTC-based methods [26][31][23][30] first extract visual features through CNNs and then train the CNN and RNN end-to-end using CTC loss to find the most

2.1.3.2 Language-based methods

After attention-based methods became popular, language-based methods mainly employed the attention mechanism, which implicitly models language using more powerful RNNs [5] or Transformers [4]. The encoder-decoder architecture makes use of linguistic information and character dependency. To boost performance, some methods focus on learning a new feature representation. For example, some previous works use Bidirectional Long Short-Term Memory (BiLSTM) to make a better sequence after the feature extraction stage [50][52][11]. [1] proposed a contrastive learning algorithm that first divides each feature map into a sequence of individual elements and applies a contrastive loss; the learned representation features are then fed to the recognizer. Yan et al. proposed a primitive representation learning method that aims to exploit intrinsic representations of scene text images.

Others may focus on integrating a rectification module [51][44][71][67] to reconstruct normal images from irregular images. Then, the reconstructed images are fed to the encoder-decoder module for further recognition. However, scene text usually has a variety of shapes and sizes, making it difficult for the rectification module to transform all the irregular text instances into regular ones.

Since current attention-based methods suffer from the attention drift problem, directly decoding upon the convolutional features or linguistic features will degrade recognition performance. Wang et al. (2020) [64] address this by generating character center masks that help focus attention on the right position.

2.1.4 Prediction

The Prediction component, the final step in the Scene Text Recognition model, is responsible for making predictions about the text contained in an image. This component uses the features extracted in the Feature Extraction stage and the linguistic information of the Sequence Modeling stage to make a prediction about the sequence of characters in the text. This is done by taking the probabilities generated by the previous component and applying them to the characters in the image. The resulting probabilities are then used to rank the characters in the image, and the model can then output the most likely sequence of characters. Additionally, the Prediction component can also be used for post-processing, such as correcting errors or applying techniques such as spell-correction or text-normalization. By using the Prediction component, the model can make more accurate predictions and provide useful post-processing features, resulting in improved performance on the Scene Text Recognition task.

We have two options for prediction: Connectionist Temporal Classification (CTC) [18] and attention-based sequence prediction (Attn) [52][11].

CTC is a powerful tool for predicting a sequence of non-fixed length even when a fixed number of features is given. This is done by predicting a character at each column (h_i ∈ H) and then converting the full character sequence into a non-fixed stream of characters by deleting repeated characters and blanks [18][50].
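The collapsing rule described above can be illustrated with a minimal Python sketch. It assumes a greedy (best-path) reading of the per-column predictions. The names `ctc_collapse` and `ctc_greedy_decode`, and the blank symbol `-`, are hypothetical choices made for illustration and are not taken from any cited implementation.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol, chosen only for this sketch


def ctc_collapse(per_column_chars, blank=BLANK):
    """Merge consecutive repeated characters, then delete blanks."""
    merged = [ch for ch, _ in groupby(per_column_chars)]
    return "".join(ch for ch in merged if ch != blank)


def ctc_greedy_decode(probs, charset, blank=BLANK):
    """Pick the most probable character at each column h_i, then collapse.

    probs   : list of columns, each a list of probabilities over charset
    charset : list of characters, including the blank symbol
    """
    best = [charset[max(range(len(col)), key=col.__getitem__)] for col in probs]
    return ctc_collapse(best, blank)


# Five feature columns over the toy charset ["-", "a", "b"].
toy_charset = ["-", "a", "b"]
toy_probs = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a (repeat, merged with the previous column)
    [0.90, 0.05, 0.05],  # blank (separates the next "a" from the first)
    [0.20, 0.70, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
]
decoded = ctc_greedy_decode(toy_probs, toy_charset)
```

For example, the per-column prediction `hh-e-ll-llo-` collapses to `hello`: the blank between the two groups of `l` is what allows a doubled letter to survive the merging of repeats.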

Attn, on the other hand, automatically captures the information flow within the input sequence to predict the output sequence [6]. It allows an STR model to learn a character-level language model representing output class dependencies.
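As a rough illustration of how an attention-based decoder weighs the input sequence at one decoding step, the sketch below computes attention weights over the feature columns and the resulting context vector. It is a simplification under stated assumptions: dot-product scoring is used here for brevity (the cited models may use a different scoring function), and `attention_step`, `features`, and `state` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(features, state):
    """One decoding step: score each feature column against the decoder state.

    features : list of feature vectors (one per column of the input)
    state    : current decoder state vector (same dimension as the features)
    Returns the attention weights and the weighted sum of the features.
    """
    # Assumption: dot-product scoring between the state and each column.
    scores = [sum(h_k * s_k for h_k, s_k in zip(h, state)) for h in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * h[k] for w, h in zip(weights, features)) for k in range(dim)]
    return weights, context


toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_state = [0.0, 2.0]
weights, context = attention_step(toy_features, toy_state)
```

In this toy input the second and third columns align with the decoder state, so they receive equal and larger weights than the first column. The context vector would then be combined with the previously predicted character to predict the next one, which is how the decoder can capture dependencies between output classes.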

2.2 Domain Adaptation

Domain shift (or domain gap) is an important issue in machine learning: a model’s performance degrades when it is applied to a new domain. Domain shift occurs when a model is trained on data from one domain but is then applied to data from another domain that it is not familiar with. This can decrease accuracy and robustness, as the model is not able to accurately recognize text from the new domain [61][19][33][47][24].

To address this issue, Domain Adaptation has been proposed to bridge the gap between training and testing data by adapting trained models to the test distribution with the help of data from the target domain [75][43][14]. This is done by transferring the knowledge of the source domain, i.e., the training distribution, to the target domain, allowing the model to adapt to the new domain while maintaining its original performance. The goal is to minimize the domain discrepancy so that the model trained on the source domain can perform well on the target domain. Domain Adaptation techniques can be used to adapt a model to different domains, such as different languages, different geographic areas, or different types of data. In Scene Text Recognition in particular, most models are trained on synthetic data and evaluated on real-world data. Synthetic data is generated to simulate real-world data, but it still does not cover the full complexity of real-world data, such as fonts, backgrounds, and styles, leading to a large gap between the two domains. As a result, we employ Domain Adaptation to address the domain shift issue in Scene Text Recognition. With Domain Adaptation, models can remain generalizable and can be more easily deployed in various contexts.

Domain Adaptation is classified into two main types: Supervised Domain Adaptation (SDA) and Unsupervised Domain Adaptation (UDA). SDA relies on labeled target-domain data but works well with a limited number of labels. UDA, on the other hand, employs unlabeled target-domain data but requires a large number of target samples [13]. In this thesis, we use synthetic data for training and labeled real-world data for evaluation. However, due to the scarcity of labeled real-world data and the abundance of unlabeled real-world data, we use unlabeled real-world data as the target domain and attempt to align the distributions of synthetic and real-world data using Unsupervised Domain Adaptation.


Unsupervised Domain Adaptation

In recent years, various strategies for UDA have been proposed. One of the most widely used methods is invariant representation learning [56][76]. In this method, adversarial training is typically used to learn feature representations that are invariant across the source and target domains [14]. More recently, self-training (also known as pseudo-labeling) has been adapted for UDA [39][38][80][79][70]. The basic idea of pseudo-labeling is that a source-trained classifier produces pseudo-labels for unlabeled target-domain data, which are then used to further improve the classifier.
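The pseudo-labeling step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any cited work: the confidence threshold is an assumption (a common but optional filter), and `pseudo_label`, `toy_predict_proba`, and the labels `pos`/`neg` are hypothetical names standing in for a real source-trained classifier.

```python
import math


def pseudo_label(predict_proba, unlabeled, threshold=0.9):
    """Label unlabeled target samples with the current classifier.

    predict_proba : function mapping a sample to {label: probability}
    unlabeled     : iterable of target-domain samples
    threshold     : assumed confidence cut-off; less confident samples are skipped
    Returns (sample, pseudo_label) pairs that can be added to the training set.
    """
    selected = []
    for x in unlabeled:
        probs = predict_proba(x)
        label = max(probs, key=probs.get)
        if probs[label] >= threshold:
            selected.append((x, label))
    return selected


def toy_predict_proba(x):
    """Stand-in for a source-trained classifier: a fixed logistic model."""
    p_pos = 1.0 / (1.0 + math.exp(-x))
    return {"pos": p_pos, "neg": 1.0 - p_pos}


target_samples = [-4.0, -0.1, 0.2, 3.0]
pseudo_labeled = pseudo_label(toy_predict_proba, target_samples)
```

Here only the two samples far from the decision boundary receive pseudo-labels; the classifier would then be retrained on the source data together with these pairs, and the process can be repeated.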

Unsupervised Domain Adaptation in Scene Text Recognition

Scene Text Recognition has utilized both self-training and adversarial training for Unsupervised Domain Adaptation. Azadi et al. (2018) presented the geometry-aware domain adaptation network (GA-DAN), which converts a synthetic text image to a real scene text image and then uses the converted image to train the target recognition model. Baek et al. (2021) [4] employed pseudo-labeling as self-training to improve STR performance while utilizing only real data to train the STR model. Zhang et al. (2021) developed a Sequence-to-Sequence Domain Adaptation Network (SSDAN) for robust text image recognition [73]. Zheng et al. (2022) proposed a domain adaptation framework for scene text recognition that combines pseudo-labeling and adversarial learning approaches.

2.3 Gradual Domain Adaptation

The most prominent issue of Unsupervised Domain Adaptation is the large domain shift between the source and target domains, which occurs when the two domains differ significantly. Previous theoretical studies showed that as the gap between the two domains widens, the generalization error of UDA also increases [76][7]. It may be challenging to adapt to the target domain in a one-off manner because of a possibly large shift between the two domains, and it has been observed that existing UDA algorithms do not perform well under large shifts [63][64].
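One classical way to make this dependence explicit is the domain adaptation bound of Ben-David et al. It is given here only as an illustration; whether [76] and [7] state the result in exactly this form is an assumption. For a hypothesis h in a class H, with source error ε_S, target error ε_T, and source and target distributions D_S and D_T:

```latex
\[
\epsilon_T(h) \;\le\; \epsilon_S(h)
  + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  + \lambda,
\qquad
\lambda = \min_{h' \in \mathcal{H}} \bigl[\epsilon_S(h') + \epsilon_T(h')\bigr]
\]
```

The second term grows with the divergence between the source and target distributions, so a larger domain gap loosens the guarantee on the target error even when the source error is small.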

One obvious solution to the large domain shift problem is to split a large shift into multiple smaller shifts. Inspired by curriculum learning and the “divide-and-conquer” idea, Gradual Domain Adaptation (GDA) [57][9][34][60] was developed to address the issue of significant gaps between adapted domains. Curriculum learning suggests beginning the training process with easier samples and progressively making them more difficult. To the source and target domains of UDA, GDA adds additional unlabeled data as intermediate domains that gradually shift from the source to the target, which helps the model adapt to intermediate distributions before being exposed to the target domain. According to [24], the generalization gap (from source to target distribution) will be significantly smaller if learning algorithms are exposed to incremental changes in the data distribution throughout a self-training regime.
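The effect of adapting through intermediate domains can be reproduced on a toy problem. The sketch below is an illustration under strong assumptions and is not the method used in this thesis: the "model" is a one-dimensional nearest-centroid classifier, each domain is a translated copy of the source, and all names (`fit_centroids`, `self_train`, `gradual_self_train`, `make_domain`) are hypothetical.

```python
def fit_centroids(xs, ys):
    """Toy 'model': the mean position of each class."""
    cents = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        cents[label] = sum(pts) / len(pts)
    return cents


def predict(cents, xs):
    """Assign each point to the class with the nearest centroid."""
    return [min(cents, key=lambda c: abs(x - cents[c])) for x in xs]


def self_train(cents, xs):
    """One self-training round: pseudo-label xs, then refit on the pseudo-labels."""
    return fit_centroids(xs, predict(cents, xs))


def gradual_self_train(cents, domains):
    """Self-train on each unlabeled domain in order, from near-source to target."""
    for xs in domains:
        cents = self_train(cents, xs)
    return cents


def make_domain(shift):
    """Two classes centered at shift - 1 and shift + 1 (labels used only for scoring)."""
    offsets = [-1.2, -1.0, -0.8, 0.8, 1.0, 1.2]
    return [shift + o for o in offsets], [0, 0, 0, 1, 1, 1]


def accuracy(cents, xs, ys):
    preds = predict(cents, xs)
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


source_x, source_y = make_domain(0.0)
target_x, target_y = make_domain(3.0)
intermediate = [make_domain(s)[0] for s in (0.5, 1.0, 1.5, 2.0, 2.5)]

source_model = fit_centroids(source_x, source_y)
direct = self_train(source_model, target_x)  # one-off adaptation
gradual = gradual_self_train(source_model, intermediate + [target_x])

direct_acc = accuracy(direct, target_x, target_y)
gradual_acc = accuracy(gradual, target_x, target_y)
```

In this toy setting the one-off adaptation assigns every target sample to the same class and reaches an accuracy of 0.5, while the gradual variant follows the shift step by step and reaches 1.0. Real data is far less regular, so the example only conveys the intuition behind splitting a large shift into smaller ones.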

Gradual self-training, a general machine learning technique recently proposed by Kumar et al. (2020) [54], outperforms vanilla self-training on a number of synthetic and real-world datasets. In parallel, an adversarial adaptation technique for GDA was put forth by Wang et al. (2020) [60]. Furthermore, Abnar et al. (2021) [2] provide a form of gradual self-training that does not require intermediate domain data, since it can generate pseudo-data for the intermediate domains. Zhou et al. (2022) proposed an algorithm based on the teacher-student paradigm with an active query method. A slightly different setting, where labeled data is also available in the intermediate domains, was explored by Dong et al. (2022) [12].


However, Gradual Domain Adaptation has not yet been applied to the Scene Text Recognition task. In this thesis, we investigate the efficacy of the Gradual Domain Adaptation approach for Scene Text Recognition. We focus on applying the gradual approach to self-training because, when there is a large distribution shift between the source and target, the performance of self-training suffers markedly [61]. Additionally, the image classification experiments of Kumar et al. empirically validate the effectiveness of gradual self-training on both synthetic and real datasets. For these reasons, we use Baek et al. (2021) [4] as the baseline when implementing Gradual Domain Adaptation for self-training.

2.4 Datasets

In this section, we introduce the datasets used for the Scene Text Recognition task in three stages: synthetic datasets for training, unlabeled real-world datasets for adaptation, and labeled real-world datasets (i.e., benchmark datasets) for evaluation.

2.4.1 Synthetic Datasets

MJSynth (MJ) is a synthetic dataset created for Scene Text Recognition (STR) tasks. It contains 9 million synthetic images of scene text, each with a single line of text. The dataset is generated by an algorithm that renders realistic text images with varying fonts, sizes, orientations, and backgrounds. MJSynth has been used extensively for training and evaluating Scene Text Recognition models, as it provides a large amount of labeled data. Additionally, the synthetic data allows for the use of data augmentation techniques, which can improve the performance of Scene Text Recognition models.

SynthText (ST) is a synthetic dataset that is generated by combining text with a large collection of background images. It is composed of 800,000 synthetic images and 7M word boxes of text with a variety of fonts, colors, sizes, backgrounds, and shapes. The images are generated using a text-to-image rendering algorithm, and the text is written in Latin-based languages. Additionally, the images are augmented using various transformations such as rotation, scaling, and perspective distortion. This provides a more diverse set of training data for the model to learn from. ST is a valuable resource for creating models that can accurately detect text in different languages and domains. For STR, we crop the texts in scene images and use them for training.

2.4.2 Unlabeled Real-world Datasets

Unlabeled real-world datasets are essential for training Scene Text Recognition (STR) models because they provide a large amount of data that is not limited to a single language or domain. This enables the model to learn from a diverse set of data and contributes to its generalization ability. Furthermore, because these datasets are unlabeled, no manual annotation is required, which can be time-consuming and expensive. As a result, they are a low-cost and efficient way of training STR models. Using real-world data also enables the model to learn from data with more realistic variations, such as different fonts, sizes, and orientations, which can significantly improve its performance on the STR task.

For that reason, we utilize three unlabeled datasets that contain scene images but do not have any word region annotations: Book32, TextVQA, and ST-VQA. Some samples of the three unlabeled real-world datasets are illustrated in Figure 2.3.

Book32 is composed of 208K book cover images in 32 categories, together with title text, author text, and category membership, collected from the Amazon Books dataset. It contains a large number of handwritten or curved texts. This dataset is very challenging and can be used for a variety of tasks, including STR. After cropping, Book32 contains 3.9M word boxes for the STR model.

TextVQA was created for text-based visual question answering and consists of 28,408 OpenImages V3 [23] images from categories that tend to contain text, such as “billboard” and “traffic sign”. After cropping, TextVQA contains 551K word boxes for the STR model.

ST-VQA [8] was created for scene text-based visual question answering. It includes images from datasets such as IC13, IC15, and COCO, so we chose to exclude them from our consolidation. After cropping, ST-VQA has 79K word boxes for the STR model.

2.4.3 Benchmark Datasets

The Street View Text (SVT) dataset is used to evaluate Scene Text Recognition models. Collected from Google Street View, it consists of 257 images for training and 647 images for evaluation, with each image containing a single line of text. The text in the images is in Latin-based languages and comes in a variety of fonts, orientations, and backgrounds.

The IIIT5K-Words (IIIT) dataset contains 5,000 natural scene images of printed English text. It is one of the most popular datasets for scene text recognition and was created by the International Institute of Information Technology (IIIT) in Hyderabad. The images in the dataset were retrieved from Google image searches using terms like "billboards" and "movie posters." It has 2,000 training images and 3,000 evaluation images.

FIGURE 2.3: Some samples of three unlabeled real-world datasets: (A) Book32, (B) TextVQA, (C) ST-VQA.

ICDAR2013 (IC13) was created for the ICDAR 2013 Robust Reading competition. It contains 848 images for training and 1,015 images for evaluation.

ICDAR2015 (IC15) was collected by people wearing Google Glass; as a result, many of the images contain perspective texts and some of them are blurry. It contains 4,468 images for training and 2,077 images for evaluation.

SVT Perspective (SP) is collected from Google Street View, similar to SVT. Unlike SVT, SP contains many perspective texts. It contains 645 images for evaluation.

CUTE80 (CT) is collected for curved text. The images are captured by a digital camera or collected from the Internet. It contains 288 cropped images for evaluation.


Title	Gradual Domain Adaptation in Scene Text Recognition
Authors	Ho Chung Duc Khanh, Nguyen Thi Minh Phuong
Supervisor	PhD. Thanh Duc Ngo
University	University of Information Technology
Major	Computer Science
Type	Bachelor Thesis
Year of publication	2023
City	Ho Chi Minh City
