By recognizing the text components present in a scene, scene text recognition enables a range of applications, from automating document processing to providing assistance.
Synthetic datasets for training
The key challenge of scene text recognition lies in the limited amount of annotated data available for training. To address this issue, researchers often resort to synthetic data to train their models, since large amounts of data can be generated in this manner [27, 20].

• MJSynth (MJ) [27] is a synthetic dataset created for Scene Text Recognition (STR) tasks. It contains 9 million synthetic images of scene text, each with a single line of text. The dataset is generated by an algorithm that renders realistic text images with varying fonts, sizes, orientations, and backgrounds. MJSynth has been extensively used for training and evaluating scene text recognition models, as it provides a large amount of labeled data. Additionally, the synthetic data allows for the use of data augmentation techniques, which can improve the performance of scene text recognition models.

• SynthText (ST) [20] is a synthetic dataset generated by combining text with a large collection of background images. It is composed of 800,000 synthetic images and 7 million word boxes of text with a variety of fonts, colors, sizes, backgrounds, and shapes. The images are generated using a text-to-image rendering algorithm, and the text is written in Latin-based languages. Additionally, the images are augmented using transformations such as rotation, scaling, and perspective distortion, which provides a more diverse set of training data. ST is a valuable resource for creating models that can accurately detect text in different languages and domains. For STR, we crop the text regions from the scene images and use them for training.
FIGURE 2.4: The process of creating the MJSynth dataset: font rendering, border/shadow and color, composition, projective distortion, and natural image blending. The image is taken from [27].
Real datasets
Real-world datasets are essential for training Scene Text Recognition (STR) models. However, collecting real data is both time-consuming and costly. Baek et al. [2] and Bautista et al. [6] have compiled several real datasets collected from text recognition competitions and publicly available data.

• IIIT5K-Words (IIIT) [51] contains 5,000 natural scene images of printed English text (3,000 for training, 2,000 for evaluation). It is one of the most popular datasets for scene text recognition and was created by the International Institute of Information Technology (IIIT) in Hyderabad. The images in the dataset were retrieved from Google image searches using terms like "billboards" and "movie posters".

• Street View Text (SVT) [74] is used to evaluate STR models. It consists of 257 training images and 647 evaluation images of Google Street View text, with each image containing a single line of text. The text in the images is in Latin-based languages and comes in a variety of fonts, orientations, and backgrounds.

• ICDAR2013 (IC13) [30] inherits most of IC03's images [47] and was created for the ICDAR 2013 Robust Reading competition. It contains 848 images for training and 1,095 images for evaluation; pruning words with non-alphanumeric characters results in 1,015 images. Researchers have used two different versions for evaluation: 857 and 1,015 images. The 857-image set is a subset of the 1,015-image set in which words shorter than 3 characters are pruned.

• ICDAR2015 (IC15) [31] was created for the ICDAR 2015 Robust Reading competition and contains 4,468 images for training and 2,077 images for evaluation. The images were captured by Google Glass while following the natural movements of the wearer; thus, many are noisy, blurry, and rotated, and some are of low resolution. Again, researchers have used two different versions for evaluation: 1,811 and 2,077 images. Previous papers [5, 12] have used only 1,811 images, discarding non-alphanumeric character images and some extremely rotated, perspective-shifted, or curved images.

• SVT Perspective (SVTP) [55] is collected from Google Street View and contains 645 images for evaluation. Many of the images contain perspective projections due to the prevalence of non-frontal viewpoints.

• CUTE80 (CUTE) [57] is collected for curved text. The images were captured by a digital camera or collected from the Internet. It contains 288 cropped evaluation images.

• COCO-Text (COCO) [71] is created from the MS COCO dataset [41]. As MS COCO is not intended to capture text, COCO contains many occluded or low-resolution texts.

• Uber-Text (Uber) [92] is collected from Bing Maps Streetside. Many images show house numbers, and some show text on signboards.

• ArT [13] is created to recognize Arbitrary-shaped Text; many images contain perspective or curved texts. It also includes Total-Text [8] and CTW1500 [44], which contain many rotated or curved texts.

• ReCTS [89] is created for the Reading Chinese Text on Signboard competition. It contains many irregular texts arranged in various layouts and written with unique fonts.

• LSVT [68, 67] is a Large-scale Street View Text dataset collected from streets in China; thus, much of the text is Chinese.

• MLT19 [53] is created to recognize Multi-Lingual Text. It consists of seven languages: Arabic, Latin, Chinese, Japanese, Korean, Bangla, and Hindi.

• RCTW17 [62] is created for the Reading Chinese Text in the Wild competition; thus, many images contain Chinese text.

• TextOCR [65] is a large arbitrary-shaped STR dataset derived from Open Images, a large dataset with very diverse images that often contain complex scenes.
FIGURE 2.6: Examples of real-world data.
The sample images are taken from [2].
Domain Adaptation
Domain shift [88] (or domain gap) is an important issue in machine learning: a model's performance degrades when it is applied to a new domain. Domain shift occurs when a model is trained on data from one domain but then applied to data from another domain it is not familiar with. This can lead to a decrease in accuracy and robustness, as the model is unable to accurately recognize text from the new domain [73, 19, 32, 58, 24].
To address this issue, Domain Adaptation has been proposed to bridge the gap between training and testing data by adapting trained models to the test distribution with the help of data from the target domain [93, 45, 16]. This is done by transferring knowledge from the source domain, i.e., the training distribution, to the target domain, allowing the model to adapt to the new domain while maintaining its original performance. The goal is to minimize the domain discrepancy so that a model trained on the source domain performs well on the target domain. Domain adaptation techniques can be used to adapt a model to different domains, such as different languages, different geographic areas, or different types of data. This is especially relevant for scene text recognition, where most models are trained on synthetic data and evaluated on real-world data. Synthetic data is generated to simulate real-world data, but it does not cover enough of the complexity of real-world data, such as fonts, backgrounds, and styles, leading to a large gap between the two domains. As a result, we employ domain adaptation to address the domain shift issue in scene text recognition. By using domain adaptation, models remain generalizable and can be more easily deployed in various contexts.
Domain Adaptation is classified into two main types: Supervised Domain Adaptation (SDA) and Unsupervised Domain Adaptation (UDA). SDA relies on labeled target domains but works well with a limited number of labels. UDA, on the other hand, employs unlabeled target domain data but requires a large number of target samples [15].
In this thesis, we use synthetic data for training and labeled real-world data for evaluation. However, due to the scarcity of labeled real-world data and the abundance of unlabeled real-world data, we use unlabeled real-world data as the target domain and attempt to align the distributions of synthetic and real-world data using unsupervised domain adaptation.
Unsupervised Domain Adaptation for STR
In recent years, various strategies for unsupervised domain adaptation have been proposed. The approaches used for UDA can be broadly classified into two categories: self-trained domain adaptation and adversarial domain adaptation. In self-trained domain adaptation, the model is trained on a labeled dataset in the source domain and an unlabeled dataset in the target domain, with the aim of minimizing the distributional discrepancy between the two domains [2, 38, 37, 99, 98, 85]. Adversarial domain adaptation approaches, on the other hand, focus on learning a domain-invariant feature representation by training a generative adversarial network to discriminate between the source and target domains [91, 90, 16].
Several methods have attempted domain adaptation for scene text recognition. Zhan et al. (2018) [87] presented the Geometry-Aware Domain Adaptation Network (GA-DAN), which converts a synthetic text image into a real-scene text image and uses the converted image to train the target recognition model. Baek et al. (2021) [2] employed pseudo-labels for self-training to improve STR performance while utilizing only real data to train the STR model. SMILE [9] uses sequence-to-sequence unsupervised domain adaptation to minimize the latent entropy within the decoder. ICD-DA [70] addresses the imbalance of character distribution during self-training. ASSDA [90] employs an Adversarial Sequence-to-Sequence Domain Adaptation network, which can adaptively transfer coarse global-level and fine-grained character-level knowledge.
To bridge the gap between the source and target datasets, we focus our evaluation on self-trained domain adaptation for the scene text recognition task. This approach commonly uses pseudo-labeling to adapt from a labeled source dataset, typically a synthetic one, to an unlabeled target dataset, usually real-world data. When learning from both synthetic and real datasets at the same time, domain adaptation can improve performance by attempting to reduce the difference between domains. However, due to the large domain gap between the two datasets, pseudo-labeling can be ineffective and lead to the model being trained on wrong labels [94, 7, 32, 33]. We take into account the fact that the domain gap has a progressive tendency. To this end, unlike directly adapting from the source domain to the target domain, we propose to exploit the gradual escalation of the domain gap. This strategy introduces a series of target sub-domains that progressively bridge the gap between the source and target domains, allowing for more effective domain adaptation. By doing this, the model learns from the labeled source data as well as from the unlabeled target data more effectively, bridging the gap between the two datasets and performing better on the scene text recognition task. We also provide observations on how domain adaptation works and optimize the adaptation across domains.
FIGURE 3.1: The overall framework of our proposed Stratified Domain Adaptation (StrDA) approach for STR. Our approach leverages labeled synthetic data and unlabeled real data, eliminating human annotation costs entirely. The entire process is divided into two stages: Domain Stratifying (partitioning the unlabeled real data into groups satisfying Eq. 3.1) and Progressive Self-Training.
Overview
In this thesis, our focus is on addressing the problem using two predefined datasets: labeled data acting as the source domain $S = \{(x_i^S, y_i^S)\}_{i=1}^{|S|}$ and unlabeled data serving as the target domain $T = \{x_i^T\}_{i=1}^{|T|}$. The goal of domain adaptation is to improve the performance of the pre-trained (source) model using both $S$ and $T$.
Unsupervised Domain Adaptation (UDA). To investigate the stratified domain gap approach, we rely on the traditional UDA approach using self-training. UDA takes a pre-trained (source) model (referred to as the baseline model) to generate a pseudo-label for each $x_i^T$. Subsequently, the model is trained using the pseudo-labeled data combined with the labeled data from the source domain. Applying domain adaptation directly (using the entire dataset for a single self-training process) may encounter several disadvantages (Chap. 1). Instead, our approach employs a series of UDA rounds over a sequence of target sub-domain data.
We partition the unlabeled data into a series of equally-sized groups $T_i = \{x_j^{T_i}\}_{j=1}^{|T_i|}$, forming a sequence of data $T_1, T_2, T_3, \dots, T_n$. By this, we assume that the domain shift between group $T_i$ and the source domain $S$ is less than that between $T_{i+1}$ and $S$:

$$D(S, T_i) < D(S, T_{i+1}), \quad i = 1, \dots, n-1, \tag{3.1}$$

where $D(P, Q)$¹ represents the domain gap between distributions $P$ and $Q$.
To partition the data according to the assumption in Eq. 3.1, we propose two methods for estimating the proximity of a data point $x_i^T \in T$ to the source domain $S$: Domain Discriminator (DD) and Harmonic Domain Gap Estimator (HDGE). Afterward, we arrange and partition the data so that Eq. 3.1 is satisfied. We refer to this entire process as Stage 1: Domain Stratifying. After obtaining the data groups from Stage 1, we sequentially apply self-training to each group. This process is referred to as Stage 2: Progressive Self-Training. The entire Stratified Domain Adaptation approach consists of two stages, as described in Fig. 3.1.

¹ $D$ is treated as a distance function between two distributions, e.g., Kullback-Leibler (KL) divergence.
Stage 1: Domain Stratifying
Given the source data $S$ and the target data $T$, we introduce a domain-gap estimator to evaluate the proximity of a data point $x_i^T \in T$ to the source domain $S$, denoted as $d_i$. A lower $d_i$ indicates that $x_i^T$ is closer to the source. We design two methods for the domain-gap estimator: Domain Discriminator (DD) and Harmonic Domain Gap Estimator (HDGE). After assigning $d_i$ to each data point $x_i^T$, we arrange the data in ascending order of $d_i$ and then partition it into $n$ equally-sized groups $T_i = \{x_j^{T_i}\}_{j=1}^{|T_i|}$, serving the purpose of progressive self-training, as sketched below.
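To make this partitioning step concrete, below is a minimal sketch of the sort-and-partition logic, assuming the gap scores $d_i$ have already been computed by DD or HDGE; all names are illustrative rather than the actual implementation.

```python
import numpy as np

def stratify_target_domain(image_paths, gap_scores, n_groups):
    """Partition unlabeled target images into n equally-sized groups
    T_1, ..., T_n ordered by increasing estimated domain gap d_i,
    so that the assumption of Eq. 3.1 holds by construction."""
    order = np.argsort(gap_scores)                    # ascending d_i
    sorted_paths = [image_paths[i] for i in order]
    groups = np.array_split(sorted_paths, n_groups)   # near-equal sizes
    return [list(g) for g in groups]

# Illustrative usage: six images split into three groups of two.
paths = [f"img_{i}.jpg" for i in range(6)]
d = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.2])
T_groups = stratify_target_domain(paths, d, n_groups=3)
# T_groups[0] now holds the images estimated closest to the source domain.
```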
Domain Discriminator
Harmonic Domain Gap Estimator
In DD, the discriminator is trained to differentiate between the source domain and the target domain. As the data distributions of the two domains are not separable, with some data points located in the intersection of the two domains and others lying outside both distributions, the discriminator struggles to precisely predict $d_i$.
Therefore, we propose a novel method using a discriminator pair, one for the source domain and one for the target domain ($D_S$ and $D_T$). The pair of discriminators is responsible for evaluating the difference between the data under consideration and the main features of each domain (what can be referred to as the degree of out-of-distribution of the data). By synthesizing the outputs of the two discriminators, we can determine whether the data is in-domain (near the source or the target) or outside both distributions. We denote these two out-of-distribution levels as $d_S$ and $d_T$. To calculate $d_i$ for $x_i^T$, we use the formula:

$$d_i = \frac{(1 + \beta^2) \cdot d_S(x_i^T) \cdot d_T(x_i^T)}{\beta^2 \cdot d_S(x_i^T) + d_T(x_i^T)},$$

where $0 < \beta < 1$. With this choice of $\beta$, we bias the score toward smaller $d_S(x_i^T)$, i.e., data closer to the source domain. This aligns with the condition in Eq. 3.1.
With the $d_i$-computation function designed as above, we aim to arrange the data for the progressive self-training process with the following prioritization order (a numeric sketch follows this list):

1. $x_i^T$ situated at the intersection of the two data distributions (both $d_S$ and $d_T$ are small).
2. $x_i^T$ closer to the source domain (small $d_S$, large $d_T$).
3. $x_i^T$ closer to the target domain (small $d_T$, large $d_S$).
4. $x_i^T$ outside both distributions (both $d_S$ and $d_T$ are large).
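The following numeric sketch illustrates how the harmonic combination realizes this ordering. The denominator $\beta^2 d_S + d_T$ is our reading of the formula (it is consistent with Eq. 4.1 and with the stated bias toward the source domain for small $\beta$), and the input values are purely illustrative.

```python
def hdge_score(d_s, d_t, beta=0.5):
    """Harmonic combination of the two out-of-distribution levels.
    With beta < 1 the score is dominated by d_s, so samples close to
    the source domain receive a small d_i and are adapted first."""
    return (1 + beta**2) * d_s * d_t / (beta**2 * d_s + d_t)

# The four cases of the prioritization order (illustrative values):
print(hdge_score(0.1, 0.1))  # 1. intersection of both domains -> 0.100
print(hdge_score(0.1, 0.9))  # 2. close to the source          -> 0.122
print(hdge_score(0.9, 0.1))  # 3. close to the target          -> 0.346
print(hdge_score(0.9, 0.9))  # 4. out of both distributions    -> 0.900
```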
To create a pair of discriminators $D_S$ and $D_T$ with the ability to assess out-of-distribution (OOD) levels effectively, we design a learning strategy inspired by [97].
As illustrated in Fig. 3.2, in addition to the two discriminators $D_S$ and $D_T$, we also utilize two generators: $G_T$ translates images from the source domain to the target domain ($G_T : S \to T$), and $G_S$ performs the reverse mapping ($G_S : T \to S$).
While the generators strive to learn the mapping from one domain to the other, the discriminators learn to distinguish between images produced by the generators and real images. Through adversarial learning, $G_S$ and $G_T$ improve their image generation, consequently enhancing the discriminative abilities of $D_S$ and $D_T$. As a result, when a new data point $x_i^T$ is introduced, the discriminator pair accurately assesses its out-of-distribution levels ($d_S$ and $d_T$).
Given training samples $\{x_i^S\}_{i=1}^{|S|}$ where $x_i^S \in S$ and $\{x_i^T\}_{i=1}^{|T|}$ where $x_i^T \in T$, the data distributions are denoted as $x^S \sim p_{data}(x^S)$ and $x^T \sim p_{data}(x^T)$. The adversarial loss for the mapping function $G_T : S \to T$ and its discriminator $D_T$ is expressed as follows:

$$\mathcal{L}_{GAN}(G_T, D_T, S, T) = \mathbb{E}_{x^T \sim p_{data}(x^T)}\left[\log D_T(x^T)\right] + \mathbb{E}_{x^S \sim p_{data}(x^S)}\left[\log\left(1 - D_T(G_T(x^S))\right)\right],$$

where $G_T$ attempts to generate images $G_T(x^S)$ that resemble images from domain $T$, while the objective of $D_T$ is to differentiate between translated samples $G_T(x^S)$ and real samples $x^T$. $G_T$ strives to minimize this objective against an adversary $D_T$ that seeks to maximize it, i.e., $\min_{G_T} \max_{D_T} \mathcal{L}_{GAN}(G_T, D_T, S, T)$. We use a similar adversarial loss for the mapping function $G_S : T \to S$ and its discriminator $D_S$: $\min_{G_S} \max_{D_S} \mathcal{L}_{GAN}(G_S, D_S, T, S)$.

FIGURE 3.2: Our architecture consists of two mapping functions, $G_T : S \to T$ and $G_S : T \to S$, along with associated adversarial discriminators, $D_S$ and $D_T$. While $G_S$ and $G_T$ are tasked with translating images from one domain to another, $D_S$ estimates the difference between an image and the data distribution of the source domain $S$, and similarly, $D_T$ does so for the target domain $T$.
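As a concrete illustration, here is a minimal PyTorch sketch of one adversarial update for the $S \to T$ direction. It assumes `G_T` and `D_T` are `nn.Module` instances (architectures omitted) and that `D_T` ends in a sigmoid, so binary cross-entropy realizes the two log terms of $\mathcal{L}_{GAN}$; it is a sketch under these assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def adversarial_step(G_T, D_T, x_s, x_t, opt_g, opt_d):
    """One update of the S->T generator and its discriminator under
    L_GAN(G_T, D_T, S, T)."""
    fake_t = G_T(x_s)  # translated samples G_T(x^S)

    # D_T maximizes log D_T(x^T) + log(1 - D_T(G_T(x^S))).
    d_real, d_fake = D_T(x_t), D_T(fake_t.detach())
    loss_d = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # G_T minimizes the objective, i.e., tries to fool D_T.
    d_fake = D_T(fake_t)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```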
After training, we obtain a pair of discriminators, $D_S$ and $D_T$, with the ability to estimate the domain gap $d_i$ for each data point $x_i^T$ using the formula above.
Stage 2: Progressive Self-Training
At the end of Stage 1, we have $n$ data groups ready for Stage 2, progressive self-training. As demonstrated in Fig. 3.1, we conduct self-training sequentially on each set of sub-domain data $T_i$. The entire learning process is outlined in Algorithm 1.
Algorithm 1: Progressive Self-Training
Require: Labeled images $(X, Y) \in S$ and a sequence of unlabeled image groups $T_1, \dots, T_n$
1: Train STR model $M(\cdot, \theta_0)$ with $(X, Y)$ using Eq. 3.5.
2: for $i = 1$ to $n$ do
3:   $T_i \rightarrow M(\cdot, \theta_{i-1}) \rightarrow \hat{Y}_i$ (pseudo-labels) and $m_i$ (mean of confidence scores)
4:   Train $M(\cdot, \theta_{i-1})$ on $(X, Y)$ and $(T_i, \hat{Y}_i)$ using Eq. 3.6 to obtain $\theta_i$
5: end for
Given an input image $x^S$ and its ground-truth character sequence $y = y_1, y_2, \dots, y_k$, the STR model $M(\cdot; \theta)$ outputs a sequence of probability vectors $p = M(x^S; \theta) = p_1, p_2, \dots, p_k$. Cross-entropy loss is employed to train the STR model:

$$\mathcal{L}_{ce}(x^S, y) = -\sum_{t=1}^{k} \log p_t(y_t \mid x^S), \tag{3.5}$$

where $p_t(y_t \mid x^S)$ represents the predicted probability of the output being $y_t$ at time step $t$, and $k$ is the sequence length.
In each adaptation round, after obtaining the labeled data and the pseudo-labeled data, we train the STR model $M(\cdot; \theta)$ to minimize the objective function:
$$\mathcal{L}(\theta) = \sum_{x^S \in S} \mathcal{L}_{ce}(x^S, y^S) + m_i \sum_{x^{T_i} \in T_i} \mathcal{L}_{ce}(x^{T_i}, \hat{y}^{T_i}), \tag{3.6}$$

where $m_i$ is the mean (average) of the confidence scores produced when generating pseudo-labels for the unlabeled image group $T_i$; $m_i$ serves as an adaptive controller.
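A minimal sketch of one adaptation round under Eqs. 3.5 and 3.6 follows. The model interface (returning per-character distributions of shape `(batch, k, vocab)`) and the `criterion` realizing the cross-entropy of Eq. 3.5 are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader):
    """Generate hard pseudo-labels for group T_i together with m_i,
    the mean confidence over the group (the adaptive controller)."""
    labels, confs = [], []
    for x in unlabeled_loader:
        probs = model(x).softmax(dim=-1)   # (batch, k, vocab)
        conf, pred = probs.max(dim=-1)     # per-character confidences
        labels.append(pred)
        confs.append(conf.mean(dim=-1))    # word-level confidence
    m_i = torch.cat(confs).mean().item()
    return labels, m_i

def round_loss(model, x_s, y_s, x_t, y_t_pseudo, m_i, criterion):
    """Eq. 3.6: supervised loss on source data plus the pseudo-label
    loss on group T_i, weighted by the adaptive controller m_i."""
    return criterion(model(x_s), y_s) + m_i * criterion(model(x_t), y_t_pseudo)
```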
Both DD and HDGE are discriminator-based methods. However, there are distinctions between them. First, DD employs a single discriminator to distinguish between the two domains, whereas HDGE utilizes a pair of discriminators, each evaluating the out-of-distribution (OOD) status with respect to its own domain; HDGE then synthesizes the two OOD levels to quantify the domain gap. Second, DD relies on the backbone architecture of the problem (in our case, the STR baseline model), while HDGE is constructed independently.

Each method has its advantages. DD is effective when incorporating domain knowledge through the architecture and features of the binary classifier. HDGE is a general method that can use the same architecture for various tasks; however, its drawback lies in its dependency on the hyper-parameter $\beta$ in the formula. Both methods demonstrate effectiveness in our experiments. Depending on these key factors, one can choose between the two methods.
Additional Training Techniques
Label Sharpening. We "sharpen" the soft labels to encourage the model to update its parameters. Consequently, during the training process in Stage 2, we utilize the model's predictions on unlabeled data as definitive (hard) pseudo-labels rather than relying on their probabilities, as sketched below.
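A one-line sketch of the sharpening step (the tensor shape is an assumption for illustration):

```python
def sharpen(probs):
    """Turn soft per-character distributions of shape (batch, k, vocab)
    into definitive hard pseudo-labels, discarding the probabilities."""
    return probs.argmax(dim=-1)  # (batch, k) integer character indices
```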
Regularization. Regularization is a significant factor in self-training: without it, the model is not incentivized to change during self-training. Therefore, we also incorporate it into our model training process.
Our work focuses on addressing the domain gap between the source domain, which is synthetic data, and the target domain, which is real data, in scene text recognition.
Dataset
To overcome the limiting factors highlighted by [2, 28] on the effectiveness of synthetic-trained models in the real domain, our study aims to boost the performance of STR models in real scenarios without incurring human-annotation costs. We leverage two types of data during the training process: synthetic data (SynthText (ST) [20] and MJSynth (MJ) [27]) and real data without labels.
For real scene text data, we collect images from public real training datasets, including ArT [13], COCO [71], LSVT [68], MLT19 [53], RCTW17 [62], ReCTS [89], Uber [92], and TextOCR [65]. Similar to [2], we exclude vertical text (height > width) and images whose width is greater than 25 times the height. We discard the labels and obtain an unlabeled dataset of around 1M real samples, denoted as real unlabeled data (RU).
Dataset          Venue         Train    Valid   Eval
IIIT5k [51]      BMVC 2012      2,000       -    3,000
SVT [74]         ICCV 2011        257       -      647
IC13 [30]        ICDAR 2013       848       -    1,015
IC15 [31]        ICDAR 2015     4,468       -    2,077
SVTP [55]        ICCV 2013          -       -      645
ArT [13]         ICDAR 2019    32,349       -   35,149
ReCTS [89]       ICDAR 2019    25,328       -    2,592
LSVT [68, 67]    ICDAR 2019    43,244       -        -
MLT19 [53]       ICDAR 2019    56,937       -        -
RCTW17 [62]      ICDAR 2017    10,509       -        -

TABLE 4.1: Summary of dataset usage. Numbers indicate how many samples were used from each dataset. "†" refers to splits that were repurposed as training data.
Six real-world STR datasets have been widely used for evaluating trained STR models. Baek et al. [3] introduce these datasets by categorizing them into regular and irregular datasets. The benchmark datasets are labeled "regular" or "irregular" [63, 82, 11] according to the difficulty and geometric layout of the texts. First, regular datasets contain text images with horizontally laid-out characters that have even spacing between them. These represent relatively easy cases for STR: IIIT5K-Words (IIIT) [51], Street View Text (SVT) [74], and ICDAR2013 (IC13) [30]. Second, irregular datasets typically contain harder corner cases for STR, such as curved and arbitrarily rotated or distorted texts [63, 82, 11]: ICDAR2015 (IC15) [31], SVT Perspective (SVTP) [55], and CUTE80 (CUTE) [57].
Detailed information about these benchmark datasets can be found in prior studies [3, 63, 82]. We utilize the datasets re-annotated by [6] to address inconsistencies in our evaluation. It is important to note that both IC13 and IC15 have two versions of their respective test splits, which are commonly referenced in the literature: 857 and 1,015 images for IC13, and 1,811 and 2,077 images for IC15. To avoid any confusion, we refer to the benchmark as the combination of IIIT5k, CUTE, SVT, SVTP, IC13 (1,015), and IC15 (2,077).
For a more comprehensive comparison, we extend our evaluation to include four larger and more challenging datasets: COCO [71] (13.4k samples; occluded, low-resolution text), Uber [92] (36.2k samples; low resolution, difficult fonts), ArT [13] (3.2k samples; perspective and curved text), and ReCTS [89] (2.5k samples; difficult fonts and layouts).
Our approach leverages labeled synthetic data and unlabeled real data, as shown in Tab. 4.1. We discard the labels of the real datasets to align with our experimental setting. The "Train" figures we report differ slightly from [2, 6] because we use raw images (with labels discarded) and do not employ any label-dependent filtering.
Evaluation metric

In accordance with standard conventions [3], we report word-level accuracy for each dataset. Furthermore, to provide a thorough evaluation of models on both regular and irregular text, as per [2], we introduce an average score denoted as "Avg.". This score represents the accuracy over the combined set of samples from all six benchmark datasets (IIIT, SVT, IC13, IC15, SVTP, and CUTE), as computed below.
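For clarity, the sketch below computes "Avg." as a sample-weighted accuracy rather than a mean of per-dataset scores; plugging in the CRNN-StrDA_HDGE per-dataset accuracies from Tab. 4.5 together with the evaluation split sizes reproduces its 79.1 Avg.

```python
def benchmark_avg(results):
    """'Avg.' score: accuracy over the union of samples of all six
    benchmarks. `results` maps dataset -> (accuracy_percent, num_samples)."""
    correct = sum(acc / 100.0 * n for acc, n in results.values())
    total = sum(n for _, n in results.values())
    return 100.0 * correct / total

avg = benchmark_avg({
    "IIIT": (88.0, 3000), "SVT": (81.0, 647), "IC13": (89.6, 1015),
    "IC15": (65.2, 2077), "SVTP": (66.7, 645), "CUTE": (73.6, 288),
})
print(f"{avg:.1f}")  # 79.1
```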
STR model
According to Baek et al. [3], STR is performed in four stages: Transformation (Trans.), Feature extraction (Feat.), Sequence modeling (Seq.), and Prediction (Pred.), as illustrated in Fig. 4.1. For our experiments, we adopt two widely-used models from the STR benchmark [3]: CRNN [60] (None, VGG, BiLSTM, CTC) and TRBA [3] (TPS [26], ResNet, BiLSTM, Attention).
The two STR models, CRNN [3] and TRBA [3], are employed to assess the effectiveness of the proposed framework using their default configurations. Both STR models are initialized from the synthetic-trained (source) models (baseline models) in [2]. We chose CRNN and TRBA because they are widely adopted in domain adaptation work for STR, which makes them an appropriate choice for a fair and comprehensive comparison.
FIGURE 4.1: Visualization of an example flow of scene text recognition: an input image is normalized (Trans.), converted into a visual feature (Feat.), then into a contextual feature (Seq.), and finally decoded into a prediction via a fully connected layer (Pred.).
Domain Discriminator

Domain Discriminator (DD) is based on the architecture of the STR model: it employs a binary classifier consisting of the feature extractor (Feat.) of the baseline STR model followed by a fully connected layer, as shown in Fig. 4.1. A minimal sketch follows.
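The sketch below assumes the backbone exposes its feature extractor as a module producing a `(batch, C, H, W)` feature map; `feat_extractor` and `feat_dim` are hypothetical names, and the pooling choice is ours.

```python
import torch
import torch.nn as nn

class DomainDiscriminator(nn.Module):
    """Binary source/target classifier: the baseline STR model's
    feature extractor (Feat.) followed by one fully connected layer."""
    def __init__(self, feat_extractor, feat_dim):
        super().__init__()
        self.feat = feat_extractor
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, x):
        h = self.feat(x)         # (batch, C, H, W) visual feature map
        h = h.mean(dim=(2, 3))   # global average pooling -> (batch, C)
        return torch.sigmoid(self.fc(h)).squeeze(1)  # P(x from target)
```

Sorting target images by this output in ascending order yields the ordering required by Eq. 3.1, since a lower score means the image looks more source-like.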
Harmonic Domain Gap Estimator
TABLE 4.2: Discriminator architecture configuration for the Harmonic Domain Gap Estimator. Here, c, k, s, and p stand for the number of channels, filter size, stride, and padding, respectively.
To create a pair of discriminators $D_S$ and $D_T$ with the ability to assess out-of-distribution (OOD) levels effectively, we design a learning strategy inspired by [97]. Our discriminators ($D_S$ and $D_T$) are described in Tab. 4.2.
Input images are resized to 32 × 100. For both stages, we employ the AdamW [46] optimizer with a weight decay of 0.005. We also utilize the one-cycle learning rate scheduler [66] with a maximum learning rate of 0.0005. The training batch size is consistently set to 128, and gradient clipping is applied with a magnitude of 5. For Stage 1, we train the Domain Discriminator and the Harmonic Domain Gap Estimator for 20 epochs. In Stage 2, the STR models are trained for 50K iterations. All experiments are performed on an NVIDIA GeForce RTX 2080 Ti (11 GB VRAM).
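For reproducibility, here is a sketch of the optimizer setup matching these hyper-parameters; the function name and wiring are illustrative.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

def build_optimizer(model, total_steps):
    """AdamW (weight decay 0.005) with a one-cycle learning rate
    schedule peaking at the maximum learning rate of 5e-4."""
    optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=0.005)
    scheduler = OneCycleLR(optimizer, max_lr=5e-4, total_steps=total_steps)
    return optimizer, scheduler

# Per training step, gradients are clipped with magnitude 5 before stepping:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```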
FIGURE 4.2: The Stratified Domain Adaptation (StrDA) approach partitions the data from the target domain into three distinct groups, with the disparity across domains gradually rising, as shown in the image. For each sample, the figure shows the pseudo-labels employed in the self-training process of UDA and StrDA_HDGE, respectively (e.g., round 2: UDA reads "DISTNCT" while StrDA_HDGE reads "DISTRICT"; round 3: UDA reads "Mucotic" while StrDA_HDGE reads "Marcotte"). The pseudo-labels generated by UDA are prone to noise as the extent of domain shift escalates, whereas StrDA produces pseudo-labels with higher accuracy.
We conduct experiments using two STR models (CRNN [3] and TRBA [3]). We employ pre-trained models (trained only on synthetic data) from [2]. We aim to enhance the models' performance on the real test sets using domain adaptation approaches. To demonstrate the effectiveness of our stratified domain adaptation approach compared to traditional unsupervised domain adaptation (UDA) (Sec. 3.1), we ran independent experiments with the same protocols.
As illustrated in Tab. 4.3, all domain adaptation experiments (UDA, StrDA_DD, StrDA_HDGE) surpass the baseline models across ten public benchmarks. Despite relying solely on additional unlabeled real data and self-training with pseudo-labels, without employing any other advanced techniques, the experiments remarkably enhance the STR models' performance on both regular and irregular datasets. Specifically, CRNN and TRBA exhibit remarkable improvements on the CUTE dataset (+12.3% and +9.0%) when applying StrDA_DD. These results emphasize the significance of integrating real images into training STR models and showcase the advantages of our self-training framework. Furthermore, our experiments demonstrate an impressive improvement on irregular text benchmarks, such as IC15, SVTP, and CUTE80; the gains on these irregular benchmarks are significantly higher than on regular text benchmarks.

Notably, both of our StrDA approaches outperform traditional UDA. We observe that UDA does not perform well without any domain sequence, although it shows a slight improvement over the source model (improving Avg. by 2.2% for TRBA and 3.0% for CRNN). By organizing and partitioning data based on the progressive increase in domain shift, our progressive self-training framework demonstrates strong effectiveness.
FIGURE 4.3: Illustration of the experiment results in Tab. 4.3 (1: baseline, 2: UDA, 3: StrDA_DD, 4: StrDA_HDGE).
TABLE 4.3: Results on the scene text benchmarks: scene text recognition accuracy (%) over ten public datasets. The number of words in each dataset is listed with its name. "(baseline)" means models trained only on synthetic data (results reported in the original paper [2]). We present the results of our domain adaptation methods for each baseline model: "UDA" refers to traditional unsupervised domain adaptation, while StrDA_DD and StrDA_HDGE are our Stratified Domain Adaptation methods using the Domain Discriminator and the Harmonic Domain Gap Estimator, respectively. The numbers denoted by "Δ" in green indicate improvements over each dataset. In each column, the best result is shown in bold, and the best result for each model is underlined. Our methods significantly improve the performance of the baseline models.
Method             200K          500K          1M
TRBA*              85.7          85.7          85.7
TRBA-UDA           87.6 (+1.9)   88.1 (+2.4)   87.9 (+2.2)
TRBA-StrDA_DD      88.1 (+2.4)   88.3 (+2.6)   88.9 (+3.2)
TRBA-StrDA_HDGE    87.9 (+2.2)   88.6 (+2.9)   89.4 (+3.7)

TABLE 4.4: Domain adaptation performance as a function of the fraction of unlabeled data. We sample three scales (200K, 500K, and 1M) of data from RU (1M). "*" denotes the performance of the baseline model, which does not utilize domain adaptation with real data.
Besides performing well across various STR models, our approach also yields good results even with a limited quantity of unlabeled real data, as presented in Tab. 4.4. However, when the amount of data decreases to only 200K, our StrDA approach does not demonstrate significant superiority over UDA. This may occur for two reasons. First, the domain gap between the source and target domains is small, making the division of the domain gap still effective but not distinctly clear. Second, while the source and target domains remain large, there is not much data lying between the two data distributions, so the sequence of sub-domains may not help the model adapt progressively.
In this section, we compare our approaches with current state-of-the-art (SOTA) methods on the six test benchmarks. As shown in Tab. 4.5, the original TRBA had a relatively low average score, but after applying our approach, it increased by 3.7% (from 85.7% to 89.4%), surpassing ESIR, DAN, SE-ASTER, RobustScanner, and TextScanner. However, while our experiments significantly boosted the performance of CRNN and TRBA, our models could not outperform current advanced techniques. Nevertheless, we believe that our framework can also be utilized to augment SOTA models, thereby yielding further comparative results.
For previous domain adaptation methods, a fair comparison is challenging due to variations in baseline models and training environments, as each method is trained and tested on different datasets. We acknowledge the contributions of these previous works. It is worth noting that our framework can be viewed as a sequence of unsupervised domain adaptation (UDA) rounds; therefore, it would be highly advantageous to integrate advanced UDA techniques into Stage 2 of our framework.
Method             DA    IIIT   SVT    IC13   IC15   SVTP   CUTE   Avg.
CRNN-UDA           ✓     87.2   81.9   90.4   64.5   68.4   68.1   78.8
CRNN-StrDA_DD      ✓     88.3   82.1   89.4   66.0   67.6   71.5   79.5
CRNN-StrDA_HDGE    ✓     88.0   81.0   89.6   65.2   66.7   73.6   79.1
TRBA-UDA           ✓     94.3   91.2   95.0   76.5   83.6   80.9   87.9
TRBA-StrDA_DD      ✓     95.0   91.8   93.4   79.4   83.3   83.7   88.9

TABLE 4.5: Comparison with SOTA methods on STR test accuracy. "DA" indicates the use of domain adaptation; IIIT, SVT, and IC13 are regular-text benchmarks, while IC15, SVTP, and CUTE are irregular-text benchmarks.
FIGURE 4.4: Ablation studies with TRBA: (a) ablation on data division (accuracy vs. number of groups); (b) ablation on HDGE's formula, Eq. 4.1 (accuracy vs. hyper-parameter β).
An important question arises for our approach: given the source domain data and the arranged target domain data, how should we partition the data? What is a reasonable number of groups, and how many data points should be in each group? In this section, we compare the performance of StrDA under different numbers of groups. The total amount of unlabeled data remains 1M (RU), and the experiments are conducted with the same number of iterations and the same environment. We use TRBA for all ablation tests because of its strong performance.
According to Fig. 4.4a, StrDA is more successful for both methods when the data is divided into more, smaller groups. This is understandable, as smaller partitions reduce the domain disparity between consecutive groups, facilitating the model's adaptation. StrDA with the Harmonic Domain Gap Estimator consistently outperforms the Domain Discriminator variant, showcasing its superiority.
4.6.2 Ablation on Harmonic Domain Gap Estimator
In this section, we investigate the hyper-parameter β in the HDGE formula:

$$d_i = \frac{(1 + \beta^2) \cdot d_S(x_i^T) \cdot d_T(x_i^T)}{\beta^2 \cdot d_S(x_i^T) + d_T(x_i^T)}, \tag{4.1}$$

where $d_S$ and $d_T$ are the distances from the data point $x_i^T$ to the data distributions of the source domain and the target domain, respectively, and $0 < \beta < 1$. With a smaller β, the distance $d_i$ is biased toward the source domain.