By recognizing the text components present in a scene, scene text recognition enables a range of applications, from automating document processing to providing assistance.
Synthetic datasets for training
The key challenge of scene text recognition lies in the limited amount of annotated data available for training. To address this issue, researchers often resort to synthetic data to train their models, since large amounts of data can be generated in this manner [27, 20].

• MJSynth (MJ) [27] is a synthetic dataset created for Scene Text Recognition (STR) tasks. It contains 9 million synthetic images of scene text, each with a single line of text. The dataset is generated by an algorithm that renders realistic text images with varying fonts, sizes, orientations, and backgrounds. MJSynth has been extensively used for training and evaluating scene text recognition models, as it provides a large amount of labeled data. Additionally, the synthetic data allows for the use of data augmentation techniques, which can improve the performance of scene text recognition models.

• SynthText (ST) [20] is a synthetic dataset generated by combining text with a large collection of background images. It is composed of 800,000 synthetic images and 7 million word boxes of text with a variety of fonts, colors, sizes, backgrounds, and shapes. The images are generated using a text-to-image rendering algorithm, and the text is written in Latin-based languages. Additionally, the images are augmented using transformations such as rotation, scaling, and perspective distortion, which provides a more diverse set of training data. ST is a valuable resource for creating models that can accurately detect text in different languages and domains. For STR, we crop the text regions from the scene images and use them for training.
FIGURE 2.4: The process of creating the MJSynth dataset: font rendering, border/shadow and color, composition, projective distortion, and natural image blending. The image is taken from [27].
Real datasets
Real-world datasets are essential for training Scene Text Recognition (STR) models. However, collecting real data is both time-consuming and costly. Baek et al. [2] and Bautista et al. [6] have compiled several real datasets collected from text recognition competitions and publicly available data.

• IIIT5K-Words (IIIT) [51] contains 5,000 natural scene images of printed English text (3,000 for training, 2,000 for evaluation). It is one of the most popular datasets for scene text recognition and was created by the International Institute of Information Technology (IIIT) in Hyderabad. The images in the dataset were retrieved from Google image searches using terms like "billboards" and "movie posters".

• Street View Text (SVT) [74] is used to evaluate STR models. It consists of 257 training images and 647 evaluation images of Google Street View text, with each image containing a single line of text. The text in the images is in Latin-based languages and comes in a variety of fonts, orientations, and backgrounds.

• ICDAR2013 (IC13) [30] inherits most of IC03's images [47] and was created for the ICDAR 2013 Robust Reading competition. It contains 848 images for training and 1,095 images for evaluation; pruning words with non-alphanumeric characters results in 1,015 images. Researchers have used two different versions for evaluation: 857 and 1,015 images. The 857-image set is a subset of the 1,015-image set in which words shorter than 3 characters are pruned.

• ICDAR2015 (IC15) [31] was created for the ICDAR 2015 Robust Reading competition and contains 4,468 images for training and 2,077 images for evaluation. The images were captured by Google Glass while following the natural movements of the wearer; thus, many are noisy, blurry, and rotated, and some are of low resolution. Again, researchers have used two different versions for evaluation: 1,811 and 2,077 images. Previous papers [5, 12] have used only 1,811 images, discarding non-alphanumeric character images and some extremely rotated, perspective-shifted, or curved images.

• SVT Perspective (SVTP) [55] is collected from Google Street View and contains 645 images for evaluation. Many of the images contain perspective projections due to the prevalence of non-frontal viewpoints.

• CUTE80 (CUTE) [57] is collected for curved text. The images were captured by a digital camera or collected from the Internet. It contains 288 cropped evaluation images.

• COCO-Text (COCO) [71] is created from the MS COCO dataset [41]. As MS COCO is not intended to capture text, COCO contains many occluded or low-resolution texts.

• Uber-Text (Uber) [92] is collected from Bing Maps Streetside. Many images show house numbers, and some show text on signboards.

• ArT [13] is created to recognize Arbitrary-shaped Text; many images contain perspective or curved texts. It also includes Total-Text [8] and CTW1500 [44], which contain many rotated or curved texts.

• ReCTS [89] is created for the Reading Chinese Text on Signboard competition. It contains many irregular texts arranged in various layouts and written with unique fonts.

• LSVT [68, 67] is a Large-scale Street View Text dataset collected from streets in China; thus, much of the text is Chinese.

• MLT19 [53] is created to recognize Multi-Lingual Text. It consists of seven languages: Arabic, Latin, Chinese, Japanese, Korean, Bangla, and Hindi.

• RCTW17 [62] is created for the Reading Chinese Text in the Wild competition; thus, many images contain Chinese text.

• TextOCR [65] is a large arbitrary-shaped STR dataset derived from Open Images, a large dataset with very diverse images that often contain complex scenes.
FIGURE 2.6: Examples of real-world data.
The sample images are taken from [2].
Domain Adaptation
Domain shift [88] (or domain gap) is an important issue in machine learning: a model's performance degrades when it is applied to a new domain. Domain shift occurs when a model is trained on data from one domain but then applied to data from another domain it is not familiar with. This can lead to a decrease in accuracy and robustness, as the model is unable to accurately recognize text from the new domain [73, 19, 32, 58, 24].
To address this issue, Domain Adaptation has been proposed to bridge the gap between training and testing data by adapting trained models to the test distribution with the help of data from the target domain [93, 45, 16]. This is done by transferring knowledge from the source domain, i.e., the training distribution, to the target domain, allowing the model to adapt to the new domain while maintaining its original performance. The goal is to minimize the domain discrepancy so that a model trained on the source domain performs well on the target domain. Domain adaptation techniques can be used to adapt a model to different domains, such as different languages, different geographic areas, or different types of data. This is especially relevant for scene text recognition, where most models are trained on synthetic data and evaluated on real-world data. Synthetic data is generated to simulate real-world data, but it does not cover enough of the complexity of real-world data, such as fonts, backgrounds, and styles, leading to a large gap between the two domains. As a result, we employ domain adaptation to address the domain shift issue in scene text recognition. By using domain adaptation, models remain generalizable and can be more easily deployed in various contexts.
Domain Adaptation is classified into two main types: Supervised Domain Adaptation (SDA) and Unsupervised Domain Adaptation (UDA). SDA relies on labeled target domains but works well with a limited number of labels. UDA, on the other hand, employs unlabeled target domain data but requires a large number of target samples [15].
In this thesis, we use synthetic data for training and labeled real-world data for evaluation. However, due to the scarcity of labeled real-world data and the abundance of unlabeled real-world data, we use unlabeled real-world data as the target domain and attempt to align the distributions of synthetic and real-world data using unsupervised domain adaptation.
Unsupervised Domain Adaptation for STR
In recent years, various strategies for unsupervised domain adaptation have been proposed. The approaches used for UDA can be broadly classified into two categories: self-trained domain adaptation and adversarial domain adaptation. In self-trained domain adaptation, the model is trained on a labeled dataset in the source domain and an unlabeled dataset in the target domain, with the aim of minimizing the distributional discrepancy between the two domains [2, 38, 37, 99, 98, 85]. Adversarial domain adaptation approaches, on the other hand, focus on learning a domain-invariant feature representation by training a generative adversarial network to discriminate between the source and target domains [91, 90, 16].
Several methods have attempted domain adaptation for scene text recognition. Zhan et al. (2018) [87] presented the Geometry-Aware Domain Adaptation Network (GA-DAN), which converts a synthetic text image into a real-scene text image and uses the converted image to train the target recognition model. Baek et al. (2021) [2] employed pseudo-labels for self-training to improve STR performance while utilizing only real data to train the STR model. SMILE [9] uses sequence-to-sequence unsupervised domain adaptation to minimize the latent entropy within the decoder. ICD-DA [70] addresses the imbalance of character distribution during self-training. ASSDA [90] employs an Adversarial Sequence-to-Sequence Domain Adaptation network, which can adaptively transfer coarse global-level and fine-grained character-level knowledge.
To bridge the gap between the source and target datasets, we focus our evaluation on self-trained domain adaptation for the scene text recognition task. This approach commonly uses pseudo-labeling to adapt from a labeled source dataset, typically a synthetic one, to an unlabeled target dataset, usually real-world data. When learning from both synthetic and real datasets at the same time, domain adaptation can improve performance by attempting to reduce the difference between domains. However, due to the large domain gap between the two datasets, pseudo-labeling can be ineffective and lead to the model being trained on wrong labels [94, 7, 32, 33]. We take into account the fact that the domain gap has a progressive tendency. To this end, unlike directly adapting from the source domain to the target domain, we propose to exploit the gradual escalation of the domain gap. This strategy introduces a series of target sub-domains that progressively bridge the gap between the source and target domains, allowing for more effective domain adaptation. By doing this, the model learns from the labeled source data as well as from the unlabeled target data more effectively, bridging the gap between the two datasets and performing better on the scene text recognition task. We also provide observations on how domain adaptation works and optimize the adaptation across domains.
FIGURE 3.1: The overall framework of our proposed Stratified Domain Adaptation (StrDA) approach for STR. Our approach leverages labeled synthetic data and unlabeled real data, eliminating human annotation costs entirely. The entire process is divided into two stages: Domain Stratifying (partitioning the unlabeled real data into groups satisfying Eq. 3.1) and Progressive Self-Training.
Overview
In this thesis, our focus is on addressing the problem using two predefined datasets: labeled data acting as the source domain $S = \{(x_i^S, y_i^S)\}_{i=1}^{|S|}$ and unlabeled data serving as the target domain $T = \{x_i^T\}_{i=1}^{|T|}$. The goal of domain adaptation is to improve the performance of the pre-trained (source) model using both $S$ and $T$.
Unsupervised Domain Adaptation (UDA). To investigate the stratified domain gap approach, we rely on the traditional UDA approach using self-training. UDA takes a pre-trained (source) model (referred to as the baseline model) to generate a pseudo-label for each $x_i^T$. Subsequently, the model is trained using the pseudo-labeled data combined with the labeled data from the source domain. Applying domain adaptation directly (using the entire dataset for a single self-training process) may encounter several disadvantages (Chap. 1). Instead, our approach employs a series of UDA rounds over a sequence of target sub-domain data.
We partition the unlabeled data into a series of equally-sized groups $T_i = \{x_j^{T_i}\}_{j=1}^{|T_i|}$, forming a sequence of data $T_1, T_2, T_3, \dots, T_n$. By this, we assume that the domain shift between group $T_i$ and the source domain $S$ is less than that between $T_{i+1}$ and $S$:

$$D(S, T_i) < D(S, T_{i+1}), \quad i = 1, \dots, n-1, \tag{3.1}$$

where $D(P, Q)$¹ represents the domain gap between distributions $P$ and $Q$.
To partition the data according to the assumption in Eq. 3.1, we propose two methods for estimating the proximity of a data point $x_i^T \in T$ to the source domain $S$: Domain Discriminator (DD) and Harmonic Domain Gap Estimator (HDGE). Afterward, we arrange and partition the data so that Eq. 3.1 is satisfied. We refer to this entire process as Stage 1: Domain Stratifying. After obtaining the data groups from Stage 1, we sequentially apply self-training to each group. This process is referred to as Stage 2: Progressive Self-Training. The entire Stratified Domain Adaptation approach consists of two stages, as described in Fig. 3.1.

¹ $D$ is treated as a distance function between two distributions, e.g., Kullback-Leibler (KL) divergence.
Stage 1: Domain Stratifying
Given the source data $S$ and the target data $T$, we introduce a domain-gap estimator to evaluate the proximity of a data point $x_i^T \in T$ to the source domain $S$, denoted as $d_i$. A lower $d_i$ indicates that $x_i^T$ is closer to the source. We design two methods for the domain-gap estimator: Domain Discriminator (DD) and Harmonic Domain Gap Estimator (HDGE). After assigning $d_i$ to each data point $x_i^T$, we arrange the data in ascending order of $d_i$ and then partition it into $n$ equally-sized groups $T_i = \{x_j^{T_i}\}_{j=1}^{|T_i|}$, serving the purpose of progressive self-training, as sketched below.
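To make this partitioning step concrete, below is a minimal sketch of the sort-and-partition logic, assuming the gap scores $d_i$ have already been computed by DD or HDGE; all names are illustrative rather than the actual implementation.

```python
import numpy as np

def stratify_target_domain(image_paths, gap_scores, n_groups):
    """Partition unlabeled target images into n equally-sized groups
    T_1, ..., T_n ordered by increasing estimated domain gap d_i,
    so that the assumption of Eq. 3.1 holds by construction."""
    order = np.argsort(gap_scores)                    # ascending d_i
    sorted_paths = [image_paths[i] for i in order]
    groups = np.array_split(sorted_paths, n_groups)   # near-equal sizes
    return [list(g) for g in groups]

# Illustrative usage: six images split into three groups of two.
paths = [f"img_{i}.jpg" for i in range(6)]
d = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.2])
T_groups = stratify_target_domain(paths, d, n_groups=3)
# T_groups[0] now holds the images estimated closest to the source domain.
```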
Domain Discriminator
Harmonic Domain Gap Estimator
In DD, the discriminator is trained to differentiate between the source domain and the target domain. As the data distributions of the two domains are not separable, with some data points located in the intersection of the two domains and others lying outside both distributions, the discriminator struggles to precisely predict $d_i$.
Therefore, we propose a novel method using a discriminator pair, one for the source domain and one for the target domain ($D_S$ and $D_T$). The pair of discriminators is responsible for evaluating the difference between the data under consideration and the main features of each domain (what can be referred to as the degree of out-of-distribution of the data). By synthesizing the outputs of the two discriminators, we can determine whether the data is in-domain (near the source or the target) or outside both distributions. We denote these two out-of-distribution levels as $d_S$ and $d_T$. To calculate $d_i$ for $x_i^T$, we use the formula:

$$d_i = \frac{(1 + \beta^2) \cdot d_S(x_i^T) \cdot d_T(x_i^T)}{\beta^2 \cdot d_S(x_i^T) + d_T(x_i^T)},$$

where $0 < \beta < 1$. With this choice of $\beta$, we bias the score toward smaller $d_S(x_i^T)$, i.e., data closer to the source domain. This aligns with the condition in Eq. 3.1.
With the $d_i$-computation function designed as above, we aim to arrange the data for the progressive self-training process with the following prioritization order (a numeric sketch follows this list):

1. $x_i^T$ situated at the intersection of the two data distributions (both $d_S$ and $d_T$ are small).
2. $x_i^T$ closer to the source domain (small $d_S$, large $d_T$).
3. $x_i^T$ closer to the target domain (small $d_T$, large $d_S$).
4. $x_i^T$ outside both distributions (both $d_S$ and $d_T$ are large).
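The following numeric sketch illustrates how the harmonic combination realizes this ordering. The denominator $\beta^2 d_S + d_T$ is our reading of the formula (it is consistent with Eq. 4.1 and with the stated bias toward the source domain for small $\beta$), and the input values are purely illustrative.

```python
def hdge_score(d_s, d_t, beta=0.5):
    """Harmonic combination of the two out-of-distribution levels.
    With beta < 1 the score is dominated by d_s, so samples close to
    the source domain receive a small d_i and are adapted first."""
    return (1 + beta**2) * d_s * d_t / (beta**2 * d_s + d_t)

# The four cases of the prioritization order (illustrative values):
print(hdge_score(0.1, 0.1))  # 1. intersection of both domains -> 0.100
print(hdge_score(0.1, 0.9))  # 2. close to the source          -> 0.122
print(hdge_score(0.9, 0.1))  # 3. close to the target          -> 0.346
print(hdge_score(0.9, 0.9))  # 4. out of both distributions    -> 0.900
```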
To create a pair of discriminators $D_S$ and $D_T$ with the ability to assess out-of-distribution (OOD) levels effectively, we design a learning strategy inspired by [97].
As illustrated in Fig. 3.2, in addition to the two discriminators $D_S$ and $D_T$, we also utilize two generators: $G_T$ translates images from the source domain to the target domain ($G_T : S \to T$), and $G_S$ performs the reverse mapping ($G_S : T \to S$).
While the generators strive to learn the mapping from one domain to the other, the discriminators learn to distinguish between images produced by the generators and real images. Through adversarial learning, $G_S$ and $G_T$ improve their image generation, consequently enhancing the discriminative abilities of $D_S$ and $D_T$. As a result, when a new data point $x_i^T$ is introduced, the discriminator pair accurately assesses its out-of-distribution levels ($d_S$ and $d_T$).
Given training samples $\{x_i^S\}_{i=1}^{|S|}$ where $x_i^S \in S$ and $\{x_i^T\}_{i=1}^{|T|}$ where $x_i^T \in T$, the data distributions are denoted as $x^S \sim p_{data}(x^S)$ and $x^T \sim p_{data}(x^T)$. The adversarial loss for the mapping function $G_T : S \to T$ and its discriminator $D_T$ is expressed as follows:

$$\mathcal{L}_{GAN}(G_T, D_T, S, T) = \mathbb{E}_{x^T \sim p_{data}(x^T)}\left[\log D_T(x^T)\right] + \mathbb{E}_{x^S \sim p_{data}(x^S)}\left[\log\left(1 - D_T(G_T(x^S))\right)\right],$$

where $G_T$ attempts to generate images $G_T(x^S)$ that resemble images from domain $T$, while the objective of $D_T$ is to differentiate between translated samples $G_T(x^S)$ and real samples $x^T$. $G_T$ strives to minimize this objective against an adversary $D_T$ that seeks to maximize it, i.e., $\min_{G_T} \max_{D_T} \mathcal{L}_{GAN}(G_T, D_T, S, T)$. We use a similar adversarial loss for the mapping function $G_S : T \to S$ and its discriminator $D_S$: $\min_{G_S} \max_{D_S} \mathcal{L}_{GAN}(G_S, D_S, T, S)$.

FIGURE 3.2: Our architecture consists of two mapping functions, $G_T : S \to T$ and $G_S : T \to S$, along with associated adversarial discriminators, $D_S$ and $D_T$. While $G_S$ and $G_T$ are tasked with translating images from one domain to another, $D_S$ estimates the difference between an image and the data distribution of the source domain $S$, and similarly, $D_T$ does so for the target domain $T$.
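As a concrete illustration, here is a minimal PyTorch sketch of one adversarial update for the $S \to T$ direction. It assumes `G_T` and `D_T` are `nn.Module` instances (architectures omitted) and that `D_T` ends in a sigmoid, so binary cross-entropy realizes the two log terms of $\mathcal{L}_{GAN}$; it is a sketch under these assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def adversarial_step(G_T, D_T, x_s, x_t, opt_g, opt_d):
    """One update of the S->T generator and its discriminator under
    L_GAN(G_T, D_T, S, T)."""
    fake_t = G_T(x_s)  # translated samples G_T(x^S)

    # D_T maximizes log D_T(x^T) + log(1 - D_T(G_T(x^S))).
    d_real, d_fake = D_T(x_t), D_T(fake_t.detach())
    loss_d = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # G_T minimizes the objective, i.e., tries to fool D_T.
    d_fake = D_T(fake_t)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```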
After training, we obtain a pair of discriminators, $D_S$ and $D_T$, with the ability to estimate the domain gap $d_i$ for each data point $x_i^T$ using the formula above.
Stage 2: Progressive Self-Training
At the end of Stage 1, we have $n$ data groups ready for Stage 2, progressive self-training. As demonstrated in Fig. 3.1, we conduct self-training sequentially on each set of sub-domain data $T_i$. The entire learning process is outlined in Algorithm 1.
Algorithm 1: Progressive Self-Training
Require: Labeled images $(X, Y) \in S$ and a sequence of unlabeled image groups $T_1, \dots, T_n$
1: Train STR model $M(\cdot, \theta_0)$ with $(X, Y)$ using Eq. 3.5.
2: for $i = 1$ to $n$ do
3:   $T_i \rightarrow M(\cdot, \theta_{i-1}) \rightarrow \hat{Y}_i$ (pseudo-labels) and $m_i$ (mean of confidence scores)
4:   Train $M(\cdot, \theta_{i-1})$ on $(X, Y)$ and $(T_i, \hat{Y}_i)$ using Eq. 3.6 to obtain $\theta_i$
5: end for
Given an input image $x^S$ and its ground-truth character sequence $y = y_1, y_2, \dots, y_k$, the STR model $M(\cdot; \theta)$ outputs a sequence of probability vectors $p = M(x^S; \theta) = p_1, p_2, \dots, p_k$. Cross-entropy loss is employed to train the STR model:

$$\mathcal{L}_{ce}(x^S, y) = -\sum_{t=1}^{k} \log p_t(y_t \mid x^S), \tag{3.5}$$

where $p_t(y_t \mid x^S)$ represents the predicted probability of the output being $y_t$ at time step $t$, and $k$ is the sequence length.
In each adaptation round, after obtaining the labeled data and the pseudo-labeled data, we train the STR model $M(\cdot; \theta)$ to minimize the objective function:
$$\mathcal{L}(\theta) = \sum_{x^S \in S} \mathcal{L}_{ce}(x^S, y^S) + m_i \sum_{x^{T_i} \in T_i} \mathcal{L}_{ce}(x^{T_i}, \hat{y}^{T_i}), \tag{3.6}$$

where $m_i$ is the mean (average) of the confidence scores produced when generating pseudo-labels for the unlabeled image group $T_i$; $m_i$ serves as an adaptive controller.
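A minimal sketch of one adaptation round under Eqs. 3.5 and 3.6 follows. The model interface (returning per-character distributions of shape `(batch, k, vocab)`) and the `criterion` realizing the cross-entropy of Eq. 3.5 are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader):
    """Generate hard pseudo-labels for group T_i together with m_i,
    the mean confidence over the group (the adaptive controller)."""
    labels, confs = [], []
    for x in unlabeled_loader:
        probs = model(x).softmax(dim=-1)   # (batch, k, vocab)
        conf, pred = probs.max(dim=-1)     # per-character confidences
        labels.append(pred)
        confs.append(conf.mean(dim=-1))    # word-level confidence
    m_i = torch.cat(confs).mean().item()
    return labels, m_i

def round_loss(model, x_s, y_s, x_t, y_t_pseudo, m_i, criterion):
    """Eq. 3.6: supervised loss on source data plus the pseudo-label
    loss on group T_i, weighted by the adaptive controller m_i."""
    return criterion(model(x_s), y_s) + m_i * criterion(model(x_t), y_t_pseudo)
```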
Both DD and HDGE are discriminator-based methods. However, there are distinctions between them. First, DD employs a single discriminator to distinguish between the two domains, whereas HDGE utilizes a pair of discriminators, each evaluating the out-of-distribution (OOD) status with respect to its own domain; HDGE then synthesizes the two OOD levels to quantify the domain gap. Second, DD relies on the backbone architecture of the problem (in our case, the STR baseline model), while HDGE is constructed independently.

Each method has its advantages. DD is effective when incorporating domain knowledge through the architecture and features of the binary classifier. HDGE is a general method that can use the same architecture for various tasks; however, its drawback lies in its dependency on the hyper-parameter $\beta$ in the formula. Both methods demonstrate effectiveness in our experiments. Depending on these key factors, one can choose between the two methods.
Additional Training Techniques
Label Sharpening. We "sharpen" the soft labels to encourage the model to update its parameters. Consequently, during the training process in Stage 2, we utilize the model's predictions on unlabeled data as definitive (hard) pseudo-labels rather than relying on their probabilities, as sketched below.
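A one-line sketch of the sharpening step (the tensor shape is an assumption for illustration):

```python
def sharpen(probs):
    """Turn soft per-character distributions of shape (batch, k, vocab)
    into definitive hard pseudo-labels, discarding the probabilities."""
    return probs.argmax(dim=-1)  # (batch, k) integer character indices
```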
Regularization. Regularization is a significant factor in self-training: without it, the model is not incentivized to change during self-training. Therefore, we also incorporate it into our model training process.
Our work focuses on addressing the domain gap between the source domain, which is synthetic data, and the target domain, which is real data, in scene text recognition.
Dataset
To overcome the limiting factors highlighted by [2, 28] on the effectiveness of synthetic-trained models in the real domain, our study aims to boost the performance of STR models in real scenarios without incurring human-annotation costs. We leverage two types of data during the training process: synthetic data (SynthText (ST) [20] and MJSynth (MJ) [27]) and real data without labels.
For real scene text data, we collect images from public real training datasets, including ArT [13], COCO [71], LSVT [68], MLT19 [53], RCTW17 [62], ReCTS [89], Uber [92], and TextOCR [65]. Similar to [2], we exclude vertical text (height > width) and images whose width is greater than 25 times the height. We discard the labels and obtain an unlabeled dataset of around 1M real samples, denoted as real unlabeled data (RU).
Dataset          Venue         Train    Valid   Eval
IIIT5k [51]      BMVC 2012      2,000       -    3,000
SVT [74]         ICCV 2011        257       -      647
IC13 [30]        ICDAR 2013       848       -    1,015
IC15 [31]        ICDAR 2015     4,468       -    2,077
SVTP [55]        ICCV 2013          -       -      645
ArT [13]         ICDAR 2019    32,349       -   35,149
ReCTS [89]       ICDAR 2019    25,328       -    2,592
LSVT [68, 67]    ICDAR 2019    43,244       -        -
MLT19 [53]       ICDAR 2019    56,937       -        -
RCTW17 [62]      ICDAR 2017    10,509       -        -

TABLE 4.1: Summary of dataset usage. Numbers indicate how many samples were used from each dataset. "†" refers to splits that were repurposed as training data.
Six real-world STR datasets have been widely used for evaluating trained STR models. Baek et al. [3] introduce these datasets by categorizing them into regular and irregular datasets. The benchmark datasets are labeled "regular" or "irregular" [63, 82, 11] according to the difficulty and geometric layout of the texts. First, regular datasets contain text images with horizontally laid-out characters that have even spacing between them. These represent relatively easy cases for STR: IIIT5K-Words (IIIT) [51], Street View Text (SVT) [74], and ICDAR2013 (IC13) [30]. Second, irregular datasets typically contain harder corner cases for STR, such as curved and arbitrarily rotated or distorted texts [63, 82, 11]: ICDAR2015 (IC15) [31], SVT Perspective (SVTP) [55], and CUTE80 (CUTE) [57].
Detailed information about these benchmark datasets can be found in prior studies [3, 63, 82]. We utilize the datasets re-annotated by [6] to address inconsistencies in our evaluation. It is important to note that both IC13 and IC15 have two versions of their respective test splits, which are commonly referenced in the literature: 857 and 1,015 images for IC13, and 1,811 and 2,077 images for IC15. To avoid any confusion, we refer to the benchmark as the combination of IIIT5k, CUTE, SVT, SVTP, IC13 (1,015), and IC15 (2,077).
For a more comprehensive comparison, we extend our evaluation to include four larger and more challenging datasets: COCO [71] (13.4k samples; occluded, low-resolution text), Uber [92] (36.2k samples; low resolution, difficult fonts), ArT [13] (3.2k samples; perspective and curved text), and ReCTS [89] (2.5k samples; difficult fonts and layouts).
Our approach leverages labeled synthetic data and unlabeled real data, as shown in Tab. 4.1. We discard the labels of the real datasets to align with our experimental setting. The "Train" figures we report differ slightly from [2, 6] because we use raw images (with labels discarded) and do not employ any label-dependent filtering.
Evaluation metric

In accordance with standard conventions [3], we report word-level accuracy for each dataset. Furthermore, to provide a thorough evaluation of models on both regular and irregular text, as per [2], we introduce an average score denoted as "Avg.". This score represents the accuracy over the combined set of samples from all six benchmark datasets (IIIT, SVT, IC13, IC15, SVTP, and CUTE), as computed below.
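For clarity, the sketch below computes "Avg." as a sample-weighted accuracy rather than a mean of per-dataset scores; plugging in the CRNN-StrDA_HDGE per-dataset accuracies from Tab. 4.5 together with the evaluation split sizes reproduces its 79.1 Avg.

```python
def benchmark_avg(results):
    """'Avg.' score: accuracy over the union of samples of all six
    benchmarks. `results` maps dataset -> (accuracy_percent, num_samples)."""
    correct = sum(acc / 100.0 * n for acc, n in results.values())
    total = sum(n for _, n in results.values())
    return 100.0 * correct / total

avg = benchmark_avg({
    "IIIT": (88.0, 3000), "SVT": (81.0, 647), "IC13": (89.6, 1015),
    "IC15": (65.2, 2077), "SVTP": (66.7, 645), "CUTE": (73.6, 288),
})
print(f"{avg:.1f}")  # 79.1
```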
STR model
According to Baek et al. [3], STR is performed in four stages: Transformation (Trans.), Feature extraction (Feat.), Sequence modeling (Seq.), and Prediction (Pred.), as illustrated in Fig. 4.1. For our experiments, we adopt two widely-used models from the STR benchmark [3]: CRNN [60] (None, VGG, BiLSTM, CTC) and TRBA [3] (TPS [26], ResNet, BiLSTM, Attention).
The two STR models, CRNN [3] and TRBA [3], are employed to assess the effectiveness of the proposed framework using their default configurations. Both STR models are initialized from the synthetic-trained (source) models (baseline models) in [2]. We chose CRNN and TRBA because they are widely adopted in domain adaptation work for STR, which makes them an appropriate choice for a fair and comprehensive comparison.
FIGURE 4.1: Visualization of an example flow of scene text recognition: an input image is normalized (Trans.), converted into a visual feature (Feat.), then into a contextual feature (Seq.), and finally decoded into a prediction via a fully connected layer (Pred.).
Domain Discriminator

Domain Discriminator (DD) is based on the architecture of the STR model: it employs a binary classifier consisting of the feature extractor (Feat.) of the baseline STR model followed by a fully connected layer, as shown in Fig. 4.1. A minimal sketch follows.
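The sketch below assumes the backbone exposes its feature extractor as a module producing a `(batch, C, H, W)` feature map; `feat_extractor` and `feat_dim` are hypothetical names, and the pooling choice is ours.

```python
import torch
import torch.nn as nn

class DomainDiscriminator(nn.Module):
    """Binary source/target classifier: the baseline STR model's
    feature extractor (Feat.) followed by one fully connected layer."""
    def __init__(self, feat_extractor, feat_dim):
        super().__init__()
        self.feat = feat_extractor
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, x):
        h = self.feat(x)         # (batch, C, H, W) visual feature map
        h = h.mean(dim=(2, 3))   # global average pooling -> (batch, C)
        return torch.sigmoid(self.fc(h)).squeeze(1)  # P(x from target)
```

Sorting target images by this output in ascending order yields the ordering required by Eq. 3.1, since a lower score means the image looks more source-like.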
Harmonic Domain Gap Estimator
TABLE 4.2: Discriminator architecture configuration for the Harmonic Domain Gap Estimator. Here, c, k, s, and p stand for the number of channels, filter size, stride, and padding, respectively.
To create a pair of discriminators $D_S$ and $D_T$ with the ability to assess out-of-distribution (OOD) levels effectively, we design a learning strategy inspired by [97]. Our discriminators ($D_S$ and $D_T$) are described in Tab. 4.2.
Input images are resized to 32 × 100. For both stages, we employ the AdamW [46] optimizer with a weight decay of 0.005. We also utilize the one-cycle learning rate scheduler [66] with a maximum learning rate of 0.0005. The training batch size is consistently set to 128, and gradient clipping is applied with a magnitude of 5. For Stage 1, we train the Domain Discriminator and the Harmonic Domain Gap Estimator for 20 epochs. In Stage 2, the STR models are trained for 50K iterations. All experiments are performed on an NVIDIA GeForce RTX 2080 Ti (11 GB VRAM).
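For reproducibility, here is a sketch of the optimizer setup matching these hyper-parameters; the function name and wiring are illustrative.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

def build_optimizer(model, total_steps):
    """AdamW (weight decay 0.005) with a one-cycle learning rate
    schedule peaking at the maximum learning rate of 5e-4."""
    optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=0.005)
    scheduler = OneCycleLR(optimizer, max_lr=5e-4, total_steps=total_steps)
    return optimizer, scheduler

# Per training step, gradients are clipped with magnitude 5 before stepping:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```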
FIGURE 4.2: The Stratified Domain Adaptation (StrDA) approach partitions the data from the target domain into three distinct groups, with the disparity across domains gradually rising, as shown in the image. For each sample, the figure shows the pseudo-labels employed in the self-training process of UDA and StrDA_HDGE, respectively (e.g., round 2: UDA reads "DISTNCT" while StrDA_HDGE reads "DISTRICT"; round 3: UDA reads "Mucotic" while StrDA_HDGE reads "Marcotte"). The pseudo-labels generated by UDA are prone to noise as the extent of domain shift escalates, whereas StrDA produces pseudo-labels with higher accuracy.
We conduct experiments using two STR models (CRNN [3] and TRBA [3]). We employ pre-trained models (trained only on synthetic data) from [2]. We aim to enhance the models' performance on the real test sets using domain adaptation approaches. To demonstrate the effectiveness of our stratified domain adaptation approach compared to traditional unsupervised domain adaptation (UDA) (Sec. 3.1), we ran independent experiments with the same protocols.
As illustrated in Tab. 4.3, all domain adaptation experiments (UDA, StrDA_DD, StrDA_HDGE) surpass the baseline models across ten public benchmarks. Despite relying solely on additional unlabeled real data and self-training with pseudo-labels, without employing any other advanced techniques, the experiments remarkably enhance the STR models' performance on both regular and irregular datasets. Specifically, CRNN and TRBA exhibit remarkable improvements on the CUTE dataset (+12.3% and +9.0%) when applying StrDA_DD. These results emphasize the significance of integrating real images into training STR models and showcase the advantages of our self-training framework. Furthermore, our experiments demonstrate an impressive improvement on irregular text benchmarks, such as IC15, SVTP, and CUTE80; the gains on these irregular benchmarks are significantly higher than on regular text benchmarks.

Notably, both of our StrDA approaches outperform traditional UDA. We observe that UDA does not perform well without any domain sequence, although it shows a slight improvement over the source model (improving Avg. by 2.2% for TRBA and 3.0% for CRNN). By organizing and partitioning data based on the progressive increase in domain shift, our progressive self-training framework demonstrates strong effectiveness.
FIGURE 4.3: Illustration of the experiment results in Tab. 4.3 (1: baseline, 2: UDA, 3: StrDA_DD, 4: StrDA_HDGE).
TABLE 4.3: Results on the scene text benchmarks: scene text recognition accuracy (%) over ten public datasets. The number of words in each dataset is listed with its name. "(baseline)" means models trained only on synthetic data (results reported in the original paper [2]). We present the results of our domain adaptation methods for each baseline model: "UDA" refers to traditional unsupervised domain adaptation, while StrDA_DD and StrDA_HDGE are our Stratified Domain Adaptation methods using the Domain Discriminator and the Harmonic Domain Gap Estimator, respectively. The numbers denoted by "Δ" in green indicate improvements over each dataset. In each column, the best result is shown in bold, and the best result for each model is underlined. Our methods significantly improve the performance of the baseline models.
Method             200K          500K          1M
TRBA*              85.7          85.7          85.7
TRBA-UDA           87.6 (+1.9)   88.1 (+2.4)   87.9 (+2.2)
TRBA-StrDA_DD      88.1 (+2.4)   88.3 (+2.6)   88.9 (+3.2)
TRBA-StrDA_HDGE    87.9 (+2.2)   88.6 (+2.9)   89.4 (+3.7)

TABLE 4.4: Domain adaptation performance as a function of the fraction of unlabeled data. We sample three scales (200K, 500K, and 1M) of data from RU (1M). "*" denotes the performance of the baseline model, which does not utilize domain adaptation with real data.
Besides performing well across various STR models, our approach also yields good results even with a limited quantity of unlabeled real data, as presented in Tab. 4.4. However, when the amount of data decreases to only 200K, our StrDA approach does not demonstrate significant superiority over UDA. This may occur for two reasons. First, the domain gap between the source and target domains is small, making the division of the domain gap still effective but not distinctly clear. Second, while the source and target domains remain large, there is not much data lying between the two data distributions, so the sequence of sub-domains may not help the model adapt progressively.
In this section, we compare our approaches with current state-of-the-art (SOTA) methods on the six test benchmarks. As shown in Tab. 4.5, the original TRBA had a relatively low average score, but after applying our approach, it increased by 3.7% (from 85.7% to 89.4%), surpassing ESIR, DAN, SE-ASTER, RobustScanner, and TextScanner. However, while our experiments significantly boosted the performance of CRNN and TRBA, our models could not outperform current advanced techniques. Nevertheless, we believe that our framework can also be utilized to augment SOTA models, thereby yielding further comparative results.
For previous domain adaptation methods, a fair comparison is challenging due to variations in baseline models and training environments, as each method is trained and tested on different datasets. We acknowledge the contributions of these previous works. It is worth noting that our framework can be viewed as a sequence of unsupervised domain adaptation (UDA) rounds; therefore, it would be highly advantageous to integrate advanced UDA techniques into Stage 2 of our framework.
Method             DA    IIIT   SVT    IC13   IC15   SVTP   CUTE   Avg.
CRNN-UDA           ✓     87.2   81.9   90.4   64.5   68.4   68.1   78.8
CRNN-StrDA_DD      ✓     88.3   82.1   89.4   66.0   67.6   71.5   79.5
CRNN-StrDA_HDGE    ✓     88.0   81.0   89.6   65.2   66.7   73.6   79.1
TRBA-UDA           ✓     94.3   91.2   95.0   76.5   83.6   80.9   87.9
TRBA-StrDA_DD      ✓     95.0   91.8   93.4   79.4   83.3   83.7   88.9

TABLE 4.5: Comparison with SOTA methods on STR test accuracy. "DA" indicates the use of domain adaptation; IIIT, SVT, and IC13 are regular-text benchmarks, while IC15, SVTP, and CUTE are irregular-text benchmarks.
FIGURE 4.4: Ablation studies with TRBA: (a) ablation on data division (accuracy vs. number of groups); (b) ablation on HDGE's formula, Eq. 4.1 (accuracy vs. hyper-parameter β).
An important question arises for our approach: given the source domain data and the arranged target domain data, how should we partition the data? What is a reasonable number of groups, and how many data points should be in each group? In this section, we compare the performance of StrDA under different numbers of groups. The total amount of unlabeled data remains 1M (RU), and the experiments are conducted with the same number of iterations and the same environment. We use TRBA for all ablation tests because of its strong performance.
According to Fig. 4.4a, StrDA is more successful for both methods when the data is divided into more, smaller groups. This is understandable, as smaller partitions reduce the domain disparity between consecutive groups, facilitating the model's adaptation. StrDA with the Harmonic Domain Gap Estimator consistently outperforms the Domain Discriminator variant, showcasing its superiority.
4.6.2 Ablation on Harmonic Domain Gap Estimator
In this section, we investigate the hyper-parameter β in the HDGE formula:

$$d_i = \frac{(1 + \beta^2) \cdot d_S(x_i^T) \cdot d_T(x_i^T)}{\beta^2 \cdot d_S(x_i^T) + d_T(x_i^T)}, \tag{4.1}$$

where $d_S$ and $d_T$ are the distances from the data point $x_i^T$ to the data distributions of the source domain and the target domain, respectively, and $0 < \beta < 1$. With a smaller β, the distance $d_i$ is biased toward the source domain.