VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER SCIENCE
BACHELOR THESIS
A Study on Handling Out-of-Vocab Problem
in Scene Text Recognition
Bachelor of Computer Science (Honors degree)
HỨA THANH TÂN - 19520257
Supervised by
DR THANH DUC NGO
HO CHI MINH CITY, 2023
The Thesis Defense Committee has been established in accordance with Decision ............, issued on ............ by the President of the University of Information Technology. The committee comprises individuals with expertise and knowledge in the area of study relevant to the thesis defense. To ensure that all aspects of the thesis defense are properly addressed, the following personnel have been chosen to comprise the committee:
• Chairman:
• Secretary:
• Member:
• Member:
I am immensely grateful for the invaluable assistance of Dr Ngo Duc Thanh, Duc Khanh, Minh Phuong, and An Tran in the successful completion of my thesis. Their dedication and commitment to my project provided me with the support and guidance I needed to ensure the accuracy and quality of my work.
Dr Ngo Duc Thanh has been a mentor to me throughout my research journey. His guidance, support, and expertise have been invaluable in teaching me the importance of critical thinking and in providing advice when needed. He has been a constant source of inspiration, giving me the confidence to question and challenge my current understanding and to push myself to reach my highest potential.
Duc Khanh and Minh Phuong were my peers in this research journey, and I am thankful for their insightful suggestions and feedback, which greatly improved the cohesiveness of my work. Their contribution to data collection and analysis was invaluable, and without their help I would have struggled to complete my thesis.
I am especially thankful to An Tran for her help in proofreading and refining my thesis. Her expertise and careful attention to detail helped ensure that my work was of the highest standard, and I could not have achieved my goal without her help.
Without the dedicated effort of these individuals, I would never have been able to complete my thesis, and I am profoundly thankful for their contribution. I am grateful for the time and energy they devoted to helping me reach my goal.
Abstract
1 Overview
1.1 Introduction
1.2 Introduction to Scene Text Recognition
1.2.1 Scene Text Recognition
1.2.2 Sub-problem in Scene Text Recognition [25]
1.2.2.1 Cropped Scene Text Recognition
1.2.2.2 End-to-end Scene Text Recognition
1.2.3 Applications of STR
1.2.4 Challenges of STR
1.3 Introduction to the Out-of-vocabulary problem
1.3.1 Out-of-vocabulary problem
1.3.2 Challenges of OOV problem
1.4 Scope of thesis
2 Related Work
2.1.1 Synthetic Datasets
2.1.2 Curved Datasets
2.1.3 Multi-oriented Datasets
2.1.4 Normal Datasets
2.2 Scene Text Recognition Methods
2.2.1 Language-free methods
2.2.1.1 CTC-based method
2.2.1.2 Segmentation-based method
2.2.2 Language-based methods
2.2.2.1 Attention-based method
2.2.2.2 Transformer-based method
2.2.2.3 Long-short Term Memory (LSTM) based method
2.3 Out-of-vocabulary Problem
2.3.1 Ensemble learning baselines approach
2.3.1.1 Traditional Approach
2.3.1.2 Modern Approach
2.3.2 Reducing impact of linguistic information over the visual feature approach
2.3.3 Automatically making decision to hold visual and linguistic approach
3.2 Experiment and Evaluation
3.2.1 OOV-ST Challenge Dataset
List of Figures
1.1 Scene Text Recognition
1.2 Cropped Scene Text Recognition
1.3 End-to-end System for Scene Text Recognition [20]
1.4 Autonomous vehicles recognize "STOP" term to decide next action
1.5 Variants in font [6]
1.6 Variants in color [10]
1.7 Variants in illumination
1.8 Variants in orientation and distortion [4]
1.9 Variants in language [21]
1.10 Vertical texts [11]
1.11 Out-of-vocabulary problem [17]
1.12 … words
1.13 Slang words - that mean throwing something
1.14 Typo words - different
1.15 Symbol - $
2.1 Synthetic images dataset
2.2 Curved images dataset [2]
2.3 Street view text dataset [24]
2.4 Normal images dataset [16]
2.5 OOV-ST Challenge dataset [19]
2.6 Four stages of STR frameworks
2.7 I2C2W Architecture [26]
2.8 The combination of transformer and CTC makes the model robust in the presence of low resolution images [26]
2.9 Some failure cases of I2C2W [26]
2.10 Architecture of TextScanner [23]
2.11 Robust prediction on Chinese script [23]
2.12 Simple and Strong baseline - SAR [13]
2.13 SAR compare to other solutions [13]
2.14 By using 2D Attention mechanism, this architecture localize characters without character-level annotations [13]
2.15 Some failure cases of SAR [13]
2.16 ABINet architecture [7]
2.17 ABINet evaluation high accuracy on low-quality images [7]
2.18 SCATTER architecture [14]
2.19 SCATTER with different number of selective block decoders [14]
2.20 Some failures of SCATTER [14]
2.21 Ensemble models architecture
2.22 OpenCCD architecture in simple version
2.23 OpenCCD architecture with full components
2.24 Comparison OpenCCD with other methods
2.25 OpenCCD robust result on non-Latin characters
2.26 The core architecture of VLAMD [12]
3.3 … is added into word "prepare"
3.4 Result on various aspect of OOV Dataset
3.5 Result our OpenCCD outputs on Vietnamese character with specific 3 types of error
3.6 Result our OpenCCD outputs on Vietnamese character with specific 3 types of error
OOV Out Of Vocabulary
STR Scene Text Recognition
IV In Vocabulary
CNN Convolutional Neural Network
RNN Recurrent Neural Network
BiLSTM Bidirectional Long Short Term Memory
CTC Connectionist Temporal Classification
SOTA State Of The Art
Abstract
This thesis introduces the field of scene text recognition, explores the out-of-vocabulary problem, and outlines some of the most recent methods used to address it. Scene text recognition is the process of recognizing and extracting text from an image and transforming it into a machine-readable format for further processing. The out-of-vocabulary problem is the challenge of recognizing text that is not included in a predefined vocabulary list, which calls for modern methods to address this issue. In this thesis:
• We discuss various approaches and techniques that have been used to tackle the out-of-vocabulary problem. These include those based on contextual information, those that predict characters not seen in the training set, and those that leverage deep learning techniques to learn representations of out-of-vocabulary words.
• We present an evaluation of existing approaches, their effectiveness in tackling the out-of-vocabulary problem, and a new dataset for assessing out-of-vocabulary words which are unstructured and have no common characteristics.
Additionally, we compare and analyze various state-of-the-art techniques to identify which works best for this purpose, discussing the results in the thesis and arriving at a better understanding of the current approaches to the out-of-vocabulary problem. To guide future research, this thesis further suggests potential directions and strategies to improve existing methods and address the out-of-vocabulary problem. These include exploring different ways of incorporating contextual information into deep learning models and leveraging transfer learning to pre-train models on related tasks. In addition, further research should be conducted on the development of robust and efficient approaches for recognizing out-of-vocabulary words.
Chapter 1
Overview
1.1 Introduction
Scene text recognition is a rapidly growing field of research with many potential applications in areas such as image recognition, computer vision, and natural language processing. In recent years, deep learning techniques have enabled researchers to build models that accurately recognize text in natural images. This has extended the scope of text recognition to areas including, but not limited to, document analysis, autonomous navigation, and medical image processing. In this thesis, we focus on the so-called "out-of-vocabulary problem", one facet of scene text recognition, and we study and evaluate a few solutions to it.
1.2 Introduction to Scene Text Recognition
1.2.1 Scene Text Recognition
Scene Text Recognition (STR) is the computer vision task of recognizing text in natural scenes. It is challenging because of the variations in font, color, illumination, occlusion, orientation, size, and language that can be present within an image. In addition, STR algorithms must take the context of the image as a whole into account to achieve high accuracy, because that context may provide additional clues that help to identify the text. For example, if the image contains a signpost, the context of the image can help to identify the language of the text on the sign. Therefore, to accurately recognize text in natural scenes, STR algorithms must be designed to consider all of these factors.
1.2.2 Sub-problem in Scene Text Recognition [25]
1.2.2.1 Cropped Scene Text Recognition
Cropped Text Recognition is the process of recognizing text from images which have been cropped around the text itself. The task remains challenging because the background of the image is usually cluttered and can contain other objects that interfere with recognition. To address this issue, multiple techniques have been proposed, such as incorporating contextual information into the recognition process, employing region-based convolutional neural networks (R-CNNs), and using multi-scale feature extraction.
• Input: a single image
• Output: the word contained in the input image
1.2.2.2 End-to-end Scene Text Recognition
End-to-End Text Recognition is a more recent approach in which a single deep learning model is used to directly recognize text from an image. This approach has the advantage of being able to recognize text in more complex images which contain multiple lines of text and even curved text. To achieve this, many end-to-end text recognition models combine convolutional layers for feature extraction, recurrent layers for sequence modeling, and attention mechanisms for sequence decoding.
• Input: a single image
• Output: a list of words, each with a polygon indicating the position of the word inside the input image
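The difference between the two sub-problems is easiest to see in their outputs. The following is a minimal sketch, not taken from any cited system: the `SpottedWord` type, the helper names, and the coordinates are all hypothetical. It shows how one end-to-end result (word plus polygon) can be reduced to the axis-aligned box that a cropped recognizer would take as its input region.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class SpottedWord:
    """One end-to-end result: a transcription and its polygon (hypothetical type)."""
    text: str
    polygon: List[Point]


def bounding_box(polygon: List[Point]) -> Box:
    """Return the axis-aligned box that encloses every vertex of the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def crop_regions(words: List[SpottedWord]) -> List[Tuple[str, Box]]:
    """Pair each transcription with the region a cropped recognizer would receive."""
    return [(w.text, bounding_box(w.polygon)) for w in words]


# Illustrative data only; the coordinates are made up.
example = [
    SpottedWord("CAVE", [(10, 12), (60, 10), (61, 30), (11, 32)]),
    SpottedWord("SPRING", [(70, 10), (150, 14), (149, 34), (69, 30)]),
]

for text, box in crop_regions(example):
    print(text, box)
```

In this framing, cropped recognition maps one region to one string, while end-to-end recognition must also produce the regions themselves.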
FIGURE 1.3: End-to-end System for Scene Text Recognition
1.2.3 Applications of STR
Scene Text Recognition (STR) has a variety of applications. These include:
• Autonomous vehicles: Scene Text Recognition plays a major role in making these vehicles smarter, more efficient, and safer. By recognizing street signs and other relevant information, it can help make navigation smoother and more secure for self-driving cars. As this technology continues to evolve, the potential of self-driving vehicles keeps growing.
• Mobile phone applications: Scene Text Recognition can be used to power applications such as scanning bar codes, allowing users to quickly and easily access information.
FIGURE 1.4: Autonomous vehicles recognize "STOP" term to decide next action
1.2.4 Challenges of STR
• Variations in font can make text difficult to recognize, as different fonts have distinct shapes and features. This is especially true for stylized fonts, which may contain glyphs not found in more traditional typefaces. The size and spacing of the font also matter, as does its style: some fonts are designed to be ornate or decorative, while others are plain and simple. All of these factors combine to make text challenging to recognize, especially when different fonts are used in the same document.
• Variations in color can also make text recognition difficult, as color affects how well individual characters and words can be discerned. For example, light colors can blend together, making it challenging to distinguish one letter or word from another. Colors such as yellow, green, and red can make words and characters less visible, particularly when viewed from a distance. Colors that are too dark can also be problematic, as they can obscure the text and make it difficult to read.
FIGURE 1.6: Variants in color
• Variations in illumination and blur can make text recognition difficult, as shadows or glare can obscure the written words and make them difficult to discern. Additionally, occlusion can present a significant challenge, as objects in front of the text can hide the words and make them virtually impossible to read.
• Variations in orientation and distortion can also present difficulty for text recognition, as text that is rotated or skewed from its normal orientation may be difficult to interpret, even for the most sophisticated algorithms. Such issues can be compounded by noise and other artifacts that interfere with the recognition process. To ensure successful recognition, it is important to use algorithms that are robust enough to handle these types of distortions.
• Variations in size can be a major obstacle to text recognition, as small text is much harder to distinguish from other elements in the image. This is particularly difficult for readers with vision impairments, who may not have the same level of visual acuity. Even with the aid of magnification tools, small text can be hard to make out. Additionally, text size can vary significantly from one image to the next, making recognition even more challenging.
FIGURE 1.8: Variants in orientation and distortion
• Variations in language can also make text recognition difficult, as different languages have different characters and symbols that make it challenging for a computer program to accurately interpret the text. For example, the hieroglyphics of ancient Egypt, the logographic characters of the Chinese language, and the writing systems of other Asian countries, such as Japan and Korea, all require special handling to read and interpret accurately. Additionally, scripts like Cyrillic and Greek have their own unique symbols, further complicating the task. While current text recognition software is quite advanced, accurately interpreting the full spectrum of language diversity remains a challenge.
• Vertical texts have long been a challenge for existing Scene Text Recognition models, because most of them are designed to interpret horizontal text images. As a result, these models are structurally unable to process vertical texts, which limits their practical applications. To interpret vertical texts, STR models must be adapted to handle the additional structural complexities present in vertical text images, such as the orientation of text lines and text blocks.
• The out-of-vocabulary problem is a major challenge for text recognition, as most models rely on dictionary-based methods that are limited to recognizing meaningful words. This is particularly problematic when dealing with unfamiliar words, dialects, or slang, as these will not be recognized by the model.
Finally, the context of the image as a whole must be considered, as different aspects of the scene can influence the text recognition process.
FIGURE 1.10: Vertical texts
1.3 Introduction to the Out-of-vocabulary problem
1.3.1 Out-of-vocabulary problem
The Out-of-Vocabulary (OOV) problem is a major challenge that is widely encountered in Scene Text Recognition: words that do not appear in the model's vocabulary significantly reduce recognition accuracy and prevent the model from generalizing. There are various strategies to combat it, such as increasing the size of the vocabulary or using a more sophisticated language model. Other approaches, such as data augmentation, transfer learning, and character-based models, can also help to mitigate the OOV problem. Furthermore, combining a recurrent neural network with a language model can improve accuracy by better capturing complex language patterns. Ultimately, understanding the OOV problem and the strategies to address it is crucial for improving the accuracy and generalization of STR models.
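To make the notion of an OOV word concrete, the sketch below splits recognition results into in-vocabulary (IV) and OOV subsets and reports word accuracy on each. It is an illustration under stated assumptions, not the evaluation protocol of any benchmark: here a word counts as OOV simply when its lower-cased ground truth is absent from the training vocabulary, and the function names, vocabulary, and predictions are invented for the example.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (ground truth, prediction)


def normalize(word: str) -> str:
    """Case-insensitive comparison; an assumption made for this sketch."""
    return word.lower()


def split_by_vocabulary(pairs: Iterable[Pair], train_vocab: Set[str]) -> Tuple[List[Pair], List[Pair]]:
    """Separate results into IV and OOV according to the ground-truth word."""
    iv: List[Pair] = []
    oov: List[Pair] = []
    for gt, pred in pairs:
        bucket = iv if normalize(gt) in train_vocab else oov
        bucket.append((gt, pred))
    return iv, oov


def word_accuracy(pairs: List[Pair]) -> float:
    """Fraction of words whose prediction matches the ground truth exactly."""
    if not pairs:
        return 0.0
    correct = sum(normalize(gt) == normalize(pred) for gt, pred in pairs)
    return correct / len(pairs)


# Toy data: vocabulary and predictions are made up for illustration.
train_vocab = {"stop", "high", "school"}
results = [
    ("STOP", "STOP"),
    ("HIGH", "HIGH"),
    ("SCHOOL", "SCHOOI"),
    ("gr8", "gr8"),
    ("xq7z", "xqtz"),
]

iv_pairs, oov_pairs = split_by_vocabulary(results, train_vocab)
print(f"IV accuracy:  {word_accuracy(iv_pairs):.2f}")
print(f"OOV accuracy: {word_accuracy(oov_pairs):.2f}")
```

Reporting the two subsets separately keeps a model that memorizes its training vocabulary from hiding poor OOV performance behind a high overall score.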
1.3.2 Challenges of OOV problem
The OOV problem is particularly challenging because it can manifest in many different ways. It can include rare words, unseen words, slang words, typos, math symbols, foreign language, and names of people or places:
• Rare words are words which occur very infrequently in the language, making them hard to recognize.
• Unseen words are words which have never been seen in the language before and are therefore not included in the vocabulary of any text recognition model.
• Slang words are words which are used in informal contexts, and can be difficult to distinguish from their formal counterparts.
• Typos are mistakes in writing, such as misspellings or misused punctuation, which can lead to the wrong word being identified.
FIGURE 1.14: Typo words - different
• Math symbols are a particular challenge for text recognition systems, as they are often difficult to distinguish from letters, and are rarely included in a model's vocabulary.
FIGURE 1.15: Symbol - $
• Foreign language words can also be difficult to identify, as they may not be familiar to the model.
• Finally, names of people and places are often difficult to identify, as they may not be included in the model's vocabulary.
All of these challenges can lead to incorrect or incomplete recognition of text. To address the OOV problem, recognition models must be trained on a large and diverse dataset which includes examples of all of these types of words and phrases. Furthermore, models need to use contextual information to determine the correct word or phrase, even if it is not included in the model's vocabulary. Finally, models should be able to recognize and correct typos, as these are a common source of OOV errors. With the right training and techniques, the OOV problem can be addressed and recognition accuracy can be improved.
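The tension between correcting typos and preserving genuine OOV words can be illustrated with a small lexicon-based post-processor. This is a hedged sketch rather than a method evaluated in this thesis: the lexicon, the `max_dist` threshold, and the function names are assumptions chosen for the example. A prediction is snapped to the nearest lexicon word when the Levenshtein distance is within the threshold.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct_with_lexicon(pred: str, lexicon: List[str], max_dist: int = 1) -> str:
    """Replace pred with its closest lexicon word if that word is near enough."""
    if not lexicon:
        return pred
    best = min(lexicon, key=lambda w: (edit_distance(pred, w), w))
    return best if edit_distance(pred, best) <= max_dist else pred


# Hypothetical lexicon and predictions, for illustration only.
lexicon = ["different", "prepare", "house"]

print(correct_with_lexicon("diferent", lexicon))  # a typo-like prediction
print(correct_with_lexicon("gr8", lexicon))       # far from every lexicon word
print(correct_with_lexicon("houze", lexicon))     # an OOV string close to a lexicon word
```

The third call shows the failure mode: if "houze" is what the image actually contains (for instance a brand name), the correction overwrites a correct OOV reading with an in-vocabulary word. This is the behavior that makes purely dictionary-driven decoding problematic for OOV text.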
1.4 Scope of thesis
This thesis makes the following contributions:
• Exploration and study of new approaches to deal with the OOV problem: We explored a number of approaches to solving the Out-of-Vocabulary (OOV) problem, including the use of contextualized embeddings and OOV dictionaries. We evaluated the performance of these approaches to determine which can be used most effectively to handle OOV words.
• Evaluating OOV approaches: We conducted an in-depth evaluation of the available OOV approaches, considering their accuracy, speed, and scalability. We compared the performance of each method to determine which approach is best suited for the task of OOV handling.
• Providing a new language dataset for evaluation: We generated a new language dataset for evaluation purposes. This dataset contains a variety of languages and OOV words, allowing us to evaluate the performance of the OOV approaches in a wide range of contexts.
1.5 Structure of thesis
Chapter 1: Overview and motivation of this thesis. This chapter explores the background of the research and the purpose of the thesis. It discusses why this topic is important and how it contributes to the field, and summarizes why this research was conducted and what the expected outcomes are.
Chapter 2: Exploration of recent approaches to the problem. This chapter provides an overview of the current state of the research, including prior work. It gives a detailed analysis of the methods used in the past and the results that were achieved, and discusses the limitations of the prior approaches.
Chapter 3: Evaluation of the new approach. This chapter presents the research conducted in the thesis. It includes the results that were achieved and a comparison to existing approaches. It discusses the strengths and weaknesses of the new approach, specifically OpenCCD [1], as well as possible future directions for the research.
Chapter 4: Conclusion. This chapter summarizes the findings of the thesis and discusses the impact of the research. It also discusses how the results of the research could be used in the future.
Chapter 2
Related Work
2.1.1 Synthetic Datasets
• SynthText has been extensively used in scene text recognition studies; word images are obtained by cropping the text regions given by the provided text-box annotations.
• Synth90K contains 9 million synthetic text images that have been widely used for training scene text recognition models.
FIGURE 2.1: Synthetic images dataset [9]
2.1.2 Curved Datasets
• CUTE80 (CUTE) is an image dataset consisting of 288 word images, most of which feature curved scene text. These word images were cropped from the original CUTE80 dataset, which contains a total of 80 scene text images.
• Total-Text (TOTAL) contains 1,253 training images and 300 test images which have been widely used for research on arbitrary-shaped scene text detection.
• SCUT-CTW1500 (SCUT-CTW) is a dataset of 1,000 training images and 500 test images, 1,500 images in total. It is widely used by researchers and practitioners in the computer vision domain for training and testing algorithms.
2.1.3 Multi-oriented Datasets
• ICDAR-2015 (IC15) contains incidental scene text images, that is, images captured without any preparation. It contains 4,468 text patches for training and 1,811 patches for testing, which are cropped from the original dataset.
• Street View Text (SVT) consists of 647 word images that are cropped from 249 street view images from Google Street View; most of the cropped word images are almost horizontal.
• Street View Text-Perspective (SVTP) contains 645 word images that are also cropped from Google Street View; many of them suffer from perspective distortions.
2.1.4 Normal Datasets
• IIIT 5K-words (IIIT5K) contains 2,000 training and 3,000 test word patches cropped from born-digital images collected from Google image search.
• ICDAR-2003 (IC03) contains 860 cropped word images from the Robust Reading Competition dataset of the International Conference on Document Analysis and Recognition (ICDAR) 2003.
• ICDAR-2013 (IC13) is used in the Robust Reading Competition of ICDAR 2013 and contains 848 word images for training and 1,095 for testing.
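For quick reference, the dataset sizes quoted in this chapter can be gathered into a small registry. The sketch below is only a convenience structure: the `Benchmark` class and its field names are hypothetical, and every number is copied from the descriptions above rather than verified against the original dataset releases. Units differ between entries (full images versus cropped words), so sizes should not be compared across units.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Benchmark:
    """Size of one dataset as quoted in the text (hypothetical helper type)."""
    unit: str
    train: Optional[int] = None
    test: Optional[int] = None
    total: Optional[int] = None  # used when the text gives no train/test split

    def size(self) -> int:
        """Total number of items, from the explicit total or from the split."""
        if self.total is not None:
            return self.total
        return (self.train or 0) + (self.test or 0)


# All figures are those stated in this chapter.
BENCHMARKS: Dict[str, Benchmark] = {
    "Synth90K": Benchmark(unit="synthetic images", total=9_000_000),
    "CUTE80": Benchmark(unit="word images", total=288),
    "Total-Text": Benchmark(unit="images", train=1_253, test=300),
    "SCUT-CTW1500": Benchmark(unit="images", train=1_000, test=500),
    "IC15": Benchmark(unit="word patches", train=4_468, test=1_811),
    "SVT": Benchmark(unit="word images", total=647),
    "SVTP": Benchmark(unit="word images", total=645),
    "IIIT5K": Benchmark(unit="word patches", train=2_000, test=3_000),
    "IC03": Benchmark(unit="word images", total=860),
    "IC13": Benchmark(unit="word images", train=848, test=1_095),
}

for name, bench in BENCHMARKS.items():
    print(f"{name:<13} {bench.size():>9,} {bench.unit}")
```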