Bachelor Thesis in Computer Science: A Study of Methods for Handling Out-of-Dictionary Words in Scene Text Recognition


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

FACULTY OF COMPUTER SCIENCE

BACHELOR THESIS

A Study on Handling Out-of-Vocab Problem

in Scene Text Recognition

Bachelor of Computer Science (Honors degree)

HỨA THANH TÂN - 19520257

Supervised by

DR THANH DUC NGO

HO CHI MINH CITY, 2023


The Thesis Defense Committee has been established in accordance with Decision ........, issued on ........ by the President of the University of Information Technology. The committee is composed of individuals with extensive expertise and knowledge in the area of study relevant to the thesis defense. To ensure that all aspects of the thesis defense are properly addressed, the following members have been chosen to form the committee:

• Chairman:

• Secretary:

• Member:

I am immensely grateful for the invaluable assistance of Dr Ngo Duc Thanh, Duc Khanh, Minh Phuong, and An Tran in the successful completion of my thesis. Their dedication and commitment to my project gave me the support and guidance I needed to keep the accuracy and quality of my work up to par.

Dr Ngo Duc Thanh has been an invaluable mentor to me throughout my research journey. His guidance, support, and expertise have been invaluable in teaching me the importance of critical thinking and in providing advice when I needed it. He has been a constant source of inspiration, giving me the confidence to question and challenge my current understanding and to push myself to reach my highest potential.

Duc Khanh and Minh Phuong were my peers in this research journey, and I am thankful for their insightful suggestions and feedback, which greatly improved the cohesiveness of my work. Their contribution to data collection and analysis was invaluable, and without their help I would have struggled to complete my thesis.

I am especially thankful to An Tran for her help in proofreading and refining my thesis. Her expertise and careful attention to detail helped ensure that my work was of the highest standard, and I could not have achieved my goal without her help.

Without the dedicated effort of these individuals, I would never have been able to complete my thesis, and I am profoundly thankful for their invaluable contribution. I am grateful for the time and energy they devoted to helping me reach my goal.

Contents

Abstract

1 Overview
1.1 Introduction
1.2 Introduction to Scene Text Recognition
1.2.1 Scene Text Recognition
1.2.2 Sub-problem in Scene Text Recognition [25]
1.2.2.1 Cropped Scene Text Recognition
1.2.2.2 End-to-end Scene Text Recognition
1.2.3 Applications of STR
1.2.4 Challenges of STR
1.3 Introduction to the Out-of-vocabulary problem
1.3.1 Out-of-vocabulary problem
1.3.2 Challenges of OOV problem
1.4 Scope of thesis

2 Related Work
2.1.1 Synthetic Datasets
2.1.2 Curved Datasets
2.1.3 Multi-oriented Datasets
2.1.4 Normal Datasets
2.2 Scene Text Recognition Methods
2.2.1 Language-free methods
2.2.1.1 CTC-based method
2.2.1.2 Segmentation-based method
2.2.2 Language-based methods
2.2.2.1 Attention-based method
2.2.2.2 Transformer-based method
2.2.2.3 Long Short-Term Memory (LSTM) based method
2.3 Out-of-vocabulary Problem
2.3.1 Ensemble learning baselines approach
2.3.1.1 Traditional Approach
2.3.1.2 Modern Approach
2.3.2 Reducing impact of linguistic information over the visual feature approach
2.3.3 Automatically making decision to hold visual and linguistic approach

3.2 Experiment and Evaluation
3.2.1 OOV-ST Challenge Dataset

List of Figures

1.1 Scene Text Recognition
1.2 Cropped Scene Text Recognition
1.3 End-to-end System for Scene Text Recognition [20]
1.4 Autonomous vehicles recognize "STOP" term to decide next action
1.5 Variants in font [6]
1.6 Variants in color [10]
1.7 Variants in illumination
1.8 Variants in orientation and distortion [4]
1.9 Variants in language [21]
1.10 Vertical texts [11]
1.11 Out-of-vocabulary problem [17]
1.12 ... Words
1.13 Slang words - that mean throwing something
1.14 Typo words - diferent
1.15 Symbol - $
2.1 Synthetic images dataset
2.2 Curved images dataset [2]
2.3 Street view text dataset [24]
2.4 Normal images dataset [16]
2.5 OOV-ST Challenge dataset [19]
2.6 Four stages of STR frameworks
2.7 I2C2W Architecture [26]
2.8 The combination of transformer and CTC makes the model robust in the presence of low resolution images [26]
2.9 Some failure cases of I2C2W [26]
2.10 Architecture of TextScanner [23]
2.11 Robust prediction on Chinese script [23]
2.12 Simple and Strong baseline - SAR [13]
2.13 SAR compared to other solutions [13]
2.14 By using 2D Attention mechanism, this architecture localizes characters without character-level annotations [13]
2.15 Some failure cases of SAR [13]
2.16 ABINet architecture [7]
2.17 ABINet evaluation high accuracy on low-quality images [7]
2.18 SCATTER architecture [14]
2.19 SCATTER with different number of selective block decoders [14]
2.20 Some failures of SCATTER [14]
2.21 Ensemble models architecture
2.22 OpenCCD architecture in simple version
2.23 OpenCCD architecture with full components
2.24 Comparison of OpenCCD with other methods
2.25 OpenCCD robust result on non-Latin characters
2.26 The core architecture of VLAMD [12]
3.3 ... is added into word "prepare"
3.4 Result on various aspects of OOV Dataset
3.5 Result of our OpenCCD outputs on Vietnamese characters with 3 specific types of error
3.6 Result of our OpenCCD outputs on Vietnamese characters with 3 specific types of error

List of Abbreviations

OOV Out Of Vocabulary

STR Scene Text Recognition

IV In Vocabulary

CNN Convolutional Neural Network

RNN Recurrent Neural Network

BiLSTM Bidirectional Long Short Term Memory

CTC Connectionist Temporal Classification

SOTA State Of The Art

Abstract

This thesis introduces the field of scene text recognition, explores the out-of-vocabulary problem, and outlines some of the most recent methods used to address it. Scene text recognition is the process of recognizing and extracting text from an image and transforming it into a machine-readable format for further processing. The out-of-vocabulary problem is the challenge of recognizing text that is not included in a predefined vocabulary list, which creates the need for modern methods to address this issue. In this thesis:

• We discuss various approaches and techniques that have been used to tackle the out-of-vocabulary problem. These include those based on the use of contextual information, those that predict characters not seen in the training set, and those that leverage deep learning techniques to learn representations of out-of-vocabulary words.

• We present an evaluation of existing approaches and their effectiveness in tackling the out-of-vocabulary problem, together with a new dataset for assessing out-of-vocabulary words which are unstructured and have no common characteristics.

Additionally, we compare and analyze various state-of-the-art techniques to identify which works best for this purpose, discussing the results in the thesis and arriving at a better understanding of the current approaches to the out-of-vocabulary problem. To guide future research, this thesis also suggests potential directions and strategies to improve existing methods and address the out-of-vocabulary problem. These include exploring different ways of incorporating contextual information into deep learning models and leveraging transfer learning to pre-train models on related tasks. In addition, further research should be conducted on the development of robust and efficient approaches for recognizing out-of-vocabulary words.

Chapter 1

Overview

1.1 Introduction

Scene text recognition is a rapidly growing and increasingly important field of research, due to its many potential applications in areas such as image recognition, computer vision, and natural language processing. In recent years, the development of deep learning techniques has enabled researchers to create models that can accurately recognize text from natural images. This has allowed researchers to extend the scope of text recognition to a range of areas, including but not limited to document analysis, autonomous navigation, and even medical image processing. With the rapid advances in deep learning, computer vision, and natural language processing, the possibilities for scene text recognition have grown quickly, making it an increasingly vital research field. In this thesis, we focus on the so-called "out-of-vocabulary problem", one facet of scene text recognition, and we study and evaluate a few solutions to this issue.

1.2 Introduction to Scene Text Recognition

1.2.1 Scene Text Recognition

Scene Text Recognition (STR) is a computer vision task in which text is recognized in natural scenes. It is challenging because of the variations in font, color, illumination, occlusion, orientation, size, and language that can be present within an image. In addition, STR algorithms must take the context of the image as a whole into account in order to achieve high recognition accuracy, because the context may provide additional clues that help to identify the text. For example, if the image contains a signpost, the context of the image can help to identify the language of the text on the sign. Therefore, in order to accurately recognize text from natural scenes, STR algorithms must be designed to consider all of these factors.

1.2.2 Sub-problem in Scene Text Recognition [25]

1.2.2.1 Cropped Scene Text Recognition

Cropped Text Recognition is the process of recognizing text from images which have been cropped around the text itself. This process is challenging because the background of the image is usually cluttered and can contain other objects that interfere with recognition. To address this issue, multiple techniques have been proposed, such as incorporating contextual information into the recognition process, employing region-based convolutional neural networks (R-CNNs), and using multi-scale feature extraction.

• Input: a single image

• Output: the word contained in the input image

1.2.2.2 End-to-end Scene Text Recognition

End-to-End Text Recognition is a more recent approach in which a single deep learning model is used to directly recognize text from an image. This approach has the advantage of being able to recognize text from more complex images which contain multiple lines of text and even curved text. To achieve this, many end-to-end text recognition models employ a combination of convolutional layers for feature extraction, recurrent layers for sequence modeling, and attention mechanisms for sequence decoding.

• Input: a single image

• Output: a list of words, each with a polygon indicating the position of the word inside the input image


FIGURE 1.3: End-to-end System for Scene Text Recognition
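The two input/output contracts described in this section can be made concrete with a small sketch. The names `WordInstance` and `bounding_box`, and the polygon coordinates, are hypothetical and chosen only for illustration; they do not come from any system discussed in this thesis. The sketch assumes that an end-to-end result is a list of words with polygons, and shows one simple way such a result could be reduced to crops for a cropped-word recognizer.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class WordInstance:
    """One word returned by an end-to-end system (hypothetical type)."""
    text: str
    polygon: List[Point]  # vertices outlining the word in the input image


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Illustrative end-to-end output: a list of words with their polygons.
end_to_end_output = [
    WordInstance("CAVE", [(10, 12), (60, 10), (61, 30), (11, 32)]),
    WordInstance("SPRING", [(70, 10), (150, 8), (151, 28), (71, 30)]),
]

# Each box could be cropped and passed to a cropped-word recognizer,
# whose output is a single string per crop.
for word in end_to_end_output:
    print(word.text, bounding_box(word.polygon))
```

An axis-aligned box is the simplest choice here; for curved or strongly rotated text, the polygon itself carries information that a plain box discards.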

1.2.3 Applications of STR

Scene Text Recognition (STR) has a variety of applications that are changing the way we use technology. These include:

• Autonomous vehicles: Scene Text Recognition plays a major role in making these vehicles smarter, more efficient, and safer. By recognizing street signs and other relevant information, Scene Text Recognition can help make navigation smoother, easier, and more secure for self-driving cars. As this technology continues to evolve, the potential of self-driving vehicles is growing and is likely to revolutionize the way we get around.

• Mobile phone applications: Scene Text Recognition can be used to power applications such as scanning bar codes, allowing users to access information quickly and easily.

FIGURE 1.4: Autonomous vehicles recognize "STOP" term to decide next action (panels: image, text detection, text recognition)

1.2.4 Challenges of STR

• Variations in font can make text difficult to recognize, as different fonts can have unique shapes and features that make them hard to distinguish. This is especially true for text written in a stylized font, which may contain unique characters that are not found in more traditional typefaces. The size and spacing of the font can also make it difficult to differentiate from other fonts. Furthermore, the font style makes a difference, as some fonts are designed to be ornate or decorative, while others are designed to be plain and simple. All of these factors can combine to make text challenging to recognize, especially when different fonts are used in the same document.

• Variations in color can also make text recognition difficult, as different colors can affect the ability of the reader to discern individual characters and words. For example, light colors can blend together, making it challenging to distinguish one letter from another or one word from another. Additionally, colors such as yellow, green, and red can make words and characters less visible, particularly when viewed from a distance. Furthermore, colors that are too dark can also be problematic, as they can obscure the text and make it difficult to read.


FIGURE 1.6: Variants in color

• Variations in illumination and blur can make text recognition difficult, as shadows or glare can obscure the written words and make them difficult to discern. Additionally, occlusion can present a significant challenge for text recognition, as objects in front of the text can obscure the words and make them virtually impossible to read.

• Variations in orientation and distortion can also present difficulty for text recognition, as text that is rotated or skewed away from its normal orientation may be difficult to interpret, even for the most sophisticated algorithms. Such issues can be compounded by the presence of noise and other artifacts that interfere with the recognition process. To ensure successful recognition, it is important to use algorithms that are robust enough to handle these types of distortions.

• Variations in size can be a major obstacle to text recognition, as small text can be much harder to distinguish from other elements on the page. This can be particularly difficult for readers with vision impairments, who may not have the same level of visual acuity. Even with the aid of magnification tools, small text can be hard to make out, making text recognition a difficult task. Additionally, text size can vary significantly from one document to the next, making the task of text recognition even more challenging.

FIGURE 1.8: Variants in orientation and distortion

• Variations in language can also make text recognition difficult, as different languages have different characters and symbols that can make it challenging for a computer program to accurately interpret the text. For example, the complex hieroglyphics of ancient Egypt, the diverse logographic characters of the Chinese language, and the intricate writing systems of other Asian countries, such as Japan and Korea, all require special handling to read and interpret accurately. Additionally, scripts like Cyrillic and Greek have their own unique symbols, further complicating the task of text recognition. While current text recognition software is quite advanced, it is still a challenge to accurately interpret the full spectrum of language diversity.

• Vertical texts have long been a challenge for existing Scene Text Recognition models, because they are mostly designed to interpret horizontal text images. As a result, these models are structurally unable to process vertical texts, which limits their practical applications. To interpret vertical texts, STR models must be adapted to handle the additional structural complexities, such as the orientation of text lines and text blocks, which are present in vertical text images.

• The out-of-vocabulary problem is a major challenge in text recognition, as most model-based predictions rely on dictionary-based methods that are limited to recognizing meaningful words. This is particularly problematic when dealing with unfamiliar words, dialects, or slang, as these will not be recognized by the model.

Finally, the context of the image as a whole must be considered, as different aspects of the scene can influence the text recognition process.
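To illustrate why dictionary-based prediction hurts out-of-vocabulary words, the following is a minimal sketch of lexicon-constrained post-processing: the raw prediction is replaced by the nearest dictionary word under edit distance. This is a generic technique shown under stated assumptions, not the method of any specific model in this thesis; the function names and the three-word lexicon are hypothetical.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def snap_to_lexicon(prediction: str, lexicon: List[str]) -> str:
    """Return the lexicon word closest to the raw prediction."""
    return min(lexicon, key=lambda w: edit_distance(prediction, w))


# Hypothetical lexicon; the sign in the image really reads "diferent".
lexicon = ["different", "prepare", "stop"]
print(snap_to_lexicon("diferent", lexicon))
```

Here the visually correct reading "diferent" is replaced by the dictionary word "different", so a misspelled sign, a slang word, or a name would be silently rewritten into something the image does not contain.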


FIGURE 1.10: Vertical texts

1.3 Introduction to the Out-of-vocabulary problem

1.3.1 Out-of-vocabulary problem

The Out of Vocabulary (OOV) problem is a major challenge that is widely encountered in Scene Text Recognition, because it can significantly reduce recognition accuracy and prevent the model from generalizing. There are various strategies to combat it, such as increasing the size of the vocabulary or using a more sophisticated language model. Other approaches, such as data augmentation, transfer learning, and character-based models, can also help to mitigate the OOV problem. Furthermore, using a recurrent neural network in combination with language models can improve the accuracy of the model by better capturing complex language patterns. Ultimately, understanding the OOV problem and the strategies to address it is crucial for improving the accuracy and generalization of STR models.
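As a concrete reading of the definition, the sketch below partitions test words into in-vocabulary (IV) and out-of-vocabulary (OOV) sets relative to the words seen in training. The function names are hypothetical, and case-insensitive matching is an assumption made for this illustration; a benchmark may define its vocabulary differently.

```python
from typing import Iterable, List, Set, Tuple


def build_vocabulary(train_words: Iterable[str]) -> Set[str]:
    """Collect the case-insensitive set of words seen during training."""
    return {w.lower() for w in train_words}


def split_iv_oov(test_words: Iterable[str],
                 vocab: Set[str]) -> Tuple[List[str], List[str]]:
    """Partition test words into in-vocabulary and out-of-vocabulary lists."""
    iv, oov = [], []
    for word in test_words:
        (iv if word.lower() in vocab else oov).append(word)
    return iv, oov


# Illustrative data only.
vocab = build_vocabulary(["STOP", "school", "high"])
iv, oov = split_iv_oov(["Stop", "diferent", "$", "HIGH"], vocab)
print(iv)   # ['Stop', 'HIGH']
print(oov)  # ['diferent', '$']
```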


1.3.2 Challenges of OOV problem

The OOV problem is particularly challenging because it can manifest in many different ways. It can include rare words, unseen words, slang words, typos, math symbols, foreign language, and names of people or places:

• Rare words are words which occur very infrequently in the language, making them hard to recognize.

• Unseen words are words which have never been seen before and are therefore not included in the vocabulary of the recognition model.

• Slang words are words which are used in informal contexts, and can be difficult to distinguish from their formal counterparts.

• Typos are mistakes in writing, such as misspellings or misused punctuation, which can lead to the wrong word being identified.

FIGURE 1.14: Typo words - different

• Math symbols are a particular challenge for recognition systems, as they are often difficult to distinguish from words, and are rarely included in a model’s vocabulary.

FIGURE 1.15: Symbol - $

• Foreign language words can also be difficult to identify, as they may not be familiar to the speaker or the model.

• Finally, names of people and places are often difficult to identify, as they may not be included in the model’s vocabulary.

All of these challenges can lead to incorrect or incomplete recognition of text. To address the OOV problem, recognition models must be trained on a large and diverse dataset which includes examples of all of these types of words and phrases. Furthermore, models need to be able to recognize context and use contextual information to determine the correct word or phrase, even if it is not included in the model’s vocabulary. Finally, models should be able to recognize and correct typos, as these are a common source of OOV errors. With the right training and techniques, the OOV problem can be addressed and recognition accuracy can be improved.

1.4 Scope of thesis

This thesis makes the following contributions:

• Exploration and study of new approaches to deal with the OOV problem: We explored a number of approaches to solving the Out-of-Vocabulary (OOV) problem, including the use of contextualized embeddings and OOV dictionaries. We evaluated the performance of these approaches to determine which can be used most effectively to handle OOV words.

• Evaluating OOV approaches: We conducted an extended evaluation of the available OOV approaches, considering their accuracy, speed, and scalability. We compared the performance of each method to determine which approach is best suited for the task of OOV handling.

• Providing a new language dataset for evaluation: We generated a new language dataset for evaluation purposes. This dataset contains a variety of languages and OOV words, allowing us to evaluate the performance of the OOV approaches in a wide range of contexts.
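The accuracy part of such an evaluation can be sketched as follows. This is a minimal illustration that assumes exact-match word accuracy reported separately on the IV and OOV subsets; the function names and sample pairs are hypothetical, and the thesis may use a different metric or protocol. Speed and scalability are not covered by this sketch.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (prediction, ground truth)


def word_accuracy(pairs: List[Pair]) -> float:
    """Fraction of pairs whose prediction matches the ground truth exactly."""
    if not pairs:
        return 0.0
    return sum(pred == gt for pred, gt in pairs) / len(pairs)


def accuracy_by_subset(pairs: List[Pair], vocab: Set[str]) -> Dict[str, float]:
    """Report accuracy on IV words, OOV words, and all words."""
    iv = [(p, g) for p, g in pairs if g in vocab]
    oov = [(p, g) for p, g in pairs if g not in vocab]
    return {
        "IV": word_accuracy(iv),
        "OOV": word_accuracy(oov),
        "all": word_accuracy(pairs),
    }


# Illustrative data: the third prediction was "corrected" into a dictionary word.
vocab = {"stop", "school"}
pairs = [("stop", "stop"), ("school", "school"),
         ("different", "diferent"), ("$", "$")]
scores = accuracy_by_subset(pairs, vocab)
print(scores)  # {'IV': 1.0, 'OOV': 0.5, 'all': 0.75}
```

Reporting the subsets separately matters because a model can score well overall while failing on most OOV words if the test set is dominated by IV words.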


1.5 Structure of thesis

Chapter 1: Overview and motivation of this thesis. This chapter explores the background of the research and the purpose of the thesis. It discusses why this topic is important and how it contributes to the field, and summarizes why this research was conducted and what the expected outcomes are.

Chapter 2: Exploration of recent approaches to the problem. This chapter provides an overview of the current state of the research, including prior work. It gives a detailed analysis of the methods used in the past and the results that were achieved, and discusses the limitations of the prior approaches.

Chapter 3: Evaluation of the new approach. This chapter presents the research conducted in the thesis. It includes the results that were achieved and a comparison to existing approaches. It discusses the strengths and weaknesses of the new approach, specifically OpenCCD [1], as well as possible future directions for the research.

Chapter 4: Conclusion. This chapter summarizes the findings of the thesis and discusses the impact of the research. It also discusses how the results of the research could be used in the future.

Chapter 2

Related Work

2.1.1 Synthetic Datasets

• SynthText has been extensively used in scene text recognition studies, where word images are obtained by cropping text regions according to the annotated text boxes provided.

• Synth90K contains 9 million synthetic text images that have been widely used for training scene text recognition models.

FIGURE 2.1: Synthetic images dataset

2.1.2 Curved Datasets

• CUTE80 (CUTE) is an image dataset consisting of 288 word images, most of which feature curved scene text. All of these word images were cropped from the original CUTE dataset, which contains a total of 80 scene text images.

• Total-Text (TOTAL) contains 1,253 training images and 300 test images which have been widely used for research on arbitrary-shaped scene text detection.

• SCUT-CTW1500 (SCUT-CTW) is a dataset of 1,000 training images and 500 test images, for a total of 1,500 images. It provides a comprehensive set of data for training and testing algorithms and is therefore widely used by researchers and practitioners in the computer vision domain.

2.1.3 Multi-oriented Datasets

• ICDAR-2015 (IC15) contains incidental scene text images that were captured without prior preparation. It contains 4,468 text patches for training and 1,811 patches for testing, which are cropped from the original dataset.

• Street View Text (SVT) consists of 647 word images that are cropped from 249 street view images from Google Street View. Most of the cropped word images are almost horizontal.

• Street View Text-Perspective (SVTP) contains 645 word images that are also cropped from Google Street View. Many of them suffer from perspective distortions.

2.1.4 Normal Datasets

• IIIT 5K-Words (IIIT5K) contains 2,000 training and 3,000 test word patches cropped from born-digital images collected from Google image search.

• ICDAR-2003 (IC03) contains 860 cropped word images from the Robust Reading Competition dataset of the International Conference on Document Analysis and Recognition (ICDAR) 2003.

• ICDAR-2013 (IC13) is used in the Robust Reading Competition of ICDAR 2013 and contains 848 word images for training and 1,095 for testing.


Title	Handling Out-of-Vocab Problem in Scene Text Recognition
Author	Hứa Thanh Tân
Supervisor	Dr. Thanh Duc Ngo
University	University of Information Technology
Major	Computer Science
Type	Bachelor Thesis
Year of publication	2023
City	Ho Chi Minh City

Format
Number of pages	66
File size	40.14 MB