From these analyses, this thesis proposes methods for Vietnamese STR which are based mainly on the Transformer architecture.
THEORETICAL BACKGROUND
Transformer Architecture
Vision Transformer (ViT)
1.2.1 Basics of Vision Transformer (ViT)
1.2.2 Self-supervised learning for ViT
Vision-Language Pretraining (VLP)
Conclusion
The first chapter introduces the Transformer architecture and its main components. Subsequently, the Vision Transformer (ViT) is presented, including its basic concepts and self-supervised learning for ViT. Besides, vision-language pretraining (VLP) is also discussed in this chapter.
SURVEY AND PROPOSED SOLUTION FOR VIETNAMESE SCENE TEXT RECOGNITION
Introduction of Scene Text Recognition (STR)
Text recognition is a part of the text reading task, which converts an image containing text into a machine-readable string. Text reading begins with text detection, where text instances are located in the image; text recognition then converts those instances into readable words.
There are many types of input image, such as scanned text, a photo of a document, or a scene image that contains text. Text appearing in a scene image is called scene text, and the scope of this thesis focuses on scene text recognition. Scene text recognition (STR) is a challenging problem because of many unique characteristics of text in natural scenes, such as image noise, scene complexity, viewpoint and brightness variations, and a large variety of font styles and shapes. For these reasons, STR remains a stand-alone problem in the research arena.
Figure 1.1 Example of a scene text image
Scene text recognition has many applications, for example automatic driving, real-time translation, and ID reading. There have been many studies worldwide addressing the difficulties of this task, and great progress has been made recently thanks to new technologies such as large-scale pretrained models, the Transformer architecture, and the explosion of open large-scale datasets. Despite the development of STR worldwide, there are not many studies or large open datasets for Vietnamese STR. The following is a survey of STR and Vietnamese STR.
Existing approaches and methods for STR
Structural Pattern Recognition (SPR) is an approach in pattern recognition that focuses on understanding and modelling the structure or relationships present within data. This method is particularly useful when dealing with complex patterns that possess inherent relationships, hierarchies, or compositions. This approach classifies characters or words in the image based on the relationships between pattern structures, and the structures are usually extracted using pattern primitives such as edges, contours, or connected-component geometry.
Graphical methods for text recognition often involve representing text as graphs, where nodes correspond to text components (characters, words, etc.) and edges denote the relationships or connections between these components. Graphical models are widely used in text recognition tasks to model the structural information and relationships within text data. Here are some graphical methods commonly used in text recognition:
Graph-based Representation: Text is represented as a graph, where nodes represent individual characters or words, and edges represent the relationships between them (e.g., spatial adjacency, linguistic context). The graph structure captures the spatial and contextual information among text elements.
Graph Convolutional Networks (GCNs): These neural network architectures operate on graph-structured data. In text recognition, GCNs can be used to process graph representations of text, leveraging node embeddings and graph convolutions to extract features and infer relationships between characters or words (a minimal sketch is given after this list).
Conditional Random Fields (CRFs): CRFs are a type of probabilistic graphical model used for structured prediction tasks in text recognition. They model dependencies between labels assigned to neighbouring text components, considering context and spatial relationships to improve recognition accuracy.
Recurrent Neural Networks (RNNs) with Attention Mechanisms: Although not strictly graphical, attention mechanisms can be visualized as graphs where each node represents a part of the input sequence, and attention scores act as edges that determine the relevance or importance of different parts of the input during processing.
Graph Transformer Networks: Inspired by transformer architectures, graph transformer networks can operate on graph-structured data. These models are capable of capturing global dependencies among text elements by attending over the entire graph structure.
Hierarchical Graph-based Models: These models incorporate hierarchies in the graph structure to capture multi-level relationships between text elements. They allow for modelling both local and global contextual information in a more structured manner.
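As a concrete illustration of the graph-based and GCN ideas above, the following is a minimal PyTorch sketch of a single graph convolution layer applied to a toy character graph. The graph, node features, and dimensions are hypothetical values chosen for illustration only, not part of any specific STR system.

```python
import torch
import torch.nn as nn

# Toy graph: 4 character nodes linked by spatial-adjacency edges.
# Node features (e.g., visual embeddings of each character) are random here.
num_nodes, feat_dim, out_dim = 4, 16, 8
H = torch.randn(num_nodes, feat_dim)

# Adjacency matrix with self-loops, then symmetric normalization
# D^{-1/2} (A + I) D^{-1/2}, as in a standard GCN layer.
A = torch.tensor([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=torch.float)
A_hat = A + torch.eye(num_nodes)
d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)

# One GCN layer: H' = ReLU(A_norm @ H @ W)
W = nn.Linear(feat_dim, out_dim, bias=False)
H_next = torch.relu(A_norm @ W(H))
print(H_next.shape)  # torch.Size([4, 8])
```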
Graphical methods in text recognition aim to exploit the structural information inherent in text data, considering spatial layout, linguistic context, and relationships between text components. These approaches often enhance recognition accuracy by leveraging the rich information encoded in the graph representations of text. The selection of a specific method depends on the nature of the text data, the complexity of relationships between elements, and the available computational resources for model training and inference.
Grammar-based methods in text recognition involve utilizing formal grammars, rules, or syntax to model and recognize text patterns. These methods rely on predefined linguistic rules or grammar structures to analyse and recognize text sequences. Here are some common grammar-based approaches used in text recognition:
Regular Expressions: Regular expressions define patterns of text using a sequence of characters and special symbols to represent rules for matching strings. They are used for simple pattern matching tasks where specific sequences need to be identified in the text (see the sketch after this list).
Finite State Automata (FSA): FSA models consist of states and transitions between states based on input symbols. They can recognize patterns defined by regular expressions and are used in text recognition for simple pattern matching and lexical analysis.
Context-Free Grammars (CFG): CFGs define a set of rules for generating valid sequences of text based on a defined grammar structure. They consist of terminals (actual words or characters) and non-terminals (variables representing classes of elements) with production rules. CFGs are commonly used in syntactic analysis and parsing in natural language processing tasks.
Parsing Algorithms: Parsing involves analysing the grammatical structure of a text based on a given grammar. Algorithms such as Recursive Descent, Earley's Algorithm, or the CYK Algorithm are used to parse text according to context-free or context-sensitive grammars.
Chomsky Hierarchy: This classification of grammars, proposed by Noam Chomsky, categorizes grammars into types such as regular grammars, context-free grammars, context-sensitive grammars, and unrestricted grammars, each with varying expressive power and complexity. These classifications are used to define formal rules for recognizing different types of languages.
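As a small illustration of the regular-expression idea, the sketch below validates recognized strings against two hand-written patterns, a step sometimes used to post-process recognizer output. The patterns (a date and a Vietnamese-style price) are assumptions made for this example only.

```python
import re

# Hypothetical post-processing: classify recognized strings by pattern.
date_pattern = re.compile(r"^\d{2}/\d{2}/\d{4}$")             # e.g. 25/12/2023
price_pattern = re.compile(r"^\d{1,3}(\.\d{3})*\s?(VND|đ)$")  # e.g. 120.000 VND

for text in ["25/12/2023", "120.000 VND", "HELLO"]:
    if date_pattern.match(text):
        kind = "date"
    elif price_pattern.match(text):
        kind = "price"
    else:
        kind = "free text"
    print(text, "->", kind)
```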
Proposed methods for Vietnamese STR
Permutation Language Modeling (PLM) was originally proposed for large-scale language pretraining, but recent works have adapted it for learning Transformer-based generalized sequence models capable of different decoding schemes. In this work, we adapt PLM for STR. PLM can be considered a generalization of AR modeling, and a PLM-trained model can be seen as an ensemble of AR models with shared architecture and weights. With the use of attention masks for dynamically specifying token dependencies, such a model can learn and use conditional character probabilities given an arbitrary subset of the input context, enabling monotonic AR decoding, parallel non-AR decoding, and even iterative refinement.
In summary, state-of-the-art (SOTA) STR methods opted for a two-stage ensemble approach in order to use bidirectional language context. The low word accuracy of their external LMs, despite increased training and runtime requirements, highlights the need for a more efficient approach. To this end, we propose a permuted autoregressive sequence (PARSeq) model for STR. Trained with PLM, PARSeq is a unified STR model with a simple structure, but it is capable of both context-free and context-aware inference, as well as iterative refinement using bidirectional (cloze) context.
2.4.1.1 Permuted Autoregressive Sequence Models
a) Model Architecture
Multi-head Attention (MHA) is extensively used by PARSeq. We denote it as MHA(q, k, v, m), where q, k, and v refer to the required query, key, and value parameters, while m refers to the optional attention mask.
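To make the MHA(q, k, v, m) notation concrete, the sketch below uses PyTorch's nn.MultiheadAttention with an optional boolean mask. The dimensions (d_model = 384, 12 heads, T = 25) are assumptions for illustration, not the exact configuration of the thesis.

```python
import torch
import torch.nn as nn

d_model, n_head, T = 384, 12, 25   # illustrative sizes only

mha = nn.MultiheadAttention(d_model, n_head, batch_first=True)

q = torch.randn(1, T + 1, d_model)  # query tokens
k = torch.randn(1, T + 1, d_model)  # key tokens
v = torch.randn(1, T + 1, d_model)  # value tokens

# Optional attention mask m: True marks pairs that may NOT attend.
m = torch.triu(torch.ones(T + 1, T + 1, dtype=torch.bool), diagonal=1)

out, _ = mha(q, k, v, attn_mask=m)  # MHA(q, k, v, m)
print(out.shape)                    # torch.Size([1, 26, 384])
```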
PARSeq follows an encoder-decoder architecture, shown in the figure, commonly used in sequence modeling tasks. The encoder has 12 layers while the decoder has only a single layer. This deep-shallow configuration is a deliberate design choice which minimizes the overall computational requirements of the model while having a negligible impact on performance.
ViT Encoder. The Vision Transformer (ViT) is the direct extension of the Transformer to images. A ViT layer contains one MHA module used for self-attention, i.e. $q = k = v$. The encoder is a 12-layer ViT without the classification head and the [CLS] token. An image $x \in \mathbb{R}^{W \times H \times C}$, with width W, height H, and number of channels C, is tokenized by evenly dividing it into $p_w \times p_h$ patches, flattening each patch, then linearly projecting them into $d_{model}$-dimensional tokens using a patch embedding matrix $W_p \in \mathbb{R}^{p_w p_h C \times d_{model}}$, resulting in $WH/(p_w p_h)$ tokens. Learned position embeddings of equal dimension are added to the tokens prior to being processed by the first ViT layer.
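The patch tokenization described above can be sketched as follows. The input size (128×32), patch size (4×8), and d_model = 384 are assumed values for illustration; the strided convolution is the usual equivalent of splitting, flattening, and linearly projecting the patches.

```python
import torch
import torch.nn as nn

W_img, H_img, C = 128, 32, 3   # image width, height, channels (assumed)
p_w, p_h = 4, 8                # patch size (assumed)
d_model = 384                  # token dimension (assumed)

x = torch.randn(1, C, H_img, W_img)  # image x

# Dividing into p_w x p_h patches, flattening, and projecting with W_p is
# equivalent to a convolution whose kernel and stride equal the patch size.
patch_embed = nn.Conv2d(C, d_model, kernel_size=(p_h, p_w), stride=(p_h, p_w))
tokens = patch_embed(x).flatten(2).transpose(1, 2)  # (1, WH/(p_w*p_h), d_model)

# Learned position embeddings of equal dimension are added to the tokens.
pos_embed = nn.Parameter(torch.zeros(1, tokens.size(1), d_model))
tokens = tokens + pos_embed
print(tokens.shape)  # torch.Size([1, 128, 384])
```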
Visio-lingual Decoder. The decoder follows the same architecture as the pre-LayerNorm Transformer decoder but uses twice the number of attention heads, i.e. $n_{head} = d_{model}/32$. It has three required inputs consisting of position, context, and image tokens, and an optional attention mask.
In the following equations, we omit LayerNorm and Dropout for brevity. The first MHA module is used for context–position attention:
$$h_c = p + \mathrm{MHA}(p, c, c, m) \in \mathbb{R}^{(T+1) \times d_{model}}$$
where T is the context length, $p \in \mathbb{R}^{(T+1) \times d_{model}}$ are the position tokens, $c \in \mathbb{R}^{(T+1) \times d_{model}}$ are the context embeddings with positional information, and $m \in \mathbb{R}^{(T+1) \times (T+1)}$ is the optional attention mask. Note that the use of special delimiter tokens ([B] or [E]) increases the total sequence length to T + 1.
The position tokens encode the target position to be predicted, each one having a direct correspondence to a specific position in the output. This parameterization is similar to the query stream of two-stream attention. It decouples the context from the target position, allowing the model to learn from PLM. Without the position tokens, i.e. if the context tokens are used as queries themselves as in standard Transformers, the model will not learn anything meaningful from PLM and will simply function like a standard AR model.
The supplied mask varies depending on how the model is used. During training, masks are generated from random permutations. At inference, it could be a standard left-to-right lookahead mask (AR decoding), a cloze mask (iterative refinement), or no mask at all (NAR decoding).
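The sketch below shows what these inference-time masks might look like for a short sequence, with True marking a position that may not be attended; delimiter tokens are ignored, and the cloze mask shown here (blocking only the diagonal) is one simple assumed construction.

```python
import torch

T = 5  # illustrative sequence length

# AR decoding: standard left-to-right lookahead mask.
ar_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Iterative refinement: cloze mask, where each position sees every other
# position, so characters are re-predicted from bidirectional context.
cloze_mask = torch.eye(T, dtype=torch.bool)

# NAR decoding: no mask at all.
nar_mask = None

print(ar_mask.int())
print(cloze_mask.int())
```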
The second MHA module is used for image–position attention:
$$h_i = h_c + \mathrm{MHA}(h_c, z, z) \in \mathbb{R}^{(T+1) \times d_{model}}$$
where no attention mask is used. The last decoder hidden state is the output of the MLP:
$$h_{dec} = h_i + \mathrm{MLP}(h_i) \in \mathbb{R}^{(T+1) \times d_{model}}$$
Finally, the output logits are
$$y = \mathrm{Linear}(h_{dec}) \in \mathbb{R}^{(T+1) \times (S+1)}$$
where S is the size of the character set (charset) used for training. The additional character pertains to the [E] token, which marks the end of the sequence. In summary, given an attention mask m, the decoder is a function of the form:
$$y(z, p, c, m) \in \mathbb{R}^{(T+1) \times (S+1)}$$
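Putting the decoder equations together, a minimal PyTorch sketch (omitting LayerNorm and Dropout, as in the text) might look like the following. The class name, hidden sizes, and charset size are illustrative assumptions rather than the exact PARSeq implementation.

```python
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    """Sketch of the visio-lingual decoder equations above."""
    def __init__(self, d_model=384, charset_size=95):
        super().__init__()
        n_head = d_model // 32  # n_head = d_model / 32
        self.ctx_pos_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.img_pos_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.head = nn.Linear(d_model, charset_size + 1)  # +1 for the [E] token

    def forward(self, z, p, c, m=None):
        # h_c = p + MHA(p, c, c, m): context-position attention
        h_c = p + self.ctx_pos_attn(p, c, c, attn_mask=m)[0]
        # h_i = h_c + MHA(h_c, z, z): image-position attention, no mask
        h_i = h_c + self.img_pos_attn(h_c, z, z)[0]
        # h_dec = h_i + MLP(h_i), then y = Linear(h_dec)
        h_dec = h_i + self.mlp(h_i)
        return self.head(h_dec)

# Example with random tensors: 8 image tokens z, T + 1 = 26 position/context tokens.
dec = DecoderSketch()
z = torch.randn(1, 8, 384)
p = torch.randn(1, 26, 384)
c = torch.randn(1, 26, 384)
print(dec(z, p, c).shape)  # torch.Size([1, 26, 96])
```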
b) Permutation Language Modeling
Given an image x, we want to maximize the likelihood of its text label $y = [y_1, y_2, \ldots, y_T]$ under the set of model parameters θ. In standard AR modeling, the likelihood is factorized using the chain rule according to the canonical ordering $[1, 2, \ldots, T]$, resulting in the model:
$$\log p(y \mid x) = \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$$
However, Transformers process all tokens in parallel, allowing the output tokens to access or be conditionally dependent on all the input tokens. In order to have a valid AR model, past tokens cannot have access to future tokens. The AR property is enforced in Transformers with the use of attention masks. For example, a standard AR model for a three-element sequence y will have the attention mask shown.
The key idea behind PLM is to train on all T! factorizations of the likelihood:
$$\log p(y \mid x) = \mathbb{E}_{z \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_\theta(y_{z_t} \mid y_{z_{<t}}, x)\right]$$
where $\mathcal{Z}_T$ denotes the set of all possible permutations of the index sequence $[1, 2, \ldots, T]$, and $z_t$ and $z_{<t}$ denote the t-th element and the first t−1 elements, respectively, of a permutation $z \in \mathcal{Z}_T$.
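To illustrate how one permutation z is turned into an attention mask for the context–position attention, the sketch below builds a boolean mask in which target position z_t may attend only to the positions z_{<t}. The helper name is made up for this example, and delimiter and padding handling are omitted.

```python
import torch

def plm_attention_mask(perm: torch.Tensor) -> torch.Tensor:
    """Mask for one permutation z of [0..T-1]: entry (i, j) is True when
    target position i may NOT attend to context position j, i.e. when j
    does not come strictly before i in the ordering z."""
    T = perm.numel()
    order = torch.empty(T, dtype=torch.long)
    order[perm] = torch.arange(T)               # rank of each position in z
    allowed = order.unsqueeze(1) > order.unsqueeze(0)
    return ~allowed                             # True = masked out

T = 4
z = torch.randperm(T)   # one of the T! factorizations used during training
print(z)
print(plm_attention_mask(z).int())
```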