Nom Optical Character Recognition
using Pseudo-Skeleton Feature
LE HONG TRANG
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Assoc Prof Dr NGUYEN NGOC BINH
A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
December, 2009
Table of Contents
1 Introduction
    1.1 A Brief Introduction to Nôm Characters
    1.2 OCR and Nôm Characters Recognition
    1.3 The Structure of Thesis
2 Related Works
        2.1.1 Neural Network Approach
        2.1.2 Tesseract OCR Engine
    2.2 Printed Chinese OCR
        2.2.1 Matching
        2.2.2 Structure Analysis
        2.2.3 Projection Profile
    2.3 Problems to Nôm OCRs
3 Pseudo-Skeleton Feature Extraction
    3.1 Skeleton Feature
        3.1.1 Medial Axis Transformation
            3.1.1.1 Definition
            3.1.1.2 Medial Axis Computation
    3.2 Pseudo-Skeleton Feature for Nôm OCR
        3.2.2 P-Skeleton Extraction
        3.2.3 Encoding the P-Skeleton Features
        3.2.4 Pre-filtering using statistical properties
4 Maximum Entropy Model for Nôm OCR
    4.2 Maximum Entropy Model
        4.2.1 Data Training
        4.2.2 Elements of Maximum Entropy Model
            4.2.2.1 Statistics
            4.2.2.2 Features
            4.2.2.3 Constraints
    4.3 Maximum Entropy Model for Nôm OCR
    5.1 System Overview
    5.2 Building the Sets of Data for Nôm Characters
        5.3.1 Test of Printed Nôm Characters
        5.3.2 Test of Kiều Story
A Source code of P-Skeleton Logical operators
B Source code of Pseudo-Skeleton Encoding Module
List of Figures
1.1 An example image for literary works in Nôm
1.2 Nôm overview
2.1 Components of Chinese characters
2.2 Examples of projection profiles
2.3 Survey to related OCRs
3.1 Some illustrations for Medial Axis
3.2 The computation of Medial Axis
3.3 Example of thinning process
3.4 Example of a reduction operation that does not preserve the topology
3.5 Illustration of P-Skeleton
3.6 An example of Pseudo-Skeleton of Nôm character
3.7 A process of full feature of Nôm character
System overview
Getting the set of Nôm characters from UniHan database
Converting Nôm characters from Unicode points to images
Testing of original Nôm images
Noised images by erasing black dots
Noised images by adding black dots
Experimental results of noised images by erasing black dots
Results of Kiều story recognition
List of Tables
5.1 Statistics of recognition in common font styles
5.2 Comparison with neural network approach
Chapter 1
Introduction
Nôm characters are an obsolete writing system of the Vietnamese language. They belong to the family of ideographic scripts and are based on Chinese characters. The first Nôm characters appeared in the 13th century and were widely used to record historical and cultural documents of Vietnam. Today, Nôm characters have been replaced entirely by a writing system based on the Latin alphabet.

Nôm characters are square-block characters, i.e. the entire character is composed within a square. They are built from components of Chinese characters but are read with Vietnamese sounds.

Nôm characters became popular in the 13th to 15th centuries. Since then, many documents in various fields such as literature, history, and law have been written in Nôm characters. A literary work written in Nôm characters is shown in Figure 1.1.

From the second half of the 19th century, when Vietnam was invaded by the French colonialists, French scholars worked to prevent the use of classical Chinese. Gradually, by 1915 and 1918 to 1919, Chinese characters were eliminated, which also led to the exclusion of Nôm characters. In the early 20th century, the national language became increasingly complete and popular, and it completely replaced Nôm characters. An overview of Nôm characters is shown in Figure 1.2.
In this figure, two Chinese (Hán) characters are on the left. The first one is “nam” in Vietnamese (“year” in English) and the second is “nam” in Vietnamese. The Nôm character in the middle is composed of the two Chinese characters: one represents the Nôm sound and the other the meaning. The word on the right, “năm”, is the current Vietnamese writing.

Vịnh Người Chửa Hoang
Cả nề cho nên hóa dở dang
Nổi lòng chàng có biết chăng chàng
Duyên thiên chưa thấy nhỏ đầu dọc
Phận liễu sao đà đây nét ngang
Chữ tình một khỏi thiếp xin mang
Quản bao miệng thể nhời chênh lệch
Không có nhưng mà có mới ngoan
Hồ Xuân Hương
Figure 1.1: An example image for literary works in Nôm
Optical character recognition (OCR) is one of the most important problems in digitizing paper-based or image-based documents. Today, after a long history of research and development, there are many good OCR tools for Latin characters. Recently, many approaches and tools for recognizing logographic (ideographic) languages such as Japanese and Chinese have been developed based on, for example, k-nearest neighbors (k-NN) (Dasarathy, 1991) and artificial neural networks (ANN) (Arbib, 1995; Barber, 2003).
In Vietnam, several groups have been researching and developing OCR software for the Vietnamese national language, whose characters are based on Latin characters. One of the first products is VnDOCR (công nghệ thông tin, 1997 1998), which can convert images in different formats to Microsoft Word files while preserving the characters’ format. Recently, an open source project named VietOCR¹, which uses the Tesseract OCR engine (Smith, 2007) as its back end, has also been able to recognize the Vietnamese national language very well.
However, many Vietnamese historical documents were written in Nôm, a writing system rooted in Chinese. Hence, the Nôm optical character recognition problem is meaningful for preserving the national cultural heritage. Up to now, research and development on Nôm OCR have remained limited, while many good-quality OCR tools for Chinese (Dan Klein, 2003) and Japanese have been developed. Their success encourages us to research and develop Nôm OCR software, which we hope will be useful for digitizing the many Nôm documents held in libraries and for helping younger generations learn and understand our traditional culture. We also aim to build software that can run on handheld devices, so that Vietnamese people and foreigners can install it on their mobile phones and use it when visiting historical sites.
Toward these goals, we propose an approach to Nôm OCR based on maximum entropy (Dan Klein, 2003) with the pseudo-skeleton feature, and then present some initial experimental results. We use 8488 common Nôm characters as the training data and test recognition on several paragraphs of the “Kiều Story” (the 1866 edition). The initial results are promising and allow us to draw some conclusions about the advantages and disadvantages of this approach, so that we can improve the recognition quality for practical applications.

¹ http://vietocr.sourceforge.net/
The rest of this thesis is organized as follows.

Chapter 2 introduces related work. It covers two previously developed approaches to Nôm OCR, one using a neural network and one using the Tesseract engine, as well as some work on offline Chinese OCR, since Chinese characters are of the same kind as Nôm characters.

Chapter 3 presents pseudo-skeleton feature extraction. The chapter first deals with the notion of a skeleton and its classical extraction methods. It then presents the pseudo-skeleton feature and the operations used to extract it from character images.

Chapter 4 presents the maximum entropy model that is applied to build the Nôm OCR system. The chapter briefly covers the concepts of the maximum entropy model, such as modeling, training data, statistics, features, constraints, and the principle of maximum entropy. Its most important part is the construction of the maximum entropy model for Nôm OCR, which is presented in the last section of the chapter.

Chapter 5 shows the implementation and experimental results.

Chapter 6 is the conclusion. It summarizes the results achieved in the thesis and raises some open problems and future work.

Appendices A and B show the source code of the main functions of the Nôm OCR system, such as the P-Skeleton operators and the feature encoding function.
Chapter 2
Related Works
Several groups have been researching and preserving the Nôm heritage; notably, 4232 characters have been accepted into the Unicode standard. Based on this standard source, we can save a lot of effort in building training data with different styles and fonts.

In our previous work (Pham, 2008), we applied several common methods to Nôm character recognition, such as the Tesseract OCR engine and a neural network, and tested recognition on the Kiều story. The results of those methods are reasonable, with an accuracy of approximately 90%.
2.1.1 Neural Network Approach
In the neural network approach, a Multi-Layer Perceptron (MLP) network is built for recognition. The network structure consists of three layers. The input layer has 32x32 signals, represented by a 32x32 bit array built from the analysis of a Nôm character. The output layer has 16 signals, represented by an array of 16 bits. The hidden layer is further divided into several hidden (sub)layers with different numbers of neural nodes per hidden layer. The hyperbolic tangent function is used as the activation function of the network:

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (2.1)$$

Each learning sample is a pair $(X_s, Y_s)$, where $X_s$ is a bit array of 32x32 elements representing a Nôm character image and $Y_s$ is an array of 16 bits representing the Unicode index of the character. The purpose of the training process is to find an array $W$ of weights such that $out_s = f(X_s, W) = Y_s$ for all learning samples. After the training process is completed, the weight set is stored for later recognition. Recognition is a simple process that converts an input sample $X$ to an output sample $Y$ based on the weight set $W$ created by the training process. The output sample $Y$ is recognised if it matches a standard sample output that has been used to train the network. Otherwise, the network cannot recognise the sample.
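The recognition step described above amounts to a forward pass through the network followed by decoding the 16 output signals. The following is a minimal sketch of that idea, not the implementation used in the previous work: the single hidden layer of 64 nodes, the random weights, the thresholding of outputs at zero, and the names `forward`, `bits_to_index` and `random_layer` are all assumptions made for illustration.

```python
import math
import random

def forward(x, layers):
    """Propagate input vector x through a list of (weights, biases) layers,
    applying the tanh activation after every layer."""
    for weights, biases in layers:
        x = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

def bits_to_index(outputs):
    """Threshold the outputs at 0 and read them as a big-endian integer
    (assumed decoding of the 16 output signals)."""
    index = 0
    for o in outputs:
        index = (index << 1) | (1 if o > 0 else 0)
    return index

def random_layer(n_in, n_out, rng):
    """Hypothetical initialisation: small uniform weights, zero biases."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
# 32x32 inputs -> 64 hidden nodes (assumed size) -> 16 outputs
layers = [random_layer(32 * 32, 64, rng), random_layer(64, 16, rng)]
image_bits = [rng.randint(0, 1) for _ in range(32 * 32)]  # stand-in for X
y = forward(image_bits, layers)
code_point = bits_to_index(y)
```

With untrained weights the decoded index is meaningless; the sketch only shows how a 32x32 bit array is mapped to a 16-bit code once a weight set W is available.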
2.1.2 Tesseract OCR Engine
Tesseract OCR (Smith, 2007) is an open source optical character recognition engine developed by HP between 1984 and 1994; it is now sponsored by Google². Feature extraction in Tesseract works as follows. First, it decomposes an input image into a set of subparts called components. The components are then analysed and stored in outline form. Blobs are generated by gathering outlines and are organized into text lines. The text lines are then broken into words based on the spacing between words. Finally, a cell containing each word is identified by chopping along the break lines.

Recognition then proceeds in two phases. First, Tesseract attempts to recognize each word in its cell. Then the recognized words are passed to a classifier as training data.

In the training phase, Tesseract uses image processing methods to segment the input image into lines and words. Feature extraction is then performed. The resulting data are treated as training data and stored in a KD-tree data structure.
Tesseract OCR uses the k-NN technique for object classification. Given a vector of $n$ features $(A_1(x), A_2(x), \ldots, A_n(x))$, the distance from object $x$ to object $y$ is calculated by:

$$D(x, y) = \sqrt{\sum_{i=1}^{n} \left(A_i(x) - A_i(y)\right)^2} \qquad (2.2)$$

The object closest to the sample is the one with the minimal overall distance over all features.
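The distance-based classification can be illustrated with a small k-NN sketch. It is not Tesseract's code: it uses a plain linear scan instead of a KD tree, the Euclidean distance over feature vectors, and a majority vote among the k nearest samples; the toy feature vectors and the names `euclidean` and `knn_classify` are assumptions for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(sample, training, k=3):
    """training is a list of (feature_vector, label) pairs.
    Return the majority label among the k nearest training vectors."""
    nearest = sorted(training, key=lambda item: euclidean(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.1), "B"),
]
label = knn_classify((0.2, 0.1), training, k=3)
```

A KD tree, as mentioned in the text, would replace the `sorted` call with a faster nearest-neighbour query; the decision rule stays the same.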
Both approaches were tested on the same data set, which has 4232 Nôm characters as input for training. Neural networks are fast, but the quality is not so good when the size of the learning set increases. We need to extract features and train the network on these features as well. Also, we can increase the number of input and hidden nodes to obtain better recognition quality. Tesseract OCR has many good pre-processing modules, and they can be applied to other approaches to increase their quality. There is also room for improvement, since the original design of Tesseract OCR is optimised only for English characters, whose number is much smaller than that of Nôm characters.

² http://code.google.com/p/tesseract-ocr
Nôm characters are strongly related to Chinese characters in both structure and representation. As relevant references, we have surveyed some approaches to printed Chinese optical character recognition. In Chinese recognition approaches (Mingrui Wu, 1999; Long-Wen Chang & Shang-Shungyu, 1994; Kai Yu & Zhuang, 2008; Sargur N Srihari, 2007; Richard Romero, 1995), perhaps the most important area of investigation deals with printed characters. An effective printed character recognition system would permit the rapid processing of vast amounts of printed Chinese material. Thus, most work on Chinese recognition has focused on printed characters (Stallings, 1976).
There are two basic approaches to Chinese recognition based on matching. The first is template matching. In this approach, both the training images and the input images are converted into matrices of 0 and 1 values, all of the same size. Each matrix from the training set is called a template and is stored in a reference database. Classification compares the input matrix with all templates by counting the positions at which the 0 and 1 values match. The recognition result is the template that best matches the input.
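Template matching as described here reduces to counting agreeing positions between two binary matrices. The sketch below illustrates the idea on tiny 3x3 matrices; the template labels and the function names `match_score` and `classify` are hypothetical, and a real system would use much larger matrices.

```python
def match_score(a, b):
    """Count the positions at which two equally sized 0/1 matrices agree."""
    return sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(image, templates):
    """templates maps a label to its 0/1 matrix.
    Return the label whose template agrees with the image most often."""
    return max(templates, key=lambda label: match_score(image, templates[label]))

templates = {
    "vertical":   [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "horizontal": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
}
# A vertical stroke with one missing pixel
noisy = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]
best = classify(noisy, templates)
```

Because every template must be compared with the input, the cost grows linearly with the number of templates, which is one reason the reference database can become expensive for large character sets.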
In the second approach, recognition is based on searching for geometric shapes within the characters. This method uses a discrete Fourier transform, and a shape is recognized by matching the average value of that transform. The recognition process consists of three phases. In the first phase, it finds one of 36 geometric shapes in the upper-left corner of the character. In the second phase, it finds one of 32 shapes in the bottom-right corner. These two phases select a subset of the entire set of Chinese characters being processed. The third phase recognizes the character within the obtained subset.
2.2.2 Structure Analysis
In this approach, characters are divided into sub-parts called components. The components of a character are independent of one another; they are usually distinguished by properties such as horizontal, vertical, and surrounding strokes. The set of components is then placed in a tree structure.

Characters are digitized into a matrix of 0-1 values. The components are obtained by repeatedly searching this matrix until a stroke segment and its end-point are identified. For each component found, a tree node is added. The process ends when all components have been processed. A two-dimensional array then describes the relationships between the components that represent the character.
Figure 2.1: Components of Chinese characters (panels include: group of components, code generation)
In practice, recognition in this method compares the input code with the codes in the sample set. The character code is obtained by traversing the constructed tree structure. The generation of the code chain ends when ...

... of projection profiles is shown in Figure 2.2.
Why do we not use the previous Nôm OCR approaches, or simply apply some Chinese OCR methods to the Nôm OCR problem? We consider these questions with reference to Figure 2.3.

First, most Nôm characters are composed from Chinese ones, but with additional strokes. As a result, Nôm characters are usually more complex than Chinese ones, and it is hard to apply these techniques to Nôm OCR.
Figure 2.3: Survey to related OCRs (column headings: Chinese (Hán), Nôm, National Language; methods shown: Structure Analysis, Projection Profile, Tesseract Engine)
We have tried two approaches to Nôm OCR. The first uses the Tesseract OCR engine, but this engine is optimized for Latin characters, so it did not work well for Nôm OCR. The second uses a neural network, but its performance is not suitable for handheld devices. In addition, both approaches require a lot of memory to run and to store the reference database.

For these reasons, we try to combine two light-weight techniques, pseudo-skeleton features and the maximum entropy model, for the Nôm OCR problem.
... of the shape. This information usually emphasizes geometrical and topological properties. Thus, with a skeleton we can reconstruct the shape from the information it contains. There are several variants of the skeleton, such as straight skeletons, morphological skeletons, and skeletons by influence zones (SKIZ). Skeletons have been used in several applications in computer vision, image analysis, and digital image processing, including optical character recognition, fingerprint recognition, visual inspection, and pattern recognition.
There are two main classes of skeletonization methods. The first is the “Medial Axis Transformation”, in which center lines are extracted by connecting the centers of blocks that cover the object. The center of a block is usually found by maximizing some distance function. The second consists of “Thinning Algorithms” (T.Pavlidis, 1980; Y.S.Chen & W.H.Hsu, 1987), which remove outer layers of pixels from an object while retaining any pixel whose removal would alter connectivity or shorten the legs of the skeleton.
... a simple form. This abstract shape is represented as a skeleton, which provides a unique description of the shape.

Figure 3.1: Some illustrations for Medial Axis

The term Medial Axis Transform describes a method that reduces the object to its medial axis and an associated radius function. It thus provides a representation of the object.
3.1.1.2 Medial Axis Computation
For a given object, the computation of the medial axis uses the Delaunay triangulation to construct a set of triangles. The resulting triangulation is the geometric dual of the Voronoi diagram (Aurenhammer, 1991) of the object. Each Delaunay triangle satisfies the empty circumcircle property. Therefore, if the triangulation can be constructed to conform to the boundary of the object, then the circumcircles approximate the maximal circles and their circumcenters approximate points on the medial axis (see Figure 3.2).
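The last step of this construction, taking the circumcenter of each Delaunay triangle as an approximate medial axis point, can be illustrated in isolation. The sketch below assumes the triangulation is already available (for example from a computational geometry library) and only shows the circumcircle computation for one triangle; the function name `circumcircle` is a hypothetical helper.

```python
import math

def circumcircle(a, b, c):
    """Return ((ux, uy), radius) of the circle through points a, b and c.
    Raises ValueError if the points are collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        raise ValueError("points are collinear")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), math.hypot(ax - ux, ay - uy)

# Right triangle: the circumcenter is the midpoint of the hypotenuse
center, radius = circumcircle((0, 0), (4, 0), (0, 3))
```

Collecting the centers over all interior triangles of a boundary-conforming triangulation gives the point set that approximates the medial axis, and the radii give the associated radius function.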
Thinning has several beneficial properties: it preserves the topology and the shape, forces the skeleton to lie in the middle of the object, and produces a skeleton that is one pixel (or voxel) wide. A reduction operation, however, does not preserve the topology if an object is disconnected or completely deleted, if a cavity (a white connected component surrounded by an object) is created or merged with the background, or if two cavities are merged.

Figure 3.4 illustrates a reduction operation that does not preserve the topology of the object.
The general idea of this approach is that each thinning algorithm can be sketched by the following pseudo-program:
repeat
    remove the outer layer of the object
until no points are deleted
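One concrete instance of this repeat-until scheme is the Zhang-Suen thinning algorithm, which is used below purely as an illustration; the thesis does not state which thinning algorithm it has in mind. The sketch assumes a binary image stored as a list of lists with a one-pixel background border, and the function names are hypothetical.

```python
def neighbours(img, r, c):
    """Neighbours P2..P9 of (r, c), clockwise, starting with the pixel above."""
    return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
            img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]

def transitions(n):
    """Number of 0 -> 1 transitions in the circular sequence P2..P9, P2."""
    return sum(a == 0 and b == 1 for a, b in zip(n, n[1:] + n[:1]))

def thin(image):
    """Zhang-Suen thinning of a 0/1 image whose border pixels are background."""
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])
    changed = True
    while changed:                       # repeat
        changed = False
        for step in (0, 1):              # two sub-iterations per pass
            to_delete = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    p2, p3, p4, p5, p6, p7, p8, p9 = n = neighbours(img, r, c)
                    if not (2 <= sum(n) <= 6 and transitions(n) == 1):
                        continue
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        to_delete.append((r, c))
            for r, c in to_delete:       # remove the outer layer in parallel
                img[r][c] = 0
            changed = changed or bool(to_delete)
    return img                           # until no points are deleted

# A bar three pixels thick inside a 7x9 image
bar = [[1 if 2 <= r <= 4 and 1 <= c <= 7 else 0 for c in range(9)]
       for r in range(7)]
skeleton = thin(bar)
```

The neighbourhood tests are what keep connectivity and end points intact while outer pixels are peeled away; the loop stops as soon as a full pass deletes nothing.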
Figure 3.4: Example of a reduction operation that does not preserve the topology

... preserve this information.
In the problem of cursive word recognition, an exact skeleton is not necessarily needed as a feature. Instead, we use information that preserves the important features. We call such a thin image a pseudo-skeleton.

Definition. A pseudo-skeleton of a cursive word is a thin image that satisfies the following requirements:
- Each cursive stroke is associated with a thin stroke of the same shape, proportional size, and the same location.
- Each terminal stroke is associated with an end-point.
- Each intersected stroke is associated with a junction.
- The size and curvature of every stroke are preserved.

Pseudo-skeletons have been applied to handwritten scripts (Emli-Mari Nel, 2009).
In this research, using logical operations on binary values significantly reduces the time needed to extract features compared with pixel-by-pixel loops. The operations used in this study are SHIFT, AND, and XOR. SHIFT moves the character image to the right and down by one bit. The AND and XOR operators are applied between two images, on each pair of bits at the same position, following the rules of traditional logic. Moreover, these operations can be implemented easily in many programming languages and in hardware.
3.2.2 P-Skeleton Extraction
The process of P-Skeleton feature extraction is shown in Figure 3.5. First, as shown in Figure 3.5 (a), the SHIFT operator shifts the image to the right and down by one bit. Next, the AND operator (denoted $\wedge$) is applied to the original image and the shifted image, as in Figure 3.5 (b). Finally, the XOR operator (denoted $\oplus$) is applied to the original image and the result of the AND operation, as in Figure 3.5 (c). After these calculations, only the pixels on the upper-left of the original image are retained, as in Figure 3.5 (d). We call the result the P-Skeleton feature.
Figure 3.5: Illustration of P-Skeleton. Panels: (a) shift operation, (b) AND operation, (c) XOR operation, (d) pseudo-skeleton.
Formally, the P-Skeleton extraction of a Nôm character can be expressed by the formula in equation 3.1:

$$Q_{i,j} = (P_{i,j} \wedge P_{i-1,j-1}) \oplus P_{i,j} \qquad (3.1)$$
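Equation 3.1 maps directly onto bitwise operations. The sketch below is an illustration rather than the code of Appendix A: it assumes each image row is stored as a Python integer whose most significant bit is the leftmost pixel, that $i$ indexes columns and $j$ indexes rows, and that pixels shifted in from outside the image are 0; the name `p_skeleton` is hypothetical.

```python
def p_skeleton(rows):
    """Apply Q = (P AND shift(P)) XOR P to an image given as row bitmasks.

    shift(P) moves the image one pixel right (>> 1) and one pixel down
    (row j takes the content of row j - 1), so shift(P)[i, j] = P[i-1, j-1].
    """
    shifted = [0] + [row >> 1 for row in rows[:-1]]
    return [(p & s) ^ p for p, s in zip(rows, shifted)]

# A solid 4x4 block inside a 6x6 image
block = [
    0b000000,
    0b011110,
    0b011110,
    0b011110,
    0b011110,
    0b000000,
]
skeleton = p_skeleton(block)
for row in skeleton:
    print(format(row, "06b"))
```

For the solid block, only the top row and the left column survive, which matches the statement that the pixels on the upper-left of the original image are retained. Each row costs one shift, one AND and one XOR, regardless of how many pixels it holds.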
The P-Skeleton generated by this formula retains the characteristic shape and size of the Nôm character. Moreover, these logical operators are very simple, so extracting the P-Skeleton saves a lot of time. This is an advantage of the method, since it can be applied in systems where resources are limited, such as handheld devices and mobile phones.
The code strings of Nôm characters are extracted from the p-skeleton images by using projection operations. The horizontal and vertical projection histograms are generated to obtain two code strings, $C_h$ and $C_v$. Therefore, the operation in the above rule can be decomposed into two sub-operations, one for the horizontal and one for the vertical p-skeleton.
Vertical p-skeleton: consider a pixel $P_{i,j}$ in a character image. The pixel $Q^{v}_{i,j}$ of the vertical p-skeleton is generated by modifying equation 3.1 into an XOR of $P_{i,j}$ with a shifted copy of the image, for $0 \le i < I_w$ and $0 \le j < I_h$, where $I_w$ is the image width and $I_h$ is the image height.

Horizontal p-skeleton: similarly to the vertical p-skeleton, the pixel $Q^{h}_{i,j}$ of the horizontal p-skeleton image is generated in the analogous way.
3.2.3 Encoding the P-Skeleton Features
The last step of feature extraction is encoding the P-Skeleton images. After being extracted, each P-Skeleton is projected again in the direction in which it was generated. The resulting histograms $H_h$ and $H_v$ are encoded into strings over the three symbols L, M, and S. Such a string is called the code string and represents the P-Skeleton feature. The encoding process is described briefly as follows.
Consider the projection pixel counts in the histogram $H$ of a horizontal or vertical p-skeleton image. Let $\theta$ be the location of the maximal value in $H$, i.e. $H_\theta = \max_{i \in [0, I_w)} H_i$. A segment $G$ is generated by iteratively extending to the neighbors of $\theta$. Initially, point $\theta$ is assigned to segment $G$, which has length 1. The left neighbor $\theta_l$ is added to segment $G$ if the histogram value at $\theta_l$ is larger than the value at the right neighbor $\theta_r$, i.e. $H_{\theta_l} > H_{\theta_r}$. Otherwise, if $H_{\theta_l} < H_{\theta_r}$, the right neighbor is added to segment $G$. The extension process terminates when the length of the segment is equal to a given threshold.
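The segment-growing step can be sketched as follows. This is an illustration under stated assumptions, not the encoding module of Appendix B: the text does not say what happens on ties or at the histogram boundary, so the sketch extends to the right on ties and stops early when neither neighbor exists. The function name `grow_segment` and the sample histogram are hypothetical.

```python
def grow_segment(hist, max_len):
    """Grow a segment around the histogram peak.

    Returns the inclusive (left, right) index range of the segment.
    Assumptions: ties extend to the right; growth stops at max_len or
    when the whole histogram has been covered.
    """
    peak = max(range(len(hist)), key=hist.__getitem__)
    left = right = peak
    while right - left + 1 < max_len:
        can_left, can_right = left > 0, right < len(hist) - 1
        if not (can_left or can_right):
            break
        if can_left and (not can_right or hist[left - 1] > hist[right + 1]):
            left -= 1      # left neighbor has the larger count
        else:
            right += 1     # right neighbor is larger (or tie)
    return left, right

hist = [0, 2, 5, 9, 7, 1, 0]
segment = grow_segment(hist, max_len=4)
```

For this histogram the peak is at index 3; the segment first takes the right neighbor (7 > 5), then the left neighbors at indices 2 and 1, and stops at length 4, giving the range (1, 4).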
3.2.4 Pre-filtering using statistical properties
In addition to the two code strings $C_h$ and $C_v$, three statistical features are designed for the pre-filtering process. These features $f_1$, $f_2$, $f_3$ are generated according to the following formulas:

$$f_1 \text{ (the relative pixel count of the p-skeleton)} = \sum_{i \in H_h} H_i / I_w + \sum_{i \in H_v} H_i / I_h$$

$$f_2 \text{ (the weighted value of code string } C_h) = 4 \times |\text{the count of code L in } C_h| + 2 \times |\text{the count of code M in } C_h| + |\text{the count of code S in } C_h| = \sum_{G \in C_h} w(G)$$

$$f_3 \text{ (the weighted value of code string } C_v) = 4 \times |\text{the count of code L in } C_v| + 2 \times |\text{the count of code M in } C_v| + |\text{the count of code S in } C_v| = \sum_{G \in C_v} w(G)$$

where $I_w$ is the image width and $I_h$ is the image height. Features $f_1$, $f_2$, and $f_3$ are thus generated from the statistics of the code strings.
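The three pre-filtering features are simple sums and can be sketched in a few lines. The example assumes that $f_1$ adds the horizontal histogram total divided by the image width to the vertical histogram total divided by the image height, and that the weights are $w(L) = 4$, $w(M) = 2$, $w(S) = 1$; the function names and the sample histograms and code strings are made up for illustration.

```python
WEIGHTS = {"L": 4, "M": 2, "S": 1}   # w(G) for each code symbol

def weighted_value(code):
    """Sum of w(G) over the symbols of a code string."""
    return sum(WEIGHTS[symbol] for symbol in code)

def prefilter_features(hist_h, hist_v, code_h, code_v, width, height):
    """Return (f1, f2, f3) for one character.

    f1: relative pixel count of the p-skeleton (assumed normalisation:
        horizontal total by width, vertical total by height)
    f2: weighted value of the horizontal code string
    f3: weighted value of the vertical code string
    """
    f1 = sum(hist_h) / width + sum(hist_v) / height
    return f1, weighted_value(code_h), weighted_value(code_v)

f1, f2, f3 = prefilter_features(
    hist_h=[0, 3, 5, 0], hist_v=[2, 2, 4, 0],
    code_h="LMS", code_v="MMS", width=4, height=4,
)
```

Because these values are cheap to compute and compare, they can be used to discard clearly different candidates before the code strings themselves are matched.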
As Figure 3.7 shows, the full p-skeleton feature of a Nôm character consists of the vertical and horizontal code strings and the three pre-filtering properties. Here, the code