
Nom Optical Character Recognition

using Pseudo-Skeleton Feature

LE HONG TRANG

Faculty of Information Technology

University of Engineering and Technology

Vietnam National University, Hanoi

Supervised by

Assoc. Prof. Dr. NGUYEN NGOC BINH

A thesis submitted in fulfillment of the requirements for the degree of

Master of Computer Science

December, 2009


Table of Contents

1 Introduction

1.1 A Brief Introduction to Nôm Characters

1.2 OCR and Nôm Characters Recognition

1.3 The Structure of Thesis

2 Related Works

2.1.1 Neural Network Approach

2.1.2 Tesseract OCR Engine

2.2 Printed Chinese OCR

2.2.1 Matching

2.2.2 Structure Analysis

2.2.3 Projection Profile

2.3 Problems to Nôm OCRs

3 Pseudo-Skeleton Feature Extraction

3.1 Skeleton Feature

3.1.1 Medial Axis Transformation

3.1.1.1 Definition

3.1.1.2 Medial Axis Computation

3.2 Pseudo-Skeleton Feature for Nôm OCR

3.2.2 P-Skeleton Extraction

3.2.3 Encoding the P-Skeleton Features

3.2.4 Pre-filtering using statistical properties

4 Maximum Entropy Model for Nôm OCR

4.2 Maximum Entropy Model

4.2.1 Data Training

4.2.2 Elements of Maximum Entropy Model

4.2.2.1 Statistics

4.2.2.2 Features

4.2.2.3 Constraints

4.3 Maximum Entropy Model for Nôm OCR

5.1 System Overview

5.2 Building the Sets of Data for Nôm Characters

5.3.1 Test of Printed Nôm Characters

5.3.2 Test of Kiều Story

A Source code of P-Skeleton Logical operators

B Source code of Pseudo-Skeleton Encoding Module

List of Figures

1.1 An example image for literary works in Nôm

1.2 Nôm overview

2.1 Components of Chinese characters

2.2 Examples of projection profiles

2.3 Survey to related OCRs

3.1 Some illustrations for Medial Axis

3.2 The computation of Medial Axis

3.3 Example of thinning process

3.4 Example of a reduction operation that does not preserve the topology

3.5 Illustration of P-Skeleton

3.6 An example of Pseudo-Skeleton of Nôm character

3.7 A process of full feature of Nôm character

System overview

Getting the set of Nôm character from UniHan database

Converting Nôm character from Unicode points to images

Testing of original Nôm images

Noised images by erasing black dots

Noised images by adding black dots

Experimental results of noised images by erasing black dots

Results of Kiều story recognition

List of Tables

5.1 Statistics of recognition in common font styles

5.2 Comparison with neural network approach

Chapter 1

Introduction

Nôm characters are an obsolete writing system of the Vietnamese language. They belong to the ideographic writing systems and are based on Chinese characters. The first Nôm characters appeared in the 13th century and were widely used to record historical and cultural documents of Vietnam. Today, Nôm characters have been replaced entirely by a writing system based on the Latin alphabet.

Nôm characters are square-block characters, i.e. the entire character is composed within a square. They are built from components of Chinese characters but are read with Vietnamese sounds.

Nôm characters became popular in the 13th to 15th centuries. Since then, many documents in various fields such as literature, history, and law have been written in Nôm characters. A literary work written in Nôm characters is shown in Figure 1.1.

From the second half of the 19th century, when Vietnam was invaded by French colonialism, French scholars prevented the use of classical Chinese. Gradually, by 1915 and 1918 to 1919, Chinese characters were eliminated, which also led to the exclusion of Nôm characters. In the early 20th century, the national language became increasingly complete and popular and replaced Nôm characters completely. An overview of Nôm characters is shown in Figure 1.2.

In this figure, the two Chinese (Hán) characters are on the left. The first one is “năm” in Vietnamese (“year” in English) and the second is “nam” in Vietnamese. The Nôm character in the middle is composed from the two Chinese characters: one represents the Nôm sound and the other the meaning. The word “năm” on the right is the current Vietnamese writing.

Figure 1.1: An example image for literary works in Nôm (the poem “Vịnh Người Chửa Hoang” by Hồ Xuân Hương)

1.2 OCR and Nôm Characters Recognition

Optical character recognition (OCR) is one of the most important problems in digitizing paper-based or image-based documents. Today there are many good OCR tools for Latin characters, after a long history of research and development. Recently, many approaches and tools for recognizing logographic (ideographic) languages such as Japanese and Chinese have been developed based on, for example, k-nearest neighbors (k-NN) (Dasarathy, 1991) and artificial neural networks (ANN) (Arbib, 1995; Barber, 2003).

In Vietnam, several groups have been researching and developing OCR software for the Vietnamese national language, whose characters are based on Latin characters. One of the first products is VnDOCR (công nghệ thông tin, 1997 1998), which can convert images in different formats to Microsoft Word files while preserving the characters' format. Recently, an open source project named VietOCR¹, which uses the Tesseract OCR engine (Smith, 2007) as its back-end, has also been able to recognize the Vietnamese national language very well.

However, many Vietnamese historical documents were written in Nôm, a writing system rooted in Chinese. Hence, the Nôm optical character recognition (OCR) problem is meaningful for preserving the national cultural heritage. Up to now, research and development for Nôm OCR are still limited, while many OCR tools for the Chinese (Dan Klein, 2003) and Japanese languages have been developed with good quality. Their successes encourage us to research and develop Nôm OCR software, and we hope that it will be useful for digitizing the many Nôm documents in libraries and for helping young generations learn and understand our traditional culture. We also aim at building software that can run on handheld devices, so that Vietnamese and foreigners can install it on their mobile phones and use it when visiting historical sites.

Toward these goals, we propose an approach to Nôm OCR based on maximum entropy (Dan Klein, 2003) with the pseudo-skeleton feature and then present some initial experimental results. We use 8488 common Nôm characters as the training data and test recognition on several paragraphs of the “Kiều Story” (the 1866 edition). The initial result is promising and gives us some conclusions about the advantages and disadvantages of this approach, so that we can improve the quality of the recognition for practical application.

¹ http://vietocr.sourceforge.net/

1.3 The Structure of Thesis

The rest of this thesis is organized as follows.

Chapter 2 introduces related works, which consist of two previously developed approaches to Nôm OCR, one using a neural network and one using the Tesseract engine. It also covers some works on offline Chinese OCR, since Chinese characters are of the same kind as Nôm characters.

Chapter 3 presents pseudo-skeleton feature extraction. In this chapter, we first deal with the notion of a skeleton and its classical extraction methods. We then present the pseudo-skeleton feature and the operations used to extract it from character images.

Chapter 4 presents the maximum entropy model that is applied to build the Nôm OCR system. The chapter briefly covers the concepts of the maximum entropy model, such as modeling, training data, statistics, features, constraints, and the principle of maximum entropy. The most important part is building the maximum entropy model for Nôm OCR, which is presented in the last section of the chapter.

Chapter 5 shows the implementation and experimental results.

Chapter 6 is the conclusion. It summarizes the achieved results of the thesis and raises some open problems and future work.

Appendices A and B show the source code of the main functions in the Nôm OCR system, such as the P-Skeleton operators and the feature coding function.


Chapter 2

Related Works

Several groups have been researching and preserving the Nôm heritage; notably, 4232 characters have been accepted into the Unicode standard. Based on this standard source, we can save a lot of effort in building training data with different styles and fonts.

In our previous work (Pham, 2008) we applied several common methods for Nôm character recognition, such as the Tesseract OCR engine and a neural network, and tested recognition on the Kiều story. The results of those methods are reasonable, with an accuracy of approximately 90%.

2.1.1 Neural Network Approach

In the neural network approach, a Multi-Layer Perceptron (MLP) network is built for recognition. The network structure consists of three layers. The input layer has 32x32 signals, represented by a 32x32 bit array built from the analysis of a Nôm character. The output layer has 16 signals, represented by an array of 16 bits. The hidden layer is further divided into several hidden (sub)layers with different numbers of neural nodes in each. The hyperbolic tangent function is used as the activation function of the network as follows:

\[ f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (2.1) \]

Each learning sample is a pair $(X_s, Y_s)$, where $X_s$ is a bit array of 32x32 elements representing a Nôm character image and $Y_s$ is an array of 16 bits representing the Unicode index of the character. The purpose of the training process is to find an array $W$ of weights such that $out_s = f(X_s, W) = Y_s$ for all learning samples. After the training process is completed, the weight set is stored for later recognition. Recognition is a simple process which converts an input sample $X$ to an output sample $Y$ based on the weight set $W$ created by the training process. The output sample $Y$ is recognised if it matches a standard sample output that has been used to train the network. Otherwise, the network cannot recognise the sample.
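The recognition step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the network of (Pham, 2008): the bias terms, the thresholding of the tanh outputs at 0, the most-significant-bit-first decoding, and the names `forward`, `decode_bits`, and `recognise` are all hypothetical choices made for the example.

```python
import math

def forward(x, layers):
    """Propagate input vector x through the layers; each layer is (weights, biases)."""
    activation = list(x)
    for weights, biases in layers:
        activation = [
            math.tanh(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation

def decode_bits(outputs):
    """Threshold tanh outputs at 0 (assumed) and read them as an integer, MSB first."""
    value = 0
    for out in outputs:
        value = (value << 1) | (1 if out > 0 else 0)
    return value

def recognise(x, layers, known_codes):
    """Return the decoded code if it was seen in training, otherwise None."""
    code = decode_bits(forward(x, layers))
    return code if code in known_codes else None

# Toy network: 2 inputs and a single layer of 2 outputs.
# The thesis uses 32x32 = 1024 inputs and 16 outputs.
toy_layers = [([[5.0, -5.0], [-5.0, 5.0]], [0.0, 0.0])]
```

With the toy weights, `recognise([1, 0], toy_layers, {2})` decodes the output bits to 2, which is among the known codes, while an input whose decoded code was never used in training is rejected with `None`.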

2.1.2 Tesseract OCR Engine

Tesseract OCR (Smith, 2007) is an open source optical character recognition engine developed by HP between 1984 and 1994; it is now sponsored by Google². Feature extraction in Tesseract works as follows. First, it decomposes an input image into a set of subparts, which are called components. The components are then analyzed and stored in outline form. Blobs are generated by gathering outlines and are organized into text lines. The text lines are then broken into words based on the spacing between words. After that, the cell which contains each word is identified by chopping along the break lines.

Recognition then proceeds in two phases. First, it attempts to recognize each word in its cell. Then the recognized words are passed to a classifier as training data.

In the training phase, Tesseract uses image processing methods to decompose the input image into lines and words. Feature extraction is then performed. These data are considered as training data and stored in a KD-tree data structure.

Tesseract OCR uses the k-NN technique for object classification. Given a vector of $n$ features $(A_1(x), A_2(x), \ldots, A_n(x))$, the distance from object $x$ to object $y$ is calculated by:

\[ D(x, y) = \sqrt{\sum_{i=1}^{n} \left( A_i(x) - A_i(y) \right)^2} \qquad (2.2) \]

The object closest to the sample is the one for which this overall distance, taken over all features, is minimal.
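As a minimal sketch of the distance rule in equation 2.2, the following Python code classifies a sample by its nearest training vector. It assumes k = 1 and a plain linear scan instead of the KD-tree mentioned above; the function names and the toy feature vectors are hypothetical.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors, as in equation 2.2."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_label(sample, training):
    """training: list of (feature_vector, label) pairs; returns the closest label."""
    vector, label = min(training, key=lambda item: euclidean(sample, item[0]))
    return label

# Two toy training vectors with made-up labels.
training_set = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
print(nearest_label((1.0, 1.0), training_set))
```

A KD-tree would return the same answer while avoiding a comparison against every stored vector, which matters once the training set holds thousands of characters.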

These two approaches have been tested on the same data set, which has 4232 Nôm characters as input for training. Neural networks are fast, but the quality is not so good when the size of the learning set increases. We need to extract features and train the network on these features as well. We can also increase the number of input and hidden nodes to obtain better recognition quality. Tesseract OCR has many good pre-processing modules, and they can be applied to other approaches to increase their quality. There is also room for improvement, since the original design of Tesseract OCR is optimised only for English, whose number of characters is much smaller than that of Nôm.

² http://code.google.com/p/tesseract-ocr

2.2 Printed Chinese OCR

Nôm characters are strongly related to Chinese characters in both structure and representation. As relevant references, we have surveyed some approaches to printed Chinese optical character recognition. In Chinese recognition approaches (Mingrui Wu, 1999; Long-Wen Chang & Shang-Shungyu, 1994; Kai Yu & Zhuang, 2008; Sargur N Srihari, 2007; Richard Romero, 1995), perhaps the most important area of investigation deals with printed characters. An effective printed character recognition system would permit the rapid processing of vast amounts of printed Chinese material. Thus, most Chinese recognition work has focused on printed characters (Stallings, 1976).

2.2.1 Matching

There are two basic approaches to Chinese recognition based on matching. The first is template matching. In this approach, both the training images and the input images are converted into matrices of the same size with the values 0 and 1. The matrix of a training image is called a template and is stored in a reference database. Classification is based on comparing the input matrix with every template and counting the positions at which the 0 and 1 values match. The recognition result is the template that best matches the input.
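Template matching as described here can be sketched as follows. The matrices are lists of 0/1 rows of equal size; the function names and the two tiny 3x3 templates are made up for illustration.

```python
def match_score(image, template):
    """Count the positions at which two equally sized 0/1 matrices agree."""
    return sum(
        pixel == ref
        for image_row, template_row in zip(image, template)
        for pixel, ref in zip(image_row, template_row)
    )

def best_template(image, templates):
    """templates: dict mapping a label to its 0/1 matrix; returns the best label."""
    return max(templates, key=lambda label: match_score(image, templates[label]))

templates = {
    "bar": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "dash": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
}
noisy_bar = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]  # one pixel of the bar is missing
print(best_template(noisy_bar, templates))
```

The noisy input agrees with the "bar" template at 8 of 9 positions and with "dash" at 6, so "bar" is selected.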

In the second approach, recognition is based on searching for geometric shapes within the characters. In this method, a discrete Fourier transformation is used, and the recognition of a shape is then done by matching the average value of that Fourier transformation. The recognition process includes three stages. In the first stage, it finds one of 36 geometric shapes in the quarter-left corner of the character. In the second stage, it finds one of 32 shapes in the bottom-right. These two stages yield a subset of characters out of the entire set of Chinese characters being processed. The third stage recognizes a character within the obtained subset.

2.2.2 Structure Analysis

In this approach, the characters are divided into sub-parts, which are called components. The components of a character are independent of each other; they are often distinguished by properties such as horizontal, vertical, and surrounding strokes. The set of components is then placed in a tree structure.

Characters are digitized into matrices with 0-1 values. Based on this matrix, the components are obtained by repeatedly searching the matrix until a stroke segment and its end-point are identified. For each component found, a tree node is added. This process ends when all components have been processed. A two-dimensional array then shows the relationships between the components which represent the character.

Figure 2.1: Components of Chinese characters (legible panel titles: group of components; code generation, with the example code 0024620634426)

In fact, recognition in this method is a comparison of the input code with the codes in the sample set. In particular, the character code is obtained by traversing the built tree structure. The generation of the code chain ends when

of projection profiles is shown in figure 2.2

2.3 Problems to Nôm OCRs

Why don't we use the previous Nôm OCR approaches, or simply apply some Chinese OCR methods to the Nôm OCR problem? Intuitively, we consider these issues with reference to Figure 2.3.

First, most Nôm characters are composed from Chinese ones, but with additional strokes. As a result, Nôm characters are usually more complex than Chinese ones, and it is hard to apply these techniques to Nôm OCR.

Figure 2.3: Survey to related OCRs (legible labels: Chinese (Hán), Nôm, National Language; Structure Analysis, Projection Profile, Tesseract Engine)

We have tried two approaches for Nôm OCR. The first one uses the Tesseract OCR engine, but this engine is optimized for Latin characters, so it did not work well for Nôm OCR. The second one uses a neural network, but its performance is not suitable for handheld devices. In addition, both approaches require a lot of memory to run and to store the reference database.

For these reasons, we try to combine two lightweight techniques, pseudo-skeleton features and the maximum entropy model, for the Nôm OCR problem.

of the shape. This information usually emphasizes geometrical and topological properties. Thus, with a skeleton we can reconstruct the shape based on the information that it includes. There are some variants of the skeleton, such as straight skeletons, morphological skeletons, and skeletons by influence zones (SKIZ). Skeletons have been used in several applications in computer vision, image analysis, and digital image processing, including optical character recognition, fingerprint recognition, visual inspection, and pattern recognition.

There are two main classes of skeletonization methods. The first category is the “Medial Axis Transformation”, in which center lines are extracted by connecting the centers of blocks that cover the object. The center of a block is usually found by maximizing some distance function. The second category consists of “Thinning Algorithms” (T.Pavlidis, 1980; Y.S.Chen & W.H.Hsu, 1987), which work by removing outer layers of pixels from an object while retaining any pixel whose removal would alter connectivity or shorten the legs of the skeleton.

a simple form. This abstract shape is represented as a skeleton. It provides a unique description of the shape.

Figure 3.1: Some Illustrations for Medial Axis

The term Medial Axis Transform is used to describe a method which reduces the object to its medial axis and an associated radius function. Thus it provides a representation of the object.

3.1.1.2 Medial Axis Computation

For a given object, the computation of the Medial Axis uses the Delaunay triangulation to construct a set of triangles. The obtained triangulation is the geometric dual of the Voronoi diagram (Aurenhammer, 1991) of the object. Each Delaunay triangle satisfies the empty circumcircle property. Therefore, if the triangulation can be constructed to conform to the boundary of the object, then the circumcircles approximate the maximal circles and their circumcenters approximate points on the medial axis (see Figure 3.2).
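To make the last step concrete, the sketch below computes the circumcenter and circumradius of a single triangle, which is the quantity that would be collected for every Delaunay triangle to approximate the medial axis. Building the triangulation itself is not shown, and the function name `circumcircle` is an assumption of this example.

```python
import math

def circumcircle(a, b, c):
    """Return ((x, y), radius) of the circle through the 2D points a, b, c."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        raise ValueError("the three points are collinear")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), math.hypot(ux - ax, uy - ay)

# Right triangle: the circumcenter is the midpoint of the hypotenuse.
center, radius = circumcircle((0, 0), (4, 0), (0, 4))
print(center, radius)
```

For the right triangle in the example the center is (2, 2) and the radius is the square root of 8, as expected from elementary geometry.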

Thinning has some beneficial properties: it preserves the topology and the shape, forces the “skeleton” to lie in the middle of the object, and produces a “skeleton” that is one pixel/voxel wide. A reduction operation in general, however, does not preserve the topology: an object may be disconnected or completely deleted, a cavity (a white connected component surrounded by an object) may be created or merged with the background, and, finally, two cavities may be merged. Figure 3.4 illustrates a reduction operation that does not preserve the topology of the object.

The general idea of this approach is that each thinning algorithm can be sketched by the following pseudo-program:

repeat

    remove the outer layer of the object

until no points are deleted
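One concrete instance of this repeat/until loop is the classic Zhang-Suen thinning algorithm, sketched below in plain Python. It is named here as a well-known example of the scheme, not as the method used in this thesis. The sketch assumes a binary image given as a list of 0/1 rows with at least a one-pixel background margin; border pixels are never examined.

```python
def zhang_suen(image):
    """Thin a binary image (list of 0/1 rows) with the Zhang-Suen algorithm."""
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])

    def neighbours(r, c):
        # P2..P9: clockwise, starting from the pixel directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
                img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]

    changed = True
    while changed:                      # repeat ... until no points are deleted
        changed = False
        for step in (0, 1):             # two sub-iterations per pass
            to_delete = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] == 0:
                        continue
                    n = neighbours(r, c)
                    black = sum(n)
                    transitions = sum(n[i] == 0 and n[(i + 1) % 8] == 1 for i in range(8))
                    north, east, south, west = n[0], n[2], n[4], n[6]
                    if step == 0:
                        ok = north * east * south == 0 and east * south * west == 0
                    else:
                        ok = north * east * west == 0 and north * south * west == 0
                    if 2 <= black <= 6 and transitions == 1 and ok:
                        to_delete.append((r, c))
            for r, c in to_delete:      # remove the marked outer layer at once
                img[r][c] = 0
            changed = changed or bool(to_delete)
    return img

# A bar 3 pixels thick and 7 pixels long inside a 7x11 image.
bar = [[1 if 2 <= r <= 4 and 2 <= c <= 8 else 0 for c in range(11)] for r in range(7)]
thin = zhang_suen(bar)
```

Pixels are marked first and deleted together at the end of each sub-iteration, so the outcome does not depend on the scanning order; the conditions on the neighbour count and on the 0-to-1 transitions are what keep the remaining stroke connected.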

Figure 3.4: Example of a reduction operation that does not preserve the topology

preserve this information.

In the problem of cursive word recognition, an original skeleton is not necessarily needed as the feature. Instead, we use information that preserves the important features. We call such a thin image a pseudo-skeleton.

Definition. A pseudo-skeleton of a cursive word is a thin image that satisfies the following requirements:

Each cursive stroke is associated with a thin stroke of the same shape, proportional size, and the same location.

Each terminal stroke is associated with an end-point.

Each intersected stroke is associated with a junction.

The size and curvature of every stroke are preserved.

The pseudo-skeleton has been applied to handwritten scripts (Emli-Mari Nel, 2009).

In this research, logical operations performed on binary values significantly reduce the time needed to extract features compared with loops. The operations used in this study are SHIFT, AND, and XOR. SHIFT moves the character image one bit to the right and one bit down. AND and XOR are applied between two images, to each pair of bits in the same position, under the rules of traditional logic. Moreover, these operations can easily be implemented in many programming languages and in hardware.

3.2.2 P-Skeleton Extraction

The process of P-Skeleton feature extraction is shown in Figure 3.5. First, as shown in Figure 3.5 (a), the SHIFT operator shifts the image one bit to the right and one bit down. Next, the AND operator (denoted $\wedge$) is applied to the original image and the shifted image, as in Figure 3.5 (b). Finally, the XOR operator (denoted $\oplus$) is applied to the original image and the image just produced by AND, as in Figure 3.5 (c). After these calculations, only the pixels on the upper-left of the original image are retained, as in Figure 3.5 (d). We call the result the P-Skeleton feature.

Figure 3.5: Illustration of P-Skeleton. Panels: (a) shift operation, (b) AND operation, (c) XOR operation, (d) pseudo-skeleton.
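The SHIFT, AND, and XOR steps of Figure 3.5 can be sketched with integer bit operations. This is an illustration rather than the code of Appendix A: it assumes each image row is stored as one integer whose most significant bit is the leftmost pixel, and the helper names `p_skeleton`, `parse`, and `render` are hypothetical.

```python
def p_skeleton(rows):
    """rows: list of ints, one per image row, most significant bit = leftmost pixel."""
    skeleton = []
    previous = 0  # the row above the first row is treated as empty
    for row in rows:
        shifted = previous >> 1         # SHIFT: the row above, moved one pixel right
        overlap = row & shifted         # AND with the original row
        skeleton.append(row ^ overlap)  # XOR with the original row
        previous = row
    return skeleton

def parse(lines):
    """Turn strings such as "01110" into row integers."""
    return [int(line, 2) for line in lines]

def render(rows, width):
    """Turn row integers back into fixed-width 0/1 strings."""
    return [format(row, f"0{width}b") for row in rows]

block = parse(["00000", "01110", "01110", "01110", "00000"])
print(render(p_skeleton(block), 5))
# prints ['00000', '01110', '01000', '01000', '00000']
```

For the 3x3 block in the example, only the top row and the left column survive, which matches the statement that the pixels on the upper-left of the original image are retained. Each row costs three integer operations regardless of its width, which is why the approach suits devices with limited resources.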

Formally, the process of P-Skeleton extraction for a Nôm character can be expressed by the mathematical formula 3.1, where $P$ is the original image and $Q$ is its P-Skeleton:

\[ Q = P \oplus \bigl( P \wedge \mathrm{SHIFT}(P) \bigr) \qquad (3.1) \]

The P-Skeleton generated by this formula still retains the shape characteristics and the size of the Nôm character. Moreover, these logical operators are very simple, so extracting the P-Skeleton saves a lot of time. This is an advantage of the method, since it can be applied in systems where resources are limited, such as handheld devices and mobile phones.

The code strings of Nôm characters are extracted from the p-skeleton images by using projection operations. The horizontal and vertical projection histograms are generated to obtain two code strings $C_h$ and $C_v$. Therefore, the operation in the above rules can be decomposed into two sub-operations, one for the horizontal and one for the vertical p-skeleton.

Vertical p-skeleton: Consider pixel $P_{i,j}$ in a character image. Pixel $Q^{v}_{i,j}$ of the vertical p-skeleton can be generated by modifying the operation in equation 3.1 to an XOR between $P_{i,j}$ and a neighbouring pixel, for $0 \le i < I_w$, $0 \le j < I_h$, where $I_w$ is the image width and $I_h$ is the image height.

Horizontal p-skeleton: Similar to the vertical p-skeleton, pixel $Q^{h}_{i,j}$ of the horizontal p-skeleton image is generated by the analogous operation.

3.2.3 Encoding the P-Skeleton Features

The last step of feature extraction is encoding the P-Skeleton images. After being extracted, each P-Skeleton is projected again in the direction in which it was generated. The corresponding histograms $H_h$ and $H_v$ are then encoded into a string over the three symbols L, M, and S. This string is called the code string and represents the P-Skeleton feature. The encoding process is described briefly as follows.

Consider the projection pixel counts in the histogram $H$ of the horizontal or vertical p-skeleton image, and let $\theta$ be the location of the maximal value in $H$, i.e. $H_\theta = \max_i H_i$. A segment $G$ is generated by iteratively extending it with the neighbors of $\theta$. Initially, point $\theta$ is assigned to segment $G$, which then has length 1. The left neighbor $\theta_l$ is added to segment $G$ if the histogram value at $\theta_l$ is larger than the value at the right neighbor $\theta_r$, i.e. $H_{\theta_l} > H_{\theta_r}$. Otherwise, if $H_{\theta_l} < H_{\theta_r}$, the right neighbor is added to segment $G$. The extending process terminates when the length of the segment is equal to a given threshold.
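The segment-growing rule can be sketched as follows. Several details are assumptions of this example because the text leaves them open: ties between the two neighbors go to the right, a neighbor outside the histogram never wins, and the names `column_histogram` and `grow_segment` are hypothetical. How a finished segment is mapped to L, M, or S is not covered.

```python
def column_histogram(image):
    """Count the set pixels in each column of a binary image (list of 0/1 rows)."""
    return [sum(column) for column in zip(*image)]

def grow_segment(hist, threshold):
    """Grow a segment around the histogram peak until it spans `threshold` bins.

    Returns the inclusive (left, right) bin indices of the segment.
    """
    peak = max(range(len(hist)), key=lambda i: hist[i])
    left = right = peak
    limit = min(threshold, len(hist))
    while right - left + 1 < limit:
        left_value = hist[left - 1] if left > 0 else -1            # -1: no neighbor
        right_value = hist[right + 1] if right + 1 < len(hist) else -1
        if left_value > right_value:
            left -= 1
        else:  # ties go to the right (assumption)
            right += 1
    return left, right

print(grow_segment([1, 3, 7, 5, 2], 3))
# prints (1, 3)
```

In the example the peak is at index 2; the right neighbor (5) beats the left neighbor (3) first, then the left neighbor (3) beats the remaining right neighbor (2), giving the segment from index 1 to index 3.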

3.2.4 Pre-filtering using statistical properties

In addition to the two code strings $C_h$ and $C_v$, three statistical features are designed for the pre-filtering process. These features $f_1$, $f_2$, $f_3$ are generated according to the following formulas:

\[ f_1 \ (\text{the relative pixel count of the p-skeleton}) = \sum_{i \in H_v} H_i / I_w + \sum_{i \in H_h} H_i / I_h \]

\[ f_2 \ (\text{the weighted value of code string } C_h) = 4 \times |\text{the count of code L in } C_h| + 2 \times |\text{the count of code M in } C_h| + |\text{the count of code S in } C_h| = \sum_{G \in C_h} w(G) \]

\[ f_3 \ (\text{the weighted value of code string } C_v) = 4 \times |\text{the count of code L in } C_v| + 2 \times |\text{the count of code M in } C_v| + |\text{the count of code S in } C_v| = \sum_{G \in C_v} w(G) \]

where $I_w$ is the image width and $I_h$ is the image height. In effect, features $f_1$, $f_2$, and $f_3$ are generated from the statistics of the code strings.
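The three pre-filtering features can be sketched directly from the formulas above. The weights 4, 2, and 1 for L, M, and S come from the formulas; the function names, and the choice of which histogram is divided by the width and which by the height, are assumptions of this example.

```python
WEIGHTS = {"L": 4, "M": 2, "S": 1}  # w(G) for each code symbol

def weighted_value(code_string):
    """f2 or f3: the sum of the symbol weights over a code string."""
    return sum(WEIGHTS[symbol] for symbol in code_string)

def relative_pixel_count(hist_v, hist_h, width, height):
    """f1: histogram totals normalised by the image width and height (assumed pairing)."""
    return sum(hist_v) / width + sum(hist_h) / height

print(weighted_value("LLMS"))                            # 4 + 4 + 2 + 1
print(relative_pixel_count([1, 2, 3], [2, 4], 3, 2))     # 6/3 + 6/2
```

Because these are three plain numbers per character, candidates whose values differ strongly from those of the input can be discarded cheaply before the code strings themselves are compared.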

As Figure 3.7 shows, the full p-skeleton feature of a Nôm character includes the vertical and horizontal code strings and the three pre-filtering properties. Here, the code
