Optical character recognition using neural networks

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

THEODOR CONSTANTINESCU

OPTICAL CHARACTER RECOGNITION USING NEURAL NETWORKS

MASTER OF SCIENCE THESIS
MAJOR: INFORMATION PROCESSING AND COMMUNICATION
SCIENTIFIC SUPERVISOR: NGUYEN LINH GIANG

HANOI, 2009

Contents

I Introduction
II Pattern recognition
III Optical character recognition (OCR)
IV Neural networks
V The program
VI Conclusions

I Introduction

The difficulty of the dialogue between man and machine comes on the one hand from the flexibility and variety of the modes of interaction that we are able to use (gesture, speech, writing, etc.) and on the other hand from the rigidity of those classically offered by computer systems. Part of current IT research is therefore the design of applications best suited to the different forms of communication commonly used by people, that is, providing computer systems with features for handling the information that humans themselves manipulate every day. In general the information to process is very rich: it can be text, tables, images, words, sounds, writing and gestures. In this paper I treat the case of writing, more precisely printed character recognition.

Depending on the application and on the personal context, the way this information is represented and transmitted varies widely; just consider, for example, the variety of writing styles that exists between different languages, and even within the same language. Moreover, because of the sensitivity of the sensors and of the media used to acquire and transmit it, the information to be processed often differs from the original: it is affected by inaccuracies that are either intrinsic to the phenomena observed or related to the way they are transmitted. Its treatment therefore requires complex analysis and decision systems. This complexity is a major limiting factor for the wide dissemination of such systems, and it remains so despite the growth of computing power and the improvement of processing systems, since research is directed at the same time towards harder and harder tasks and towards the integration of these applications into cheaper, and therefore lower-capacity, mobile systems.

Optical character recognition is the process through which a program converts the image of a character (usually acquired by a scanner) into the code associated with that character, thus enabling the computer to "understand" a character which until then was just a cluster of pixels. It turns the image of a character (or of a string of characters, i.e. text) into selectable strings of text that you can copy as you would in any other computer-generated document. In its modern form it is a form of artificial-intelligence pattern recognition.

OCR is the most effective method available for transferring information from a classical medium (usually paper) to an electronic one. The alternative would be a human reading the characters in the image and typing them into a text editor, which is obviously a stupid, Neanderthal approach when we possess computers with enough power to do this mind-numbing task. The only thing we need is the right OCR software.
Before OCR can be used, the source material must be scanned using an optical scanner (and sometimes a specialized circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software to recognize the images is also required. The OCR software then processes these scans to differentiate between images and text and to determine what letters are represented in the light and dark areas.

The approach of older OCR programs was still a primitive one: it was simply to compare the characters to be recognized with sample characters stored in a database. Imagine the number of comparisons, considering how many different fonts exist. Modern OCR software uses complex neural-network-based systems to obtain better results, with much more exact identification, actually close to 100%. Today's OCR engines add the multiple algorithms of neural network technology to analyze the stroke edge, the line of discontinuity between the text characters and the background. Allowing for irregularities of printed ink on paper, each algorithm averages the light and dark along the side of a stroke, matches it to known characters and makes a best guess as to which character it is. The OCR software then averages or polls the results from all the algorithms to obtain a single reading.

Advances have made OCR more reliable; expect a minimum of 90% accuracy for average-quality documents. Despite vendor claims of one-button scanning, achieving 99% or greater accuracy takes clean copy, practice in setting scanner parameters, and "training" the OCR software with your documents.

The first step toward better recognition begins with the scanner. The quality of its charge-coupled device light arrays will affect OCR results: the more tightly packed these arrays, the finer the image and the more distinct colors the scanner can detect. Smudges or background color can fool the recognition software. Adjusting the scan's resolution can help refine the image and improve the recognition rate, but there are tradeoffs. For example, in an image scanned in 24-bit color at 1,200 dots per inch (dpi), each of the 1,200 pixels per inch carries 24 bits' worth of color information. Such a scan takes longer than a lower-resolution scan and produces a larger file, but OCR accuracy will likely be high. A scan at 72 dpi will be faster and produce a smaller file, good for posting an image of the text to the Web, but the lower resolution will likely degrade OCR accuracy. Most scanners are optimized for 300 dpi, but scanning at a higher number of dots per inch will increase accuracy for very small type. Bilevel (black and white only) scans are the rule for text documents: they are faster and produce smaller files, because unlike 24-bit color scans they require only one bit per pixel. Some scanners also let you determine how subtle to make the color differentiation.
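To make the tradeoff concrete, the following MATLAB sketch works out the uncompressed file sizes for the two scan settings discussed above. The 8.5 x 11 inch page size is an assumed example value, not a figure taken from the text.

    % Rough uncompressed scan sizes for the two settings discussed above.
    page_w = 8.5;  page_h = 11;                    % page size in inches (assumed example)
    px_color = (page_w*1200) * (page_h*1200);      % pixel count at 1,200 dpi
    bytes_color = px_color * 24/8;                 % 24 bits per pixel -> roughly 385 MB
    px_bilevel = (page_w*300) * (page_h*300);      % pixel count at 300 dpi
    bytes_bilevel = px_bilevel * 1/8;              % 1 bit per pixel -> roughly 1 MB
    fprintf('color: %.0f MB, bilevel: %.1f MB\n', bytes_color/2^20, bytes_bilevel/2^20);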
The accurate recognition of Latin-based typewritten text is now considered largely a solved problem. Typical accuracy rates exceed 99%, although certain applications demanding even higher accuracy require human review for errors. Other areas, including the recognition of cursive handwriting and of printed text in other scripts (especially those with a very large number of characters), are still the subject of active research. Today OCR software can recognize a wide variety of fonts, but handwriting and script fonts that mimic handwriting are still problematic. Developers are taking different approaches to improve script and handwriting recognition; OCR software from ExperVision Inc., for instance, first identifies the font and then runs its character-recognition algorithms. Which method will be more effective depends on the image being scanned: a bilevel scan of a shopworn page may yield more legible text, but if the image to be scanned has text in a range of colors, as in a brochure, text in lighter colors may drop out.

On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years, among them the input devices for personal digital assistants such as those running Palm OS. The algorithms used in these devices take advantage of the fact that the order, speed and direction of the individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.

Whereas commercial and even open-source OCR software performs well on, let us say, usual images, a particularly difficult problem for computers and humans alike is that of old religious registers of baptisms and marriages, which contain mainly names: the pages can be damaged by weather, water or fire, and the names can be obsolete or written in former spellings.

Character recognition has been an active area of computer science research since the late 1950s. Initially it was thought to be an easy problem, but it turned out to be a much more interesting one. It will take many decades for computers to read any document with the same precision as human beings. All the commercial software is quite complex; my aim was to create a simple and reliable program that performs the same tasks.

II Pattern recognition

Pattern recognition is a major area of computing in which research is particularly active. A very large number of applications may require a recognition module in processing systems designed to automate certain tasks for humans. Among them, handwriting recognition systems are a difficult issue to handle, as they concentrate by themselves much of the difficulty encountered in pattern recognition. In this chapter I give a general presentation of the main pattern recognition techniques.

Pattern recognition is the set of methods and techniques with which we can achieve a classification within a set of objects, processes or phenomena. This is accomplished by comparison with models: a set of models (prototypes), one for each class, is stored in the memory of the computer, and the new, unknown input (not yet classified) is compared in turn with each prototype and assigned to one of the classes on the basis of a selection criterion: if the unknown input best matches the prototype of class i, then it belongs to class i. The difficulties that arise are related to the selection of a representative model that best characterizes a form class, as well as to the definition of an appropriate selection criterion able to classify each unknown form unambiguously.
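As an illustration of this prototype-comparison scheme, the following MATLAB sketch classifies an unknown input by the nearest stored prototype. The function name and the use of squared Euclidean distance as the selection criterion are choices made for the example only; they are not details of the thesis program.

    % Minimal nearest-prototype classifier (illustrative sketch).
    % prototypes: M x D matrix, one prototype (model) vector per class
    % x:          1 x D unknown input to be classified
    function class = nearest_prototype(x, prototypes)
        diffs = bsxfun(@minus, prototypes, x);   % difference to every stored prototype
        dists = sum(diffs.^2, 2);                % squared Euclidean distance to each class
        [dummy, class] = min(dists);             % selection criterion: the closest prototype wins
    end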
Pattern recognition techniques can be divided into two main groups, generative and discriminative, and there have been long-standing debates on generative versus discriminative methods. Discriminative methods aim to minimize a utility function (e.g. the classification error) and do not need to model, represent or "understand" the pattern explicitly. For example, nowadays we have very effective discriminative methods that can detect 99.99% of the faces in real images with few false alarms, yet such detectors do not "know" explicitly that a face has two eyes. Discriminative methods often need large training data, say 100,000 labeled examples, and can hardly be generalized. We should use them if we know for sure that recognition is all we need in an application, i.e. we do not expect to generalize the algorithm to a much broader scope or to other utility functions. In comparison, generative methods try to build models of the underlying patterns, and these models can be learned, adapted and generalized with small data.

BAYESIAN INFERENCE

The logical approach to calculating or revising the probability of a hypothesis is called Bayesian inference. It is governed by the classic rules of probability combination, from which Bayes' theorem derives. In the Bayesian perspective, probability is not interpreted as the limit of a frequency, but rather as the numerical translation of a state of knowledge (the degree of confidence in a hypothesis). Bayesian inference is based on the manipulation of probabilistic statements and is particularly useful in problems of induction. Bayesian methods differ from the standard methods by the systematic application of formal rules for transforming probabilities. Before describing these rules, let us fix the notation: P(A) denotes the probability of a proposition A, and P(A | B) the probability of A given B.

The rules of probability

There are only two rules for combining probabilities, and on them the whole of Bayesian analysis is built: the addition rule and the multiplication rule.

The addition rule:        P(A or B) = P(A) + P(B) - P(A and B)

The multiplication rule:  P(A and B) = P(A | B) P(B) = P(B | A) P(A)

Bayes' theorem can be derived simply by taking advantage of the symmetry of the multiplication rule:

P(B | A) = P(A | B) P(B) / P(A)

This means that if one knows the consequences of a cause, observing the effects allows one to trace back to the causes.

Evidence notation

In practice, when a probability is very close to 0 or 1, elements that are themselves considered very improbable would have to be observed before the probability changes. Evidence is defined as the logarithm of the odds; for clarity we often work in decibels (dB), with the equivalence

Ev(H) = 10 log10 [ P(H) / (1 - P(H)) ] dB

An evidence of -40 dB corresponds to a probability of 10^-4, and so on. Ev stands for weight of evidence.
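As a quick check of this definition (a sketch, not part of the original text), the following MATLAB lines convert the probability from the example above into decibels of evidence and back:

    % Evidence in decibels for a given probability, Ev = 10*log10(p/(1-p)).
    p  = 1e-4;
    Ev = 10*log10(p/(1 - p));       % about -40 dB, as stated above
    r  = 10^(Ev/10);                % odds recovered from the evidence
    p_back = r/(1 + r);             % back to the probability, about 1e-4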
Comparison with classical statistics

The difference between Bayesian inference and classical statistics is that:
• Bayesian methods use impersonal methods to update personal probabilities, known as subjective probabilities (probability is always subjective when one analyzes its foundations);
• statistical methods use personal methods in order to treat impersonal frequencies.

The Bayesian and exact conditional approaches to the analysis of binary data are very different, both in philosophy and in implementation. Bayesian inference is based on the posterior distributions of quantities of interest, such as probabilities or parameters of logistic models. Exact conditional inference is based on the discrete distributions of estimators or test statistics, conditional on certain other statistics taking their observed values. The Bayesians thus choose to model their expectations at the beginning of the process (nevertheless revising this first assumption in the light of subsequent observations), while classical statisticians fix a priori an arbitrary method and assumption and do not treat the data until after that. Because they do not require a fixed prior hypothesis, Bayesian methods have paved the way for automatic data mining: there is no longer any need for prior human intuition to generate hypotheses before the work can start.

When should we use one or the other? The two approaches are complementary: the classical statistic is generally better when information is abundant and cheap to collect, the Bayesian one where it is scarce and/or costly to collect. In the case of abundant data the results are asymptotically the same for both methods, the Bayesian calculation being simply more costly; in contrast, the Bayesian approach can handle cases where statistics would not have enough data to apply the limit theorems. Actually, Altham discovered in 1969 a remarkable result relating the two forms of inference for the analysis of a 2 x 2 contingency table, a result that is hard to generalize to more complex examples. The Bayesian psi-test (which is used to determine the plausibility of a distribution compared to the observations) converges asymptotically to the χ² test of classical statistics as the number of observations becomes large, and the seemingly arbitrary choice of a Euclidean distance in the χ² is perfectly justified a posteriori by the Bayesian reasoning.

Example: from which bowl is the cookie?

To illustrate, suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, and likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?

Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let H1 correspond to bowl #1 and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls we know that P(E | H1) = 30 / 40 = 0.75 and P(E | H2) = 20 / 40 = 0.5. Bayes' formula then yields

P(H1 | E) = P(E | H1) P(H1) / [ P(E | H1) P(H1) + P(E | H2) P(H2) ] = 0.375 / 0.625 = 0.6

Before we observed the cookie, the probability we assigned to Fred having chosen bowl #1 was the prior probability, P(H1), which was 0.5. After observing the cookie, we must revise the probability to P(H1 | E), which is 0.6.
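The arithmetic of the cookie example can be checked with a few MATLAB lines; this is only a numerical restatement of the figures given in the text.

    % Numerical check of the cookie example.
    pH1 = 0.5;  pH2 = 0.5;                          % priors: the two bowls are equally likely
    pE_H1 = 30/40;  pE_H2 = 20/40;                  % probability of a plain cookie from each bowl
    pH1_E = pE_H1*pH1 / (pE_H1*pH1 + pE_H2*pH2);    % Bayes' theorem -> 0.6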
HIDDEN MARKOV MODEL

Hidden Markov models are a promising approach in various application areas where one has to deal with quantified data that can be partially wrong, for example the recognition of images (characters, fingerprints, the search for patterns and sequences in genes, etc.).

The data production model

A hidden Markov chain is an automaton with M states, which we denote 1, ..., M; the state of the automaton at moment t is denoted q_t. The probability of transition from a state m to a state n is given and is called a(m,n); for every state m we have

sum over n of a(m,n) = 1

We also have d(m), the probability that the automaton is in state m at the initial moment, and obviously

sum over m of d(m) = 1

When the automaton passes through the state m it emits a piece of information y_t that can take N values. The probability that the automaton emits the signal n when it is in state m will be denoted b(m,n), and we have

sum over n of b(m,n) = 1

The word "hidden" used to characterize the model reflects the fact that the emission from a given state is random. This random nature of the measurements, added to the properties of Markov processes, gives this approach its flexibility and strength. A variant of this approach has found renewed interest in the field of error-correcting codes for digital transmission, where it is used in turbocodes.

[...]

Levenberg-Marquardt backpropagation (trainlm): the memory requirements were far too high for my computer, and the same holds for trainoss, so it was impossible to experiment with these two algorithms.

The learning functions used by the networks (accompanied by the MATLAB figures showing the training process for the networks which use each function) are the following.

learnlv2

learnlv2 is the LVQ2 weight learning function. It implements Learning Vector Quantization 2.1, which works as follows: for each presentation, if the winning neuron i should not have won, the runner-up j should have won, and the distance di between the winning neuron and the input p is roughly equal to the distance dj from the runner-up neuron to the input p according to the given window,

min(di/dj, dj/di) > (1-window)/(1+window)

then move the winning neuron i weights away from the input vector and move the runner-up neuron j weights toward the input according to:

dw(i,:) = - lp.lr*(p'-w(i,:))
dw(j,:) = + lp.lr*(p'-w(j,:))

(Training figures for net_1, net_2, net_3 and net_4.)

learnsom

learnsom is the self-organizing map weight learning function. It calculates the weight change dW for a given neuron from the neuron's input P, activation A2 and learning rate LR:

dw = lr*a2*(p'-w)

where the activation A2 is found from the layer output A, the neuron distances D and the current neighborhood size ND:

a2(i,q) = 1,   if a(i,q) = 1
        = 0.5, if a(j,q) = 1 and D(i,j) <= nd
        = 0,   otherwise
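To illustrate the learnsom rule quoted above, here is a small stand-alone MATLAB sketch of a single SOM-style weight update. The weights, input and activations are made-up example values, and the code is an illustration of the rule, not the Neural Network Toolbox implementation.

    % One SOM-style weight update, dw = lr*a2*(p'-w), applied row by row.
    lr = 0.1;                        % learning rate (arbitrary example value)
    w  = [0.2 0.4; 0.7 0.1];         % 2 neurons x 2 inputs, made-up weights
    p  = [0.5; 0.3];                 % one input vector
    a2 = [1; 0.5];                   % activation: 1 for the winner, 0.5 for a neighbour
    dw = lr * repmat(a2, 1, 2) .* (repmat(p', 2, 1) - w);
    w  = w + dw;                     % the winner (and, half as much, its neighbour) move toward p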
