Optical character recognition using neural networks
Introduction
The difficulty of the dialogue between man and machine comes, on the one hand, from the flexibility and variety of the modes of interaction we are able to use: gesture, speech, writing, and so on, and, on the other hand, from the rigidity of those classically offered by computer systems. Part of current research in IT is therefore the design of applications best suited to the forms of communication commonly used by people. The aim is to provide computer systems with facilities for handling the kinds of information that humans manipulate themselves every day.
In general, the information to process is very rich: it can be text, tables, images, words, sounds, writing, or gestures. In this paper I treat the case of writing, and more precisely printed character recognition. Depending on the application and on personal habits, the way this information is represented and transmitted varies widely; just consider, for example, the variety of writing styles that exists between different languages, and even within the same language. Moreover, because of the sensitivity of the sensors and of the media used for acquisition and transmission, the information to be processed often differs from the original. It is therefore affected by inaccuracies, either intrinsic to the phenomena themselves or introduced along the transmission path. Its treatment requires the implementation of complex analysis and decision systems, and this complexity is a major limiting factor to the spread of such informational tools. This remains true despite the growth of computing power and the improvement of processing systems, since research is directed at the same time towards the resolution of harder and harder tasks and towards the integration of these applications into cheaper, and therefore lower-capacity, mobile systems.
Optical character recognition is the process through which a program converts the image of a character (usually acquired by a scanner) into the code associated with that character, thus enabling the computer to "understand" the character, which until then was just a cluster of pixels. It turns the image of a character (or of a string of characters, i.e., text) into selectable text that can be copied like any other computer-generated document. In its modern form, it is a kind of artificial-intelligence pattern recognition.
OCR is the most effective method available for transferring information from a classical medium (usually paper) to an electronic one. The alternative would be a human reading the characters in the image and typing them into a text editor, a tedious and error-prone approach when computers powerful enough for this mind-numbing task are available. The only thing needed is the right OCR software.
Before OCR can be used, the source material must be scanned using an optical scanner (and sometimes a specialized circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software to recognize the images is also required. The OCR software then processes these scans to differentiate between images and text and to determine what letters are represented in the light and dark areas.
The approach in older OCR programs was crude: it was simply to compare the characters to be recognized with sample characters stored in a database. Imagine the number of comparisons required, considering how many different fonts exist. Modern OCR software uses complex neural-network-based systems to obtain better results, with much more exact identification, actually close to 100%.
Today's OCR engines add multiple algorithms of neural network technology to analyze the stroke edge, the line of discontinuity between the text characters and the background. Allowing for irregularities of printed ink on paper, each algorithm averages the light and dark along the side of a stroke, matches it to known characters, and makes a best guess as to which character it is. The OCR software then averages or polls the results from all the algorithms to obtain a single reading.
Advances have made OCR more reliable; expect a minimum of 90% accuracy for average-quality documents. Despite vendor claims of one-button scanning, achieving 99% or greater accuracy takes clean copy and practice setting scanner parameters, and requires you to "train" the OCR software with your documents.
The first step toward better recognition begins with the scanner. The quality of its charge-coupled device light arrays will affect OCR results: the more tightly packed these arrays, the finer the image and the more distinct colors the scanner can detect.
Smudges or background color can fool the recognition software. Adjusting the scan's resolution can help refine the image and improve the recognition rate, but there are trade-offs.
For example, in an image scanned at 24-bit color with 1,200 dots per inch (dpi), each of the 1,200 pixels per inch carries 24 bits' worth of color information. This scan will take longer than a lower-resolution scan and produce a larger file, but OCR accuracy will likely be high. A scan at 72 dpi will be faster and produce a smaller file, which is good for posting an image of the text to the Web, but the lower resolution will likely degrade OCR accuracy. Most scanners are optimized for 300 dpi, but scanning at a higher number of dots per inch will increase accuracy for type under 6 points in size.
Bilevel (black and white only) scans are the rule for text documents. Bilevel scans are faster and produce smaller files, because unlike 24-bit color scans they require only one bit per pixel. Some scanners also let you determine how subtle to make the color differentiation.
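As a rough worked example, assuming an uncompressed scan of a standard 8.5 in x 11 in page (the page size is an assumption for illustration only):
\[
300\ \mathrm{dpi}:\ (8.5 \times 300) \times (11 \times 300) \approx 8.4 \times 10^{6}\ \text{pixels}
\]
\[
\text{bilevel: } 8.4 \times 10^{6} \times 1\ \text{bit} \approx 1\ \text{MB}, \qquad
\text{24-bit color: } 8.4 \times 10^{6} \times 24\ \text{bits} \approx 25\ \text{MB}
\]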
The accurate recognition of Latin-based typewritten text is now considered largely a solved problem. Typical accuracy rates exceed 99%, although certain applications demanding even higher accuracy require human review for errors. Other areas, including recognition of cursive handwriting and of printed text in other scripts (especially those with a very large number of characters), are still the subject of active research.
Today, OCR software can recognize a wide variety of fonts, but handwriting and script fonts that mimic handwriting are still problematic. Developers are taking different approaches to improve script and handwriting recognition. OCR software from ExperVision Inc., for example, first identifies the font and then runs its character-recognition algorithms.
Which method will be more effective depends on the image being scanned. A bilevel scan of a shopworn page may yield more legible text, but if the image to be scanned has text in a range of colors, as in a brochure, text in lighter colors may drop out.
On-line systems for recognizing hand-printed text on the fly have become well-known as commercial products in recent years. Among these are the input devices for personal digital assistants such as those running Palm OS. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.
Whereas commercial and even open-source OCR software performs well on ordinary images, a particularly difficult problem, for computers and humans alike, is that of old religious registers of baptisms and marriages. These registers mainly contain names, their pages can be damaged by weather, water, or fire, and the names can be obsolete or written in archaic spellings.
Pattern recognition
Pattern recognition is a major area of computing in which research is particularly active. There is a very large number of applications that may require a recognition module within processing systems designed to automate certain tasks for humans. Among them, handwriting recognition systems are a difficult case to handle, as they concentrate on their own much of the difficulty encountered in pattern recognition. In this chapter I give a general presentation of the main pattern recognition techniques.
Pattern recognition is the set of methods and techniques with which we can achieve a classification within a set of objects, processes, or phenomena. This is accomplished by comparison with models: a set of models (prototypes), one for each class, is stored in the memory of the computer. The new, unknown input (not yet classified) is compared in turn with each prototype and assigned to one of the classes on the basis of a selection criterion: if the unknown matches prototype "i" best, then it belongs to class "i". The difficulties that arise are related to the selection of a representative model that best characterizes a class of forms, as well as to the definition of an appropriate selection criterion, able to classify each unknown form unambiguously.
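As a minimal illustration of this prototype-comparison idea (a toy sketch with made-up feature values and class count, not the method used later in this thesis):

% Toy nearest-prototype classifier: one stored model (column) per class.
prototypes = [0.1 0.9; 0.2 0.8; 0.0 1.0];   % 3 features, 2 classes (made-up values)
unknown    = [0.15; 0.25; 0.05];            % feature vector of the form to classify

% Selection criterion: Euclidean distance to every stored prototype.
d = sqrt(sum((prototypes - repmat(unknown, 1, size(prototypes, 2))).^2, 1));

% The unknown form is assigned to the class "i" of the closest prototype.
[~, i] = min(d);
fprintf('The unknown form belongs to class %d\n', i);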
Pattern recognition techniques can be divided into two main groups, generative and discriminative, and there have been long-standing debates on generative versus discriminative methods. Discriminative methods aim to minimize a utility function (e.g., the classification error) and do not need to model, represent, or "understand" the pattern explicitly. For example, today we have very effective discriminative face detectors: they can detect 99.99% of the faces in real images with few false alarms, yet such detectors do not "know" explicitly that a face has two eyes. Discriminative methods often need large training sets, say 100,000 labeled examples, and can hardly be generalized. We should use them when we know for sure that recognition is all we need in an application, i.e., when we do not expect to generalize the algorithm to a much broader scope or to other utility functions. In comparison, generative methods try to build models of the underlying patterns, and can be learned, adapted, and generalized with little data.
The logical approach for calculating or revising the probability of a hypothesis is called Bayesian inference. It is governed by the classic rules of probability combination, from which the Bayes theorem derives. In the Bayesian perspective, probability is not interpreted as the limit of a frequency, but rather as the numerical translation of a state of knowledge (the degree of confidence in a hypothesis). Bayesian inference is based on the manipulation of probabilistic statements, and it is particularly useful for problems of induction. Bayesian methods differ from the standard methods by the systematic application of formal rules for transforming probabilities. Before describing these rules, let us review the notation used.
There are only two rules for combining probabilities, and on them the whole theory of Bayesian analysis is built: the addition rule and the multiplication rule.
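In standard notation, for two propositions A and B and background information X, these rules read (written here as a standard reconstruction of the formulas this passage refers to):
\[
P(A + B \mid X) = P(A \mid X) + P(B \mid X) - P(AB \mid X)
\]
\[
P(AB \mid X) = P(A \mid BX)\,P(B \mid X) = P(B \mid AX)\,P(A \mid X)
\]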
The Bayes theorem can be derived simply by taking advantage of the symmetry of the multiplication rule.
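Equating the two factorizations of P(AB | X) given by the multiplication rule and dividing by P(B | X) gives:
\[
P(A \mid BX) = \frac{P(B \mid AX)\,P(A \mid X)}{P(B \mid X)}
\]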
This means that if one knows the consequences of a cause, observing the effects allows one to trace back to the causes.
In practice, when a probability is very close to 0 or 1, elements considered in themselves very improbable must be observed for the probability to change. For clarity, we often work with the evidence expressed in decibels (dB), with the following equivalence:
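In the usual form, the evidence of a hypothesis H is its log-odds expressed in decibels (a standard reconstruction of the definition referred to here):
\[
\mathrm{Ev}(H) = 10 \log_{10} \frac{P(H)}{1 - P(H)} \ \text{dB}
\]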
An evidence of -40 dB corresponds to a probability of 10⁻⁴, and so on. Ev stands for the weight of evidence.
The difference between Bayesian inference and classical statistics is that:
• Bayesian methods use impersonal methods to update a personal, so-called subjective, probability (probability is always subjective when one analyzes its foundations);
• statistical methods use personal methods in order to treat impersonal frequencies.
The Bayesian and exact conditional approaches to the analysis of binary data are very different, both in philosophy and in implementation. Bayesian inference is based on the posterior distributions of quantities of interest, such as probabilities or parameters of logistic models. Exact conditional inference is based on the discrete distributions of estimators or test statistics, conditional on certain other statistics taking their observed values.
The Bayesians thus choose to model their expectations at the beginning of the process (nevertheless revising this first assumption in light of the subsequent observations), while classical statisticians fix a priori an arbitrary method and assumption and only then treat the data. Bayesian methods, because they do not require a fixed prior hypothesis, have paved the way for automatic data mining: there is no longer any need for prior human intuition to generate hypotheses before the work can start. When should we use one or the other? The two approaches are complementary: classical statistics is generally better when information is abundant and cheap to collect, the Bayesian approach when information is scarce and/or costly to collect. When data are abundant, the results are asymptotically the same for both methods, the Bayesian calculation simply being more expensive. In contrast, the Bayesian approach can handle cases where statistics would not have enough data to apply the limit theorems.
Indeed, Altham discovered in 1969 a remarkable result relating the two forms of inference for the analysis of a 2 x 2 contingency table. This result is hard to generalise to more complex examples.
The Bayesian psi-test (which is used to determine the plausibility of a distribution compared with the observations) asymptotically converges to the χ² test of classical statistics as the number of observations becomes large. The seemingly arbitrary choice of a Euclidean distance in the χ² test is thus perfectly justified a posteriori by the Bayesian reasoning.
Example: From which bowl is the cookie?
To illustrate, suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from the other, and likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes's theorem. Let H1 correspond to bowl #1 and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls, we know that P(E | H1) = 30/40 = 0.75 and P(E | H2) = 20/40 = 0.5. Bayes's formula then yields:
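Written out with the values above:
\[
P(H_1 \mid E) = \frac{P(E \mid H_1)\,P(H_1)}{P(E \mid H_1)\,P(H_1) + P(E \mid H_2)\,P(H_2)}
             = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} = 0.6
\]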
Before we observed the cookie, the probability we assigned to Fred having chosen bowl #1 was the prior probability, P(H1), which was 0.5. After observing the cookie, we must revise the probability to the posterior P(H1 | E), which is 0.6.
Hidden Markov models are a promising approach in various application areas where one has to deal with quantified data that can be partially wrong, for example the recognition of images (characters, fingerprints, search for patterns and sequences in genes, etc.).
A hidden Markov chain is an automaton with M states, which we denote m = 1, ..., M.
We denote by st the state of the automaton at the moment t. The probability of transition from a state m to a state n is given; we call it a(m,n) = P(st+1 = n | st = m).
We also have d(m), the probability that the automaton is in state m at the initial moment: d(m) = P(s0 = m).
When the automaton passes through the state m it emits a piece of information yt that can take N values. The probability that the automaton emits the symbol n when it is in this state m will be noted b(m,n) = P(yt = n | st = m); we have Σn b(m,n) = 1 for every state m, and likewise Σn a(m,n) = 1 and Σm d(m) = 1.
The word "hidden" used to characterize the model reflects the fact that the emission from a given state is random. This random nature of the measures, added to the properties of Markov processes, yields the flexibility and strength of this approach. A variant of this approach has found renewed interest in the field of error-correcting codes in digital transmission, where it is used in turbo codes.
Probability of transition and emission of data in the hidden Markov model
The important property of a Markov process is that the evolution of the automaton after the moment t depends only on the value of the state it is in at this moment and on the commands that are then applied, and not on what it went through before arriving at this state. In other words, the future does not depend on the manner in which the automaton arrived in that state. The M states, the N possible values of the measures, and the probabilities a(m,m'), b(m,n) and d(m) characterize the model. We have to address three problems:
1 Decoding: we have observed Y = [y0, ..., yt, ..., yT] and the parameters [a(m,m'), b(m,n), d(m)] are given. What is the most likely state sequence S = [s0, ..., st, ..., sT] that created it?
2 Evaluation: we have observed a sequence of measures Y = [y0, ..., yt, ..., yT]. What is the probability that the automaton characterized by the parameters [a(m,m'), b(m,n), d(m)] has produced this sequence?
3 Learning: we have observed Y = [y0, ..., yt, ..., yT]. How can we calculate (or rather update) the model's parameters [a(m,m'), b(m,n), d(m)] in order to maximize the probability of the observation?
The following algorithm (the Viterbi algorithm) aims to find the sequence of states most likely to have produced the measured sequence Y = [y0, ..., yt, ..., yT]. At the moment t we calculate recursively, for each state n, the quantity
rt(n) = b(n, yt) · max over m of [ a(m,n) · rt-1(m) ],
the maximum being taken over all possible state sequences S = [s0, ..., st-1], i.e., over the state m occupied at the moment t-1.
Initialization: at the moment t = 0, r0(m) = d(m) · b(m, y0) for each state m.
Recurrence: let us assume that at the moment t-1 we have calculated rt-1(m) for each state m. We then obtain rt(n) from the relation above; by storing, for each state n, the index m that realizes the maximum, the most likely state sequence can be recovered at the end by backtracking from the state that maximizes rT(m).
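A compact MATLAB sketch of this recursion (an illustrative implementation under the notation above, with A(m,n) = a(m,n), B(m,n) = b(m,n), d the initial distribution and y the observed symbol sequence indexed from 1; it is not part of the program described later):

function s = viterbi_sketch(A, B, d, y)
% A(m,n): transition probability from state m to state n
% B(m,n): probability of emitting symbol n from state m
% d(m):   probability of being in state m at the initial moment
% y:      observed sequence of symbol indices, length T
M = length(d);
T = length(y);
r   = zeros(M, T);   % r(m,t): best score of a path ending in state m at time t
psi = zeros(M, T);   % psi(m,t): predecessor state realizing that best score

r(:,1) = d(:) .* B(:, y(1));                 % initialization
for t = 2:T
    for n = 1:M
        [best, m] = max(A(:, n) .* r(:, t-1));
        r(n, t)   = B(n, y(t)) * best;
        psi(n, t) = m;
    end
end

s = zeros(1, T);                             % backtracking
[~, s(T)] = max(r(:, T));
for t = T-1:-1:1
    s(t) = psi(s(t+1), t+1);
end
end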
Optical character recognition (OCR)
The first OCR machine was created in 1929 by Gustav Tauschek, an Austrian engineer.
In 1950, Frank Rowlett, who had broken the Japanese diplomatic code PURPLE, asked David Shepard, a cryptanalyst with AFSA (the predecessor of the NSA), to work with Louis Tordella on devising data-automation procedures for the agency. The question included the problem of converting printed messages into machine language for computer processing. Shepard decided it should be possible to build a machine to do it and, with the help of Harvey Cook, a friend, built "Gismo" in his attic during evenings and weekends.
Shepard then founded Intelligent Machines Research Corporation (IMR), which delivered the first OCR systems in the world to be operated by private companies. The first private system was installed at Reader's Digest in 1955 and, many years later, was donated by Reader's Digest to the Smithsonian, where it was put on display. Other systems sold by IMR during the late 1950s included a billing-slip reader for the Ohio Bell Telephone Company and a scanner for the U.S. Air Force for reading and transmitting typed messages by telex. IBM and others later also made use of the Shepard patents. Since 1965 the United States Postal Service has used OCR machines, whose principle of operation was designed by Jacob Rabinow, a prolific inventor, to sort mail, and Canada Post has used OCR systems since 1971. The OCR systems read the name and address at the first automated sorting center and print a barcode on the envelope. The letters can then be sorted in the following centers by less expensive sorters, which only need to read the barcode. To avoid any interference with the address, which can be anywhere on the letter, a special ink is used that is clearly visible under UV light; this ink appears orange in normal lighting conditions.
The following steps are generally followed in a character recognition system:
1 Encoding: the data must be put into a representation the computer can process; it is usually acquired with a camera or a scanner.
2 Pre-processing: noise cancellation, data size reduction, normalization, recovery of slanted or distorted images, contrast correction, conversion to two colors (black and white, or rather paper and ink), detection of contours, etc.
3 Segmentation, i.e., isolating the lines of text in the image and the characters within the lines. It is also possible to detect underlined text, frames, and images.
4 Learning is building up a classifying model and assigning a class to each element in the training set.
5 Analysis and decision consist in attributing a previously unknown object to one of the previously determined classes. Several families of methods exist:
1 Classification by features: a shape to be recognized is represented by a vector of numerical values, the features, calculated from this shape. The number of features typically ranges from about 100 to 300. If the features are well chosen, a character class is represented by a "cloud" of adjacent points in the feature vector space. The role of the classifier is to determine to which cloud (i.e., to which class of characters) the shape to be recognized most likely belongs. This class of methods includes various types of artificial neural networks trained on databases of sample shapes.
2 Metric methods: the shape to be recognized is compared directly, using distance algorithms, with a series of learned models. This kind of method is rarely used and little valued by researchers, because it gives poor results.
3 Statistical methods: statistical methods such as Bayesian nets and Markov chains are often used, especially in the field of handwriting recognition.
6 Post-processing means the possible validation of the recognition decision, for example the use of linguistic and contextual rules to reduce the number of recognition errors: dictionaries of words, syllables, or ideograms. In industrial systems, specialized techniques for certain kinds of text (names, addresses) can use databases to eliminate incorrect solutions.
OCR used to work only with computer-generated writing: it could only recognize typed letters. Recently, however, handwritten text recognition has also improved enormously, mainly due to the demand for so-called "online" recognition, in which the user writes by hand directly into the device (mobile phone, tablet PC, etc.). The other, "classical" kind of recognition, which works with already complete documents (images), is called "offline" recognition; this is the part of OCR studied in this paper. Another element to consider is that OCR needs to scan the text, and for that the text must be of a reasonably large size. If the font is too small, OCR will not be able to determine the separation between letters and will only see a blob of shapes.
This depends largely on the quality of the scanner, as higher-quality scanners tend to produce considerably sharper images that make optical character recognition easier.
Key elements for a successful OCR system
1 It takes a complementary merging of the input document stream with the processing requirements of the particular application, within a total system concept that provides for convenient entry of exception-type items and an output that allows cost-effective completion of the system. To show a successful example, let us review the early credit-card OCR applications. Input was a carbon-imprinted document; however, if the carbon was wrinkled, the imprinter was misaligned, or any one of a variety of other problems occurred, the imprinted characters were impossible to read accurately.
2 To compensate for this problem, the processing system permitted direct key entry of the failed-to-read items at a fairly high speed. Directly keyed items from the misread documents were placed under intelligent computer control, which put the proper data in the right location of the data record. Important design considerations encouraged the use of modulus-controlled check digits for the embossed credit-card account number; this, coupled with tight monetary controls through batch totals, reduced the chance of read substitutions.
3 The output of these early systems provided a "country club" type of billing; that is, each of the credit-card sales slips was returned to the original purchaser. This gave the credit-card customer the opportunity to review his own purchases and to ensure the final accuracy of the billing, and it has been a very successful operation through the years. Today's systems improve the process by increasing the amount of data to be read, either directly or through reproduction of details on the sales draft; this provides customers with a "descriptive" billing statement which itemizes each transaction. Attention to the details of each application step is a requirement for successful OCR systems.
Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to recognize all handwritten cursive script accurately (at better than 98%).
This has been a subject of intensive research in the last ten years, and significant improvements in the performance of recognition systems have been achieved. Current systems are capable of transcribing handwriting with average recognition rates of 50-99 percent, depending on the constraints imposed (e.g., size of the vocabulary, writer dependence, writing style) and on the experimental conditions. The improvements in performance have been achieved by different means: some researchers have combined different feature sets or used optimized feature sets, and better modeling of the reference patterns and adaptation have also contributed to improving performance. However, one of the most successful approaches to achieving better performance is the combination of classifiers. This approach has been used especially in application domains where the size of the lexicon is small. Combination of classifiers relies on the assumption that different classification approaches have different strengths and weaknesses, which can compensate for each other through the combination. Verification can be considered a particular case of classifier combination. The term verification is encountered in other contexts, but there is no consensus about its meaning. Oliveira defines verification as the post-processing of the results produced by recognizers.
Madhvanath defines verification as the task of deciding whether a pattern belongs to a given class. Cordella defines it as a specialized type of classification devoted to ascertaining in a dependable manner whether an input sample belongs to a given category. Cho defines it as the validation of hypotheses generated by recognizers during the recognition process. In spite of the different definitions, some common points can be identified, and a broader definition of verification could be: a post-processing procedure that takes as input the hypotheses produced by a classifier or recognizer and provides as output a single reliable hypothesis or a rejection of the input pattern. In this paper, the term verification is used to refer to the post-processing of the output of a handwriting recognition system, resulting in rescored word hypotheses.
In handwriting recognition, Takahashi and Griffin were among the earliest to mention the concept of verification, with the goal of enhancing the recognition rate of an OCR algorithm. They designed a character recognition system based on a multilayer perceptron (MLP) which achieves a recognition rate of 94.6% on uppercase characters of the NIST database. Based on an error analysis, verification by a linear tournament of one-to-one verifiers between pairs of categories was proposed, and such a verification scheme increased the recognition rate by 1.2 percent. Britto used a verification stage to enhance the recognition of an HMM-based handwritten numeral string system; the verification stage, composed of 20 numeral HMMs, improved the recognition rate for strings of different lengths by about 10% (from 81.65% to 91.57 percent). Powalka proposed a hybrid recognition system for online handwritten word recognition where letter verification is introduced to improve disambiguation among word hypotheses.
Neural networks
Among the many pattern recognition methods, I chose to study the neural network approach.
A neural network is a powerful tool that can help solve OCR-type problems.
Of course, the selection of appropriate classifiers is essential. The NN is an information-processing paradigm inspired by the way the human brain processes information. Neural networks are collections of mathematical models that represent some of the observed properties of biological nervous systems and draw on the analogy of adaptive biological learning. The key element of a neural network is its topology. Unlike the original Perceptron model, shown by Minsky and Papert to have limited computational capability, the neural networks of today consist of a large number of highly interconnected processing elements (nodes) that are tied together with weighted connections (links). Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons, and this is true for neural networks as well. Learning typically occurs by example, through training or exposure to a set of input/output data (patterns), where the training algorithm adjusts the link weights; the link weights store the knowledge necessary to solve specific problems.
In recent years neural computing has emerged as a practical technology, with successful applications in many fields. The majority of these applications are concerned with problems in pattern recognition and make use of feed-forward network architectures such as the multi-layer perceptron and the radial basis function network. It has also become widely acknowledged that successful applications of neural computing require a principled, rather than ad hoc, approach. From the perspective of pattern recognition, neural networks can be regarded as an extension of the many conventional techniques which have been developed over several decades. Artificial neural networks (as opposed to the biological neural networks after which they were modeled) are made up of interconnected artificial neurons: programming constructs that mimic the properties of biological neurons as simple processing units. Artificial neural networks are used for solving artificial intelligence problems without necessarily creating a model of a real biological system; they simply emulate the capacity of living systems to LEARN and ADAPT, and their main feature is their ability to learn from examples.
In a network, each sub-group of neurons processes its input independently of the others and transmits the result of its analysis to the following sub-groups. The information thus spreads through the network layer by layer, from the input layer to the output layer, passing through zero, one, or more intermediate layers (called hidden layers). It should be noted that, depending on the learning algorithm, it is also possible to propagate information backwards (back propagation). Usually (except for the input and output layers), each neuron in a layer is connected to all neurons of the previous layer and of the next layer.
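Concretely, each neuron j computes a weighted sum of the outputs x_i of the previous layer and passes it through an activation function f (standard notation, not specific to this program):
\[
y_j = f\Big(\sum_i w_{ij}\, x_i + b_j\Big)
\]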
The ANN has the ability to store empirical knowledge and make it available for use. The processing abilities (and thus the knowledge) of the network are stored in the synaptic weights, obtained through processes of adaptation or learning. In this sense, the ANN resembles the brain not only because knowledge is acquired through learning, but also because this knowledge is stored in the connections between the units, that is, in the synaptic weights.
The neurologists Warren McCulloch and Walter Pitts led the early work on neural networks. They formulated a simplified model of the biological neuron, called the formal neuron, and showed theoretically that networks of formal neurons can perform simple logical, arithmetic, and symbolic functions.
The function of formal neural networks, like that of the biological model, is to solve problems. Unlike traditional methods of computational resolution, we do not have to build a program step by step. The most important parameters of this model are the synaptic coefficients: they are what build the resolution model from the information given to the network. We must therefore find a mechanism to calculate them from the data we can acquire for the problem at hand. This is the fundamental principle of learning: in a model of formal neural networks, learning consists first of all in calculating the values of the synaptic coefficients using the available examples.
The work of McCulloch and Pitts gave no indication of a method for adapting the synaptic coefficients. This issue, which is at the heart of the question of learning, received an initial answer through the work of the Canadian physiologist Donald Hebb on learning, described in 1949 in his book The Organization of Behavior. Hebb proposed a simple rule that lets you change the value of the synaptic coefficients depending on the activity of the units they connect. This rule, now known as the "Hebb rule", is present in almost all current models, even the most sophisticated.
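In its simplest form (the usual textbook formulation, with η a learning-rate constant and x_i, x_j the activities of the two connected units), the rule can be written:
\[
\Delta w_{ij} = \eta\, x_i\, x_j
\]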
From this work, the idea was sown, and over time it germinated in the mind of Frank Rosenblatt, who in 1957 proposed the model of the perceptron. This was the first artificial system capable of learning from experience, even when its instructor makes some mistakes.
In 1969, a serious blow was dealt to the scientific community around neural networks: Marvin Minsky and Seymour Papert published a book highlighting some theoretical limitations of the perceptron, notably its inability to handle nonlinear or connectedness problems, and they extended these limitations implicitly to all models of artificial neural networks. Research on neural networks, which then appeared to be at an impasse, lost much of its public funding, and industry turned away from it too. The funds for artificial intelligence were redirected instead towards formal logic, and research marked time for about ten years. However, the solid qualities of certain neural networks in adaptive tasks (e.g., Adaline), which allow them to model phenomena that are themselves evolving, led them to be integrated, more or less explicitly, into the corpus of adaptive systems used in telecommunications and in industrial process control.
In 1982, John Joseph Hopfield, a recognized physicist, gave a new impetus to neural networks with an article introducing a new (fully recurrent) model of neural network. This article was a success for several reasons, the main one being that it gave the theory of neural networks the rigour characteristic of physicists. The neural approach became an acceptable subject of study again, although the Hopfield model suffered from the major limitations of the models of the 1960s, notably the inability to treat nonlinear problems.
At the same time, the algorithmic approaches to artificial intelligence were the subject of disillusionment, since their applications did not meet expectations. This disillusionment motivated a reorientation of artificial intelligence research towards neural networks (although these concern artificial perception rather than artificial intelligence strictly speaking). The search was relaunched and industry regained some interest in the neural approach (especially for applications such as the guidance of cruise missiles). In 1984, the concept of back-propagation of the error gradient was introduced.
A revolution then occurred in the field of artificial neural networks: a new generation of neural networks capable of successfully treating nonlinear phenomena, since the multilayer perceptron does not have the defects highlighted by Marvin Minsky. First proposed by Werbos, the multi-layer perceptron appeared in 1986, introduced by Rumelhart and, simultaneously, under a similar name by Yann Le Cun. These systems rely on back-propagating the gradient of the error in systems with several layers.
Neural networks subsequently developed considerably, and they were among the first systems to benefit from the theory of statistical regularization introduced by Vladimir Vapnik in the Soviet Union and popularized in the West after the fall of the Berlin Wall. This theory, one of the most important in the field of statistics, makes it possible to anticipate, regulate, and investigate the phenomena related to over-learning (overfitting). It allows a learning system to arbitrate between a model that is too poor (e.g., the average) and a model that is too rich to be optimized realistically on too small a number of examples, and which would be ineffective on examples not yet learned, even ones close to the examples already learned. Over-learning is a challenge faced by every system that learns by example, whether it uses direct optimization methods (e.g., linear regression), iterative methods (e.g., gradient descent), or semi-direct iterative methods (conjugate gradient, expectation-maximization), and whether it is applied to conventional statistical models, hidden Markov models, or networks of formal neurons.
As we have seen, the starting point in the history of neural network science can be placed in the early 1940s, when Warren McCulloch and Walter Pitts proposed the first formal model of the neuron (1943), emphasizing its computing capabilities and the possibility of imitating its operating mode with electronic circuits.
In 1949, Hebb, building on Pavlov's research, stated the synaptic permeability adaptation principle, according to which, every time a synaptic connection is used, its permeability increases. On this principle is founded the adaptive altering of the synaptic weights.
In 1957 Rosenblatt developed a hardware-implemented network, called the perceptron, to recognize printed characters.
1950-1960: Widrow and Hoff developed algorithms based on minimizing the error over the training set for one-level networks.
The program
The program suite is composed of processare.m, extract_lines.m, extract_letter.m, prelucrare.m, train_net.m, kiem_tra.m, ocr.m, extragere.m, and the folders that store the training images of the various characters (7 different fonts: Arial, Baskerville, Book Antiqua, Courier, Tempus, Times New Roman, and Verdana, the training set for each font containing the capital letters from A to Z and the numbers from 0 to 9).
1 The first piece, processare.m, performs simple image processing to eliminate noise and make the image of the text clearer for analysis: it turns the image into a binary image, transforming it so that the letters are white and everything else black.
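A minimal sketch of this kind of pre-processing, assuming the Image Processing Toolbox (the file name, filter size, and thresholds are illustrative, not necessarily those used in processare.m):

img = imread('text_page.png');          % hypothetical input file
if size(img, 3) == 3
    img = rgb2gray(img);                % drop color information
end
img = medfilt2(img, [3 3]);             % simple noise removal
bw  = im2bw(img, graythresh(img));      % global (Otsu) threshold
bw  = ~bw;                              % assumes dark text on a light page: letters become white (1)
bw  = bwareaopen(bw, 10);               % remove remaining specks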
2 In order to get to the individual characters, we must first break the text into lines. To that end, we use extract_lines.m. The function extract_lines takes the image of the text and returns the images of each line and the number of lines found. The images of the individual lines are saved in a cell array, lines, with 1 row and a number of columns equal to the number of lines. The first element is initialized with the whole picture. The program then searches for the rows in the image composed only of zeros (remember that everything except the characters is black). The first such row is the beginning of the first separation band (space) and the end of the first character line, and so on. The end of a separation band is determined by testing the following rows for a null pixel sum; the first row which does not fulfil this criterion is, obviously, the first pixel row of the next text line. Thus the initial image is fragmented and the fragments (the lines) are saved. The average character-line and separation-band heights are also calculated. We would not need this if we had only basic English characters, but my program extracts letters from Vietnamese too, which carries tone marks. Because of the separation algorithm, some of the characters might be separated from their tones, so each line's height is compared with these averages in order to determine whether a line is actually a separate line of characters or just tone marks. The lines found to be tones are joined to the corresponding character lines. A new cell array, final_lines, is created and the final lines are stored in it.
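The core of this blank-row search can be sketched as follows (bw is a binary page image with white letters, as above; the variable names are illustrative and the tone-merging step is omitted):

rowsum = sum(bw, 2);                    % ink-pixel count in every row
blank  = (rowsum == 0);                 % rows containing no ink
d      = diff([1; blank(:); 1]);        % transitions between blank bands and text bands
starts = find(d == -1);                 % first row of each text line
stops  = find(d ==  1) - 1;             % last row of each text line
lines  = cell(1, numel(starts));
for k = 1:numel(starts)
    lines{k} = bw(starts(k):stops(k), :);   % cropped image of one text line
end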
3 Once we have the text separated into lines, we apply the extract_letter function to each line to extract the individual letters. This function, found in extract_letter.m, takes in the line image and returns: the cropped image of each separate main character; the cropped image(s) of the tone(s) associated with that character (if any); the number of main characters; a vector storing the maximum label of the small objects (tones) associated with each character; the indexes of the characters after which a space follows; and the indexes of certain characters, like "!" and ".", which, due to their resemblance to other characters, might be confused with them by the neural network.
The main MATLAB function used in this program is bwlabel, in association with find, which identifies and labels all the separate objects in a binary image. From this set of objects, we first take interest in the larger ones, which represent the main characters and which we store in chu{1,a}. Then, for each main object, we check whether there are any other objects within its vertical limits. If there are, they might be tones, but they might also be parts of a main character which are not connected, e.g., the dots in question and exclamation marks. The code determines whether these secondary objects are tones or an integral part of the character: if they are tones, they are stored in symb{a,b}, where a represents the main character and b the tone; if not, the two objects are stored together in chu{1,a}.
There are more things to take care of: detecting the inverted commas, i.e., joining the two apostrophe marks if they are close enough to form the aforementioned character, and determining the characters followed by a space.
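The object-labelling part of this step can be sketched as follows, assuming the Image Processing Toolbox (line_img is one binary line image; the size threshold is illustrative, and the tone and punctuation handling described above is omitted):

L = bwlabel(line_img);                            % label the separate objects
props = regionprops(L, 'BoundingBox', 'Area');
boxes = cat(1, props.BoundingBox);
areas = cat(1, props.Area);
main  = find(areas > 0.2 * max(areas));           % keep the larger objects (illustrative threshold)
[~, order] = sort(boxes(main, 1));                % sort the main characters left to right
main  = main(order);
chu   = cell(1, numel(main));
for a = 1:numel(main)
    chu{1, a} = imcrop(line_img, boxes(main(a), :));   % cropped image of each main character
end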
4 prelucrare.m processes the images in an attempt to keep just the significant pixels in each, so as not to confuse the network with irrelevant information. To this end, it resizes the image to 13 x 10 pixels, after which it applies the thinning morphological transform. This code is used by the program that trains the network, train_net.m, by the program that checks how many characters are recognized by the trained network, kiem_tra.m, and finally by ocr.m itself.
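The resize-and-thin operation can be sketched with standard Image Processing Toolbox calls (letter_img stands for one cropped character image; the 13 x 10 target size is the one stated above):

small = imresize(double(letter_img), [13 10]) > 0.5;   % normalise the character size, keep it binary
thin  = bwmorph(small, 'thin', Inf);                   % thinning morphological transform
vec   = double(thin(:));                               % 13*10 = 130-element column fed to the network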
5 Now we get to the actual recognition part. First, we have to train the network, which is done in train_net.m. I studied feed-forward, cascade-forward, Elman, Hopfield, self-organizing-map and radial basis neural networks. The ones which provided the best results were the cascade-forward, feed-forward, Elman, general regression and probabilistic neural networks, as can be seen in the tables below. The program forms the training batch out of the seven folders containing the training fonts. The value of each pixel in an image is stored in the vector Q, which is in its turn appended column-wise to the matrix P, which represents the batch of training data. The identical letters from the different fonts are introduced in successive columns (groups of 7 columns). Thus the training batch is a 130 x 252 matrix. Obviously, then, we have 130 input neurons (a 13 x 10 pixel image is fed to the network as a 13*10 = 130-row vector), and 46 (the number of different characters) * 7 (the number of different fonts) = 322. The targets batch is a 46 x 252 matrix. Why? Because the network output is a 46-row vector, the nth element being 1 for the nth character and the rest 0. The first seven columns are identical, representing the desired response for the first element, "A", the next seven are also identical, representing the response for the second one, and so on. The characters for which the networks are trained are the capital letters from A to Z, the numbers, the punctuation marks, and the two special Vietnamese characters, Ư and Ơ.
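A sketch of how such a training batch and target matrix could be assembled and a network trained on them (the folder names, file names, character count, hidden-layer size, and the use of feedforwardnet are illustrative assumptions, not the exact contents of train_net.m):

fonts  = {'arial', 'baskerville', 'book_antiqua', 'courier', ...
          'tempus', 'times', 'verdana'};        % hypothetical folder names
nChars = 36;                                    % illustrative: A-Z and 0-9
P = [];  T = [];
for c = 1:nChars
    for f = 1:numel(fonts)
        img = imread(fullfile(fonts{f}, sprintf('%02d.png', c)));   % hypothetical file names
        vec = prelucrare(img);        % assumed to return the 130-element column of step 4
        P   = [P, vec];               % one column per training image
        t   = zeros(nChars, 1);  t(c) = 1;
        T   = [T, t];                 % target: 1 on the row of the character's class
    end
end
net = feedforwardnet(25, 'traingdx');   % hidden-layer size is illustrative
net = train(net, P, T);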
Review of the three training algorithms I used for the networks:
Gradient descent with momentum and adaptive learning rate backpropagation (traingdx)
traingdx is a network training function that updates weight and bias values according to gradient descent with momentum and an adaptive learning rate.
If a network's weight, net input, and transfer functions have derivatives, then the network can be trained with traingdx.
The derivatives of the performance perf with respect to the weight and bias variables X are calculated using backpropagation. Each variable is adjusted according to gradient descent with momentum: dX = mc*dXprev + lr*mc*dperf/dX, where dXprev is the previous change to the weight or bias, lr is the learning rate, and mc is the momentum constant.
For each epoch, if performance decreases toward the goal, then the learning rate is increased by the factor lr_inc. If performance increases by more than the factor max_perf_inc, the learning rate is decreased by the factor lr_dec and the change that increased the performance is not made.
Training stops when any of these conditions occurs:
The maximum number of epochs (repetitions) is reached;
The maximum amount of time has been exceeded;
Performance has been minimized to the goal;
The performance gradient falls below min_grad;
Validation performance has increased more than max_fail times since the last time it decreased (if you use validation)
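These stopping conditions and adaptation factors correspond to fields of MATLAB's training-parameter structure; a typical (illustrative, not project-specific) configuration for a network created with feedforwardnet and traingdx might be:

net = feedforwardnet(25, 'traingdx');   % hidden-layer size is illustrative
net.trainParam.epochs       = 1000;     % maximum number of epochs
net.trainParam.goal         = 1e-4;     % performance goal
net.trainParam.lr           = 0.01;     % initial learning rate
net.trainParam.lr_inc       = 1.05;     % learning-rate increase factor
net.trainParam.lr_dec       = 0.7;      % learning-rate decrease factor
net.trainParam.max_perf_inc = 1.04;     % maximum allowed performance increase
net.trainParam.mc           = 0.9;      % momentum constant
net.trainParam.min_grad     = 1e-5;     % minimum performance gradient
net.trainParam.max_fail     = 6;        % maximum validation failures
[net, tr] = train(net, P, T);           % P, T: the training and target batches described above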
The conjugate gradient algorithms, in particular trainscg, seem to work well on a wide variety of problems, particularly for networks with a large number of weights. Each of the conjugate gradient algorithms except this one requires a line search at each iteration. This line search is computationally expensive, since it requires that the network response to all training inputs be computed several times for each search. The scaled conjugate gradient algorithm (SCG), developed by Moller, was designed to avoid this time-consuming line search. The basic idea is to combine the model-trust-region approach (used in the Levenberg-Marquardt algorithm) with the conjugate gradient approach.
The trainscg routine may require more iterations to converge than the other conjugate gradient algorithms, but the number of computations in each iteration is significantly reduced because no line search is performed. The trainscg algorithm is almost as fast as the LM algorithm on function approximation problems (faster for large networks) and almost as fast as trainrp on pattern recognition problems. Its performance does not degrade as quickly as that of trainrp when the error becomes small. The conjugate gradient algorithms have relatively modest memory requirements.
Multilayer networks typically use sigmoid transfer functions in the hidden layers. These functions are often called "squashing" functions, since they compress an infinite input range into a finite output range. Sigmoid functions are characterized by the fact that their slope approaches zero as the input gets large. This causes a problem when using steepest descent to train a multilayer network with sigmoid functions, since the gradient can have a very small magnitude and therefore cause only small changes in the weights and biases, even though the weights and biases are far from their optimal values.
The purpose of the resilient backpropagation (Rprop) training algorithm is to eliminate these harmful effects of the magnitudes of the partial derivatives. Only the sign of the derivative is used to determine the direction of the weight update; the magnitude of the derivative has no effect on it. The size of the weight change is determined by a separate update value. The update value for each weight and bias is increased by a factor delt_inc whenever the derivative of the performance function with respect to that weight has the same sign for two successive iterations, and it is decreased by a factor delt_dec whenever the derivative with respect to that weight changes sign from the previous iteration. If the derivative is zero, the update value remains the same. Whenever the weights are oscillating, the weight change is reduced; if the weight continues to change in the same direction for several iterations, the magnitude of the weight change increases. A complete description of the Rprop algorithm is given by Riedmiller and Braun.
The performance of Rprop is not very sensitive to the settings of the training parameters, and Rprop generally converges much faster than other algorithms.
Rprop is generally much faster than the standard steepest descent algorithm. It also has the nice property of requiring only a modest increase in memory: we only need to store the update values for each weight and bias, which is equivalent to storing the gradient.
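A generic sketch of one Rprop update step as described above (this is the textbook rule with commonly used factor values, not MATLAB's internal trainrp implementation):

function [w, delta, g_prev] = rprop_step(w, g, g_prev, delta)
% One Rprop update: only the sign of the gradient sets the direction of the
% weight change; the step size is carried in delta and adapted over iterations.
delt_inc  = 1.2;   delt_dec  = 0.5;     % commonly used increase / decrease factors
delta_max = 50;    delta_min = 1e-6;    % bounds on the update values

same_sign = (g .* g_prev) > 0;          % derivative kept its sign: grow the step
flipped   = (g .* g_prev) < 0;          % derivative changed sign: shrink the step
delta(same_sign) = min(delta(same_sign) * delt_inc, delta_max);
delta(flipped)   = max(delta(flipped)   * delt_dec, delta_min);
% when the product is zero, the update value is left unchanged

w      = w - sign(g) .* delta;          % the gradient magnitude plays no role here
g_prev = g;
end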