
DOCUMENT INFORMATION

Title: Optical Character Recognition Using Neural Networks
Author: Theodor Constantinescu
Supervisor: Nguyên Linh Giang
University: Trường Đại Học Bách Khoa Hà Nội (Hanoi University of Science and Technology)
Major: Information Processing and Communication (Xử Lý Thông Tin Và Truyền Thông)
Document type: Master's thesis
Year: 2009
City: Hà Nội
Pages: 74
File size: 2.77 MB



BỘ GIÁO DỤC VÀ ĐÀO TẠO
TRƯỜNG ĐẠI HỌC BÁCH KHOA HÀ NỘI

THEODOR CONSTANTINESCU

OPTICAL CHARACTER RECOGNITION USING NEURAL NETWORKS

CHUYÊN NGÀNH: XỬ LÝ THÔNG TIN VÀ TRUYỀN THÔNG


Contents

I Introduction 3

II Pattern recognition 6

III Optical character recognition (OCR) 26

IV Neural networks 34

V The program 55

VI Conclusions 71


I Introduction

The difficulty of the dialogue between man and machine comes on one hand from the flexibility and variety of the modes of interaction that we are able to use: gesture, speech, writing, etc., and on the other hand from the rigidity of those classically offered by computer systems. Part of the current research in IT is therefore the design of applications best suited to the different forms of communication commonly used by man. The aim is to provide computer systems with features for handling the information that humans themselves manipulate every day.

In general, the information to process is very rich. It can be text, tables, images, words, sounds, writing, and gestures. In this paper I treat the case of writing, or more precisely, printed character recognition. Depending on the application and personal context, the way this information is represented and transmitted is highly variable. Just consider, for example, the variety of writing styles that exists between different languages, and even within the same language. Moreover, because of the sensitivity of the sensors and media used for acquisition and transmission, the information to be processed often differs from the original. It is therefore affected by inaccuracies, either intrinsic to the underlying phenomena or introduced along the transmission path. Its treatment requires the implementation of complex analysis and decision systems. This complexity is a major limiting factor in the dissemination of such information-processing tools. This remains true despite the growth of computing power and the improvement of processing systems, since research is directed at the same time towards solving more and more difficult tasks and towards integrating these applications into cheaper, and therefore lower-capacity, mobile systems.

Optical character recognition is the process through which a program converts the image of a character (usually acquired by a scanner) into the code associated with that character, thus enabling the computer to “understand” the character, which until then was just a cluster of pixels. It turns the image of the character (or of a string of characters, i.e. text) into selectable strings of text that you can copy, as you would with any other computer-generated document. In its modern form, it is an application of artificial intelligence pattern recognition.

OCR is the most effective method available for transferring information from a classical medium (usually paper) to an electronic one. The alternative would be a human reading the characters in the image and typing them into a text editor, which is obviously a stupid, Neanderthal approach when we possess computers with enough power to do this mind-numbing task. The only thing we need is the right OCR software.

Before OCR can be used, the source material must be scanned using an optical scanner (and sometimes a specialized circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software to recognize the images is also required. The OCR software then processes these scans to differentiate between images and text and to determine what letters are represented in the light and dark areas.

The approach of older OCR programs was still a brute-force one: it simply compared the characters to be recognized with sample characters stored in a database. Imagine the number of comparisons, considering how many different fonts exist. Modern OCR software uses complex neural-network-based systems to obtain better results, with much more exact identification, actually close to 100%.

Today's OCR engines combine multiple neural-network-based algorithms to analyze the stroke edge, the line of discontinuity between the text characters and the background. Allowing for irregularities of printed ink on paper, each algorithm averages the light and dark along the side of a stroke, matches it to known characters, and makes a best guess as to which character it is. The OCR software then averages or polls the results from all the algorithms to obtain a single reading.

Advances have made OCR more reliable; expect a minimum of 90% accuracy for average-quality documents. Despite vendor claims of one-button scanning, achieving 99% or greater accuracy takes clean copy, practice in setting scanner parameters, and "training" the OCR software with your documents.

The first step toward better recognition begins with the scanner. The quality of its charge-coupled device (CCD) light arrays will affect OCR results: the more tightly packed these arrays, the finer the image and the more distinct colors the scanner can detect.

Smudges or background color can fool the recognition software. Adjusting the scan's resolution can help refine the image and improve the recognition rate, but there are trade-offs.

For example, in an image scanned in 24-bit color at 1,200 dots per inch (dpi), each of the 1,200 pixels per inch carries 24 bits' worth of color information. This scan will take longer than a lower-resolution scan and produce a larger file, but OCR accuracy will likely be high. A scan at 72 dpi will be faster and produce a smaller file, good for posting an image of the text to the Web, but the lower resolution will likely degrade OCR accuracy. Most scanners are optimized for 300 dpi, but scanning at a higher number of dots per inch will increase accuracy for type under 6 points in size.

Bilevel (black and white only) scans are the rule for text documents. Bilevel scans are faster and produce smaller files, because unlike 24-bit color scans, they require only one bit per pixel. Some scanners also let you control how subtle the color differentiation should be.
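To make the resolution trade-off concrete, here is a rough back-of-the-envelope calculation in Python; the letter-size page and the absence of compression are illustrative assumptions, not figures from the text:

# Rough, uncompressed bitmap sizes for an 8.5 x 11 inch page
# (page size and lack of compression are illustrative assumptions).
def raw_size_mb(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    pixels = (dpi * width_in) * (dpi * height_in)
    return pixels * bits_per_pixel / 8 / 1e6  # megabytes

for dpi, bpp, label in [(1200, 24, "24-bit color, 1200 dpi"),
                        (72, 24, "24-bit color, 72 dpi"),
                        (300, 1, "bilevel, 300 dpi")]:
    print(f"{label}: ~{raw_size_mb(dpi, bpp):.1f} MB")

The 1,200 dpi color scan comes out hundreds of times larger than the bilevel 300 dpi scan, which is why bilevel scanning is the rule for text.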

The accurate recognition of Latin-based typewritten text is now considered largely a solved problem. Typical accuracy rates exceed 99%, although certain applications demanding even higher accuracy require human review for errors. Other areas, including recognition of cursive handwriting and of printed text in other scripts (especially those with a very large number of characters), are still the subject of active research.

Today, OCR software can recognize a wide variety of fonts, but handwriting and script fonts that mimic handwriting are still problematic. Developers are taking different approaches to improve script and handwriting recognition. OCR software from ExperVision Inc., for example, first identifies the font and then runs its character-recognition algorithms.


Which method will be more effective depends on the image being scanned. A bilevel scan of a shopworn page may yield more legible text. But if the image to be scanned has text in a range of colors, as in a brochure, text in lighter colors may drop out.

On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years. Among these are the input devices for personal digital assistants such as those running Palm OS. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.

Whereas commercial and even open-source OCR software performs well on, let's say, ordinary images, a particularly difficult problem for both computers and humans is that of old religious registers of baptisms and marriages. These contain mainly names, the pages can be damaged by weather, water or fire, and the names can be obsolete or written in archaic spellings.

Character recognition has been an active area of research in computer science since the late 1950s. Initially, it was thought to be an easy problem, but it turned out to be a much more interesting one. It may take computers many more decades to read any document with the same precision as human beings.

All the commercial software is quite complex. My aim was to create a simple and reliable program to perform the same tasks.


II Pattern recognition

Pattern recognition is a major area of computing in which research is particularly active. There is a very large number of applications that may require a recognition module in processing systems designed to automate certain tasks for humans. Among these, handwriting recognition systems are a difficult case to handle, as they concentrate on their own much of the difficulty encountered in pattern recognition. In this chapter I give a general presentation of the main pattern recognition techniques.

Pattern recognition is the set of methods and techniques with which we can achieve a classification within a set of objects, processes or phenomena. This is accomplished by comparison with models. A set of models (prototypes), one for each class, is stored in the memory of the computer. The new, unknown input (not yet classified) is compared in turn with each prototype and assigned to one of the classes on the basis of a selection criterion: if the unknown form best matches prototype "i", then it belongs to class "i". The difficulties that arise are related to the selection of a representative model that best characterizes a form class, as well as to the definition of an appropriate selection criterion, able to classify each unknown form unambiguously.

Pattern recognition techniques can be divided into two main groups, generative and discriminative, and there have been long-standing debates on generative vs. discriminative methods. Discriminative methods aim to minimize a utility function (e.g. the classification error) and do not need to model, represent, or "understand" the pattern explicitly. For example, nowadays we have very effective discriminative methods that can detect 99.99% of faces in real images with low false-alarm rates, yet such detectors do not "know" explicitly that a face has two eyes. Discriminative methods often need large training sets, say 100,000 labeled examples, and can hardly be generalized. We should use them if we know for sure that recognition is all we need in an application, i.e. we do not expect to generalize the algorithm to a much broader scope or to other utility functions. In comparison, generative methods try to build models of the underlying patterns, and can be learned, adapted, and generalized with small data.

BAYESIAN INFERENCE

The logical approach to calculating or revising the probability of a hypothesis is called Bayesian inference. It is governed by the classic rules of probability combination, from which Bayes' theorem derives. In the Bayesian perspective, probability is not interpreted as the limit of a frequency, but rather as the numerical translation of a state of knowledge (the degree of confidence in a hypothesis). Bayesian inference is based on the manipulation of probabilistic statements and is particularly useful in problems of induction. Bayesian methods differ from the standard methods in the systematic application of formal rules for transforming probabilities. Before proceeding to the description of these rules, let's review the notation used.

The rules of probability

There are only two rules for combining probabilities, and on them the theory of Bayesian analysis is built: the addition rule and the multiplication rule.

Bayes' theorem can be derived simply by taking advantage of the symmetry of the multiplication rule. This means that if one knows the consequences of a cause, the observation of the effects allows one to trace back to the causes.
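In standard notation, with A and B denoting events, the two rules and the theorem that follows from them read:

\[
\begin{aligned}
\text{Addition rule:} \quad & P(A \cup B) = P(A) + P(B) - P(A \cap B) \\
\text{Multiplication rule:} \quad & P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B) \\
\text{Bayes' theorem:} \quad & P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\end{aligned}
\]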

Evidence notation

In practice, when a probability is very close to 0 or 1, events considered in themselves very improbable must be observed before the probability changes appreciably. Evidence is defined as the logarithm of the odds of the hypothesis; for clarity we often work in decibels (dB), with the following equivalence: an evidence of -40 dB corresponds to a probability of about 10^-4, etc. Ev stands for weight of evidence.
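Written out explicitly, the usual definition of the weight of evidence of a hypothesis H is:

\[
\operatorname{Ev}(H) = 10 \log_{10} \frac{P(H)}{1 - P(H)} \quad \text{[dB]},
\]

so that, for example, P(H) = 10^-4 gives Ev(H) ≈ -40 dB, while P(H) = 0.5 gives Ev(H) = 0 dB.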

Comparison with classical statistics

The difference between Bayesian inference and classical statistics is that:

• Bayesian methods use impersonal methods to update personal, so-called subjective, probabilities (probability is always subjective when one analyzes its foundations);

• classical statistical methods use personal methods in order to treat impersonal frequencies.

The Bayesian and exact conditional approaches to the analysis of binary data are very different, both in philosophy and in implementation. Bayesian inference is based on the posterior distributions of quantities of interest, such as probabilities or parameters of logistic models. Exact conditional inference is based on the discrete distributions of estimators or test statistics, conditional on certain other statistics taking their observed values.

Bayesians thus choose to model their expectations at the beginning of the process (nevertheless revising this first assumption in light of subsequent observations), while classical statisticians fix a priori an arbitrary method and assumption and do not treat the data until after that. Because they do not require a fixed prior hypothesis, Bayesian methods have paved the way for automatic data mining: there is no longer any need for prior human intuition to generate hypotheses before work can start. When should we use one or the other? The two approaches are complementary; classical statistics is generally better when information is abundant and cheap to collect, the Bayesian approach where it is scarce and/or costly to collect. In the case of abundant data, the results are asymptotically the same for both methods, the Bayesian computation simply being more expensive. In contrast, the Bayesian approach can handle cases where classical statistics would not have enough data to apply the limit theorems.

Actually, Altham in 1969 discovered a remarkable result relating the two forms of inference for the analysis of a 2 x 2 contingency table. This result is hard to generalise to more complex examples.

The Bayesian psi-test (which is used to determine the plausibility of a distribution compared with the observations) asymptotically converges to the χ² test of classical statistics as the number of observations becomes large. The seemingly arbitrary choice of a Euclidean distance in the χ² statistic is perfectly justified a posteriori by the Bayesian reasoning.

Example: From which bowl is the cookie?

To illustrate, suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, and likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1? Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes's theorem. Let H1 correspond to bowl #1 and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls, we know that P(E | H1) = 30 / 40 = 0.75 and P(E | H2) = 20 / 40 = 0.5. Bayes's formula then yields

\[
P(H_1 \mid E) = \frac{P(E \mid H_1)\,P(H_1)}{P(E \mid H_1)\,P(H_1) + P(E \mid H_2)\,P(H_2)} = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} = 0.6
\]

Before we observed the cookie, the probability we assigned to Fred having chosen bowl #1 was the prior probability, P(H1), which was 0.5. After observing the cookie, we revise the probability to P(H1 | E), which is 0.6.
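A few lines of Python (an illustrative check, not part of the thesis program) reproduce this update:

# Bayes update for the cookie example
p_h1, p_h2 = 0.5, 0.5              # priors: Fred picks either bowl with equal probability
p_e_h1, p_e_h2 = 30 / 40, 20 / 40  # likelihood of a plain cookie from each bowl

p_e = p_e_h1 * p_h1 + p_e_h2 * p_h2   # total probability of a plain cookie
posterior_h1 = p_e_h1 * p_h1 / p_e    # P(H1 | E)
print(posterior_h1)                   # 0.6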

HIDDEN MARKOV MODEL

Hidden Markov models are a promising approach in various application areas where one has to deal with quantified data that can be partially wrong, for example the recognition of images (characters, fingerprints, patterns and sequences in genes, etc.).

The data production model

A hidden Markov chain is an automaton with a finite set of states. Its behaviour is characterized by the transition probabilities a(m, m') between states, the emission probabilities b(m, n) of observing symbol n in state m, and the initial state distribution d(m); these parameters are used in the three problems below.


1 – Recognition

We have observed Y = [y0, …, yt, …, yT], and the model [a(m,m'), b(m,n), d(m)] is given. What is the most likely state sequence S = [s0, …, st, …, sT] that created it?

2 - Probability of observing a sequence

We have observed a sequence of measures Y = [y0, …, yt, …, yT]. What is the probability that the automaton characterized by the parameters [a(m,m'), b(m,n), d(m)] has produced this sequence?

3 - Learning

We have observed Y = [y0, …, yt, …, yT]. How do we calculate (or rather update) the model's parameters [a(m,m'), b(m,n), d(m)] in order to maximize the probability of the observation?

The following algorithm aims to find the sequence of states most likely to have produced the measured sequence Y = [y0, …, yt, …, yT]. At moment t we compute recursively, for each state m', the score rt(m') of the most probable partial state sequence ending in that state, the maximum being taken over all possible state sequences S = [s0, …, st-1].

Initialization: at the moment t = 0, r0(m) is obtained from the initial distribution and the first observation: r0(m) = d(m) b(m, y0).

Recurrence: let's assume that at the moment t-1 we have calculated rt-1(m) for each state. The state m most likely occupied at the moment t-1, from which the automaton has evolved into the state m' at the moment t, is the state for which rt-1(m) a(m,m') b(m', yt) is maximum. For each state m' we thus calculate rt(m'); each of these states has a predecessor qt(m'). This predecessor can be used to recover the state sequence most likely to have created the measurements [y0, …, yt, …, yT].

End of the algorithm: the state retained at the final moment T is the one for which rT(m) is maximum, and the probability for the measured sequence to have been emitted by the automaton along this path is rT(m). We can then find the full sequence of states by following the predecessors qt(m) recursively back to s0.

The probability that a state sequence S = [s0, …, st, …, sT] has generated Y is obtained by using the property of Markovian sources.
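As a concrete illustration of the recurrence described above, here is a minimal Viterbi sketch in Python; the array-based layout of the parameters a, b and d is my own choice for the example, not the representation used in the thesis program:

import numpy as np

def viterbi(Y, a, b, d):
    """Most likely state sequence for observations Y.

    a[m, m2] : transition probability from state m to m2
    b[m, n]  : probability of emitting symbol n in state m
    d[m]     : initial state distribution
    """
    M, T = a.shape[0], len(Y)
    r = np.zeros((T, M))              # r[t, m] = score of best path ending in m at time t
    q = np.zeros((T, M), dtype=int)   # predecessor states

    r[0] = d * b[:, Y[0]]             # initialization at t = 0
    for t in range(1, T):
        for m2 in range(M):
            scores = r[t - 1] * a[:, m2] * b[m2, Y[t]]
            q[t, m2] = np.argmax(scores)   # best predecessor of state m2
            r[t, m2] = scores[q[t, m2]]

    # backtrack from the best final state
    states = [int(np.argmax(r[T - 1]))]
    for t in range(T - 1, 0, -1):
        states.append(int(q[t, states[-1]]))
    return list(reversed(states)), float(np.max(r[T - 1]))

# toy model: 2 states, 2 observation symbols
a = np.array([[0.7, 0.3], [0.4, 0.6]])
b = np.array([[0.9, 0.1], [0.2, 0.8]])
d = np.array([0.5, 0.5])
print(viterbi([0, 0, 1, 1], a, b, d))

For long sequences one would normally work with log-probabilities to avoid numerical underflow.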


1 A Markov system has N states, called s1, s2, ..., sN.

2 There are discrete timesteps, t = 0, t = 1, …

3 On the t-th timestep the system is in exactly one of the available states; let's call it qt (qt ∈ {s1, s2, ..., sN}).

4 Between each timestep, the next state is chosen randomly.

5 The current state determines the probability distribution for the next state.

Applications

Each form appears as a point in the form space. This space, denoted H_x, can be described by the matrix of values x(i, j):

\[
H_x = \{\, x(i, j) \;;\; i = 1, 2, \dots, N;\; j = 1, 2, \dots, m \,\} = \{\, x_k^{T} \;;\; k = 1, 2, \dots, N \,\}
\]

where N is the number of forms.

Form classification can be understood as a partitioning of the form space into mutually exclusive domains, each domain belonging to a class, where F is the set of points that constitute the boundaries between classes.

Mathematically, this kind of classification can be defined through a discriminant function D_j(x) associated with the form class h_j (j = 1, 2, ..., n), with the property that if the form represented by the vector x belongs to h_i, then D_i(x) takes the largest value among all the discriminant functions.

DECISION THEORY CLASSIFIERS (DISTRIBUTION-FREE)


This approach is based on evaluating the distances between the form to be classified and a set of reference vectors from the form space. We assume that n reference vectors, denoted R1, ..., Rn, are known, one for each class, where Mj is the number of forms in class hj used to compute Rj. The distance between the form x and the reference vector Ri of class hi is measured with a metric of order r; for r = 2 we obtain the quadratic discriminant function, with k1, k2, ..., kr = 1, ..., m and n1, n2 = 0 or 1.

Discriminant functions of linear and non-linear type are represented below, in the case of a two-dimensional space (the form space is a plane).

LINEAR CLASSIFICATION

For an observation vector x in R^n, the output of the classifier is given by:

\[
y = f(w \cdot x + w_0)
\]

where w is the vector of weights, w0 is the bias, and f is a function that converts the scalar product of the two vectors into the desired output. The weight vector w is learned from a labeled training set. The function f is often a simple threshold function, such as the sign function or the Heaviside function, or a more complex function such as the hyperbolic tangent or the sigmoid function. A more complex decision function could yield the probability that a sample belongs to a certain class.

For a problem of discrimination in two classes, the operation performed by a linear classifier can be seen as the separation of a high-dimensional space by a hyperplane: all points on one side of the hyperplane are classified as 1, the others as -1. This hyperplane is called the separating hyperplane.
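A minimal sketch of such a linear decision rule in Python, using the sign function as f and assuming the weights have already been learned:

import numpy as np

def linear_classifier(x, w, w0):
    """Two-class linear decision rule: sign of w . x + w0."""
    return 1 if np.dot(w, x) + w0 >= 0 else -1

# toy usage: a hyperplane separating points by the sign of x1 + x2 - 1
w, w0 = np.array([1.0, 1.0]), -1.0
print(linear_classifier(np.array([2.0, 0.5]), w, w0))   #  1
print(linear_classifier(np.array([0.1, 0.2]), w, w0))   # -1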

Linear classifiers are often used in situations where low complexity is desired, since they are the simplest and therefore fastest classifiers, especially when the observation vector x is sparse. However, decision tree methods can be faster still. Linear classifiers often get good results when N, the number of dimensions of the observation space, is large, as in text retrieval, where each element of x is, for example, the number of occurrences of a word in a document.

Generative vs. discriminative models

There are two main types of methods for estimating the parameters of the weight vector of a linear classifier.

The first is to model the class-conditional probability distribution; these are called generative models. Examples of algorithms of this type are:


• Linear discriminant analysis (or Fisher's linear discriminant, FLD). It assumes a model based on a probability distribution of Gaussian type.

• The naive Bayes classifier, which assumes a conditional probability distribution of binomial type.

The second approach groups together the discriminative models; it first seeks to maximize the classification quality, and in a second step a cost function adapts the final classification model (minimizing the errors). Some examples of training classifiers by the discriminative method:

• Perceptron: an algorithm that seeks to correct all errors encountered when processing the training set (and thus improve the learning and the model created from the training set).

• Support vector machine: an algorithm that maximizes the margin of the classifier's separating hyperplane, using the training set for learning.

It is generally accepted that models trained by a discriminative method (SVM, logistic regression) are more accurate than generative models trained with conditional probabilities (naive Bayes or linear discriminant analysis). Generative classifiers, however, are considered better suited to classification with a lot of missing data (e.g. text classification with little training data).

LINEAR DISCRIMINANT ANALYSIS

Linear discriminant analysis is one of the techniques of predictive discriminant analysis. It is about explaining and predicting the membership of an individual in a class (group) from characteristics measured on predictive variables. The variable to predict is necessarily categorical (discrete). Linear discriminant analysis can be compared to supervised methods developed in machine learning and to logistic regression in statistics.

We have a sample of observations distributed across K groups. Let Y be the variable to predict, taking its values among the K class labels, and let X = (X1, ..., XJ) be the predictive variables. We denote by μ_k the centers of gravity of the conditional point clouds and by Σ_k their variance-covariance matrices.

The aim is to produce an assignment rule that can predict, for a given observation ω, its associated value of Y from the values taken by X. The Bayesian rule consists in producing an estimate of the a posteriori probability of assignment P(Y = y_k | X), which is proportional to P(Y = y_k) f(X | Y = y_k), where P(Y = y_k) is the a priori probability of belonging to class k and f(X | Y = y_k) represents the density of X conditional on the class.


The allocation rule for an individual ω to be classified becomes: assign ω to the class y_k* if and only if P(Y = y_k* | X(ω)) = max_k P(Y = y_k | X(ω)). The whole issue of discriminant analysis is then to provide an estimate of the conditional density f(X | Y = y_k).

There are two main approaches to estimating this distribution:

• The non-parametric approach makes no assumption about the distribution and proposes a procedure for local estimation of the probabilities in the vicinity of the observation ω to be classified. The best-known procedures are Parzen kernels and the method of nearest neighbors. The main challenge is to define the neighborhood adequately.

In statistics, kernel density estimation (or the Parzen-Rosenblatt method) is a non-parametric method for estimating the probability density of a random variable. It is based on a sample from a statistical population and estimates the density at any point of the support. In this sense, it cleverly generalizes the method of estimation by histogram.

The idea behind the Parzen method is a generalization of histogram estimation. In the histogram method, the density at a point x is estimated by the proportion of observations x1, x2, ..., xN in the vicinity of x. To do this, one draws a box centered at x whose width is governed by a smoothing parameter h, and counts the number of observations that fall into this box. This estimate, which depends on the smoothing parameter h, has good statistical properties but is not continuous. The kernel method recovers continuity: it replaces the box centered at x with width h by a bell curve centered at x. The closer an observation is to the support point x, the higher the numerical value the bell curve gives it. In contrast, observations too far from x are assigned a negligible numerical value. The estimator is formed by the sum (or rather the average) of the bell curves and, as shown in the picture below, it is clearly continuous.
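In the usual notation, the kernel estimator of the density f at a point x is:

\[
\hat{f}_h(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right),
\]

where K is the kernel (typically a Gaussian density) and h is the smoothing parameter (bandwidth).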


Figure: six Gaussian bell curves (red) and their sum (blue).

In this figure, the kernel estimator of the density f(x) is the average of the six bell curves; the variance of the normals is set to 0.5. The more observations there are in the neighborhood of a point, the higher the estimated density.

It can be shown that under weak assumptions, there is no nonparametric estimator which converges faster than the kernel estimator

The practical use of this method requires two things:

• the kernel K (usually the density of a statistical law);

• the smoothing parameter h

While the choice of kernel is known to have little influence on the estimator, the same is not true of the smoothing parameter. A value of h that is too low causes artificial details to appear on the plot of the estimator, while for a value of h that is too large the majority of features are, on the contrary, erased. The choice of h is therefore a central issue in density estimation.

• The second approach makes assumptions about the distribution of the conditional point clouds; we speak in this case of parametric discriminant analysis. The most commonly used hypothesis is undoubtedly the multivariate normal (Gaussian) law. In the case of a multidimensional normal distribution, the distribution of the conditional point clouds is the Gaussian density with mean μ_k and covariance matrix Σ_k, where |Σ_k| denotes the determinant of the variance-covariance matrix conditional on class k.

The objective is to determine the maximum of the a posteriori probability of assignment, so we can ignore everything that does not depend on k. Passing to the logarithm, we obtain the discriminant score, which is proportional to:

\[
d_k(x) = \ln P(Y = y_k) - \tfrac{1}{2} \ln |\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k)
\]

The assignment rule becomes: assign ω to the class k for which the discriminant score d_k(x) is maximal.

If we fully develop the discriminant score, we obtain quadratic discriminant analysis. Widely used in research because it behaves very well in terms of performance compared to other methods, it is less widespread in practice. Indeed, the expression of the discriminant score is rather complex, and it is difficult to discern clearly the direction of causality between the predictive variables and the class. It is especially difficult to distinguish the truly determinant variables in the classification, so that the interpretation of the results is difficult.

Linear discriminant analysis introduces the additional assumption that the class covariance matrices are equal (homoscedasticity). The estimated variance-covariance matrix is in this case the within-class (intra-class) variance-covariance matrix, calculated as the pooled covariance of the groups. Again, we can drop from the discriminant score everything that no longer depends on k. By developing the expression of the discriminant score after introducing this approximately constant variability, we see that it is expressed linearly in terms of the predictive variables. We therefore have as many classification functions as there are values of the variable to predict; they are linear combinations of the following form:
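Under these assumptions, the discriminant score given earlier reduces to the standard linear form (with Σ the pooled within-class covariance matrix):

\[
d_k(x) = x^{T}\,\Sigma^{-1}\mu_k \;-\; \tfrac{1}{2}\,\mu_k^{T}\,\Sigma^{-1}\mu_k \;+\; \ln P(Y = y_k),
\]

which is indeed linear in x.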


SUPPORT VECTOR MACHINE

Support vector machines (SVMs) are a set of supervised learning techniques for solving problems of discrimination and regression. SVMs were developed in the 1990s from theoretical considerations by Vladimir Vapnik on a statistical theory of learning: the Vapnik-Chervonenkis theory. SVMs were quickly adopted for their ability to work with large amounts of data, their low number of hyperparameters, the fact that they are theoretically well founded, and their good practical results. SVMs have been applied to many fields (bioinformatics, information retrieval, computer vision, finance). Published results indicate that the performance of support vector machines is similar or even superior to that of neural networks.

Separators with large margins are based on two key ideas: the concept of maximum margin and the concept of kernel functions. These two concepts had existed for several years before they were combined to construct the SVM.

The first key idea is the notion of maximum margin. The margin is the distance between the separating boundary and the closest samples; these samples are called support vectors. In an SVM, the separating boundary is chosen as the one that maximizes the margin. This is justified by the Vapnik-Chervonenkis theory (statistical learning theory), which shows that the maximum-margin separating boundary has the smallest capacity. The problem is to find this optimal separating boundary from a learning set, which is done by formulating the problem as a quadratic optimization problem for which known algorithms exist.

In order to address cases where the data are not linearly separable, the second key idea of SVMs is to transform the representation of the data into a larger space (possibly of infinite dimension), in which it is likely that a linear separator exists. This is done through a kernel function, which must meet certain conditions, and which has the advantage of not requiring explicit knowledge of the transformation applied to change the space (for example from Cartesian to radial coordinates). Kernel functions allow us to replace a scalar product in the large space, which is expensive to compute, by a simple evaluation of a function. This technique is known as the kernel trick.
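As a small illustration of the kernel trick (an illustrative sketch, not the formulation used later in the thesis), the Gaussian (RBF) kernel evaluates a similarity that corresponds to an inner product in a very high-dimensional feature space without ever computing that space explicitly:

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: k(x, z) = exp(-gamma * ||x - z||^2).

    Equivalent to an inner product <phi(x), phi(z)> in an
    infinite-dimensional feature space, evaluated directly in input space.
    """
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([1.5, 1.0])
print(rbf_kernel(x, z))  # similarity used in place of the explicit mapping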

SVMs can be used to solve problems of discrimination, i.e. deciding which class a sample belongs to, or of regression, i.e. predicting the numerical value of a variable. The resolution of these problems relies on a function h which, for an input vector x, produces a corresponding output y: y = h(x).

NEURAL NETWORKS

An artificial neural network is a computational model whose design is roughly based on the functioning of real neurons (human or not). Neural networks are usually optimized by statistical learning methods, so they belong both to the family of statistical applications and to the family of artificial intelligence methods, which they enrich by allowing decisions based more on perception than on formal logic. Neural networks are built on a biological paradigm, that of the formal neuron (in the same way as genetic algorithms are built on natural selection). These types of biological metaphors became common with the ideas of cybernetics.


Neural networks, as systems capable of learning, implement the principle of induction, i.e. learning by experience. From confrontation with individual situations they infer an integrated decision system whose generic character depends on the number of learning cases encountered and on their complexity compared with the complexity of the problem to solve. By contrast, symbolic systems capable of learning, even when they also implement induction, are based on algorithmic logic, using a complex set of deductive rules (e.g. PROLOG).

With their ability for classification and generalization, neural networks are generally used in problems of a statistical nature, such as automatic classification of postcodes or character recognition. The neural network does not always produce a rule that a human could use: the network often remains a black box that provides an answer when a pattern is presented to it, but it does not give a justification that is easy to interpret.

Neural networks are actually used, for example, for:

• classification, e.g. the classification of animal species from pictures;

• pattern recognition, e.g. optical character recognition (OCR), and in particular by banks to verify the amounts on checks and by the post office to sort mail according to postal code, etc.;

• approximation of an unknown function;

• accelerated modeling of a function that is known but very complicated to compute;

• stock market estimation: attempts to predict the evolution of stock prices. This type of prediction is highly contested, because it is not clear that the price of a share has a recurring character (the market largely anticipates rises as well as falls that are predictable, which subjects any possible periodicity to variations that make it difficult to exploit reliably);

• modeling of learning and improvement of teaching techniques.

Limits

• Artificial neural networks require real example cases for learning (what we call the learning base). These cases must be more numerous the more complex the problem is and the less structured its topology. For example, we can train a neural character-reading system by using images of a large number of characters written by hand by many people. Each character can be presented as a raw image, with a topology of two spatial dimensions, or as a series of connected segments. The chosen topology, the complexity of the modeled phenomenon, and the number of examples must be consistent with one another. In practice this is not always easy, because the examples may be limited in quantity or too expensive to collect in sufficient numbers.

The neuron calculates the sum of its inputs multiplied by the corresponding weights (and adds a bias, if any), and this value is then passed through the transfer function to produce its output.

A neural network is generally composed of a succession of layers, each of which takes as inputs the outputs of the previous one. Each layer i is composed of Ni neurons, taking their inputs from the Ni-1 neurons of the previous layer. A synaptic weight is associated with each synapse, so that the Ni-1 outputs are multiplied by these weights and then summed by the neurons of level i. Beyond this simple structure, the neural network may also contain loops that radically change its possibilities but also its complexity. In the same way that loops can transform combinatorial logic into sequential logic, loops in a network of neurons transform a device for recognizing inputs into a complex machine capable of all sorts of behaviors.
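A minimal sketch of this computation in Python, with the sigmoid chosen as transfer function purely for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the transfer function."""
    return sigmoid(np.dot(weights, inputs) + bias)

def layer_output(inputs, weight_matrix, biases):
    """One layer: each row of weight_matrix holds the weights of one neuron."""
    return sigmoid(weight_matrix @ inputs + biases)

x = np.array([0.2, 0.8, 0.5])                       # outputs of the previous layer
W = np.array([[0.1, -0.4, 0.7], [0.3, 0.2, -0.1]])  # 2 neurons, 3 inputs each
b = np.array([0.05, -0.2])
print(layer_output(x, W, b))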


A neural network is made of a very large number of small identical processing units called artificial neurons. The first implementations were electronic (the Rosenblatt perceptron); today they are most often simulated on a computer, for reasons of cost and convenience.

Neurobiologists know that each biological neuron is sometimes connected to thousands of others, and that they transmit information by sending waves of depolarization. More specifically, a neuron receives input signals from other neurons through its synapses and outputs information through its axon. In a roughly similar manner, artificial neurons are connected together by weighted connections. With their size and speed, such networks can handle questions of perception or automatic classification very well.

• MLP-type networks (Multi-Layer Perceptron) calculate a linear combination of the inputs, i.e. the aggregation function returns the inner product between the vector of inputs and the vector of synaptic weights.

• RBF-type networks (Radial Basis Function) calculate the distance between the inputs, i.e. the aggregation function returns the Euclidean norm of the vector difference between the input vectors.

The activation function (or transfer function) is used to introduce non-linearity in the functioning of the neuron

Propagation of information

After this calculation is made, the neuron propagates its new internal state forward through its axon. In a simple model, the neuron's function is simply a threshold function: its output is 1 if the weighted sum exceeds a certain threshold, 0 otherwise. In a richer model, the neuron operates with real numbers (often in the interval [0,1] or [-1,1]). The neural network is said to go from one state to another when all of its neurons recalculate their internal state in parallel, according to their inputs.

Learning

The concept of learning here is not the one modeled by deductive logic: that kind of learning starts from what is already known and derives new knowledge from it. Learning in neural networks takes the opposite approach: from limited observations it draws plausible generalizations; it is a process of induction.

The concept of learning covers two aspects:

• memory: the ability to assimilate possibly many examples in a compact form;

• generalization: the ability, having learned through examples, to treat distinct examples not yet encountered but similar to those of the training set.

These two points are partly in opposition: if we favor one of them, we develop a system that does not handle the other very effectively. In the case of statistical learning systems, used to optimize conventional statistical models, neural networks and Markov automata, it is generalization that we should focus on.

Learning can be supervised or not.

Learning is supervised when the network is forced to converge to a specific final state at the same time as a pattern is presented to it.

In contrast, in unsupervised learning, the network is allowed to converge to any state when a pattern is presented to it.


Algorithm

The vast majority of neural network algorithms go through a training phase, which consists in changing the synaptic weights according to a set of data presented at the input of the network. The purpose of this training is to enable the neural network to learn from examples. If the training is done correctly, the network is able to provide output responses very close to the original values of the training set. But the whole point of neural networks lies in their ability to generalize to examples outside the training set.

Overtraining

Often, the examples of the learning base include noisy or approximate values. If the network is required to respond almost perfectly to these examples, we may get a network that is biased by incorrect values. To avoid this, there is a simple solution: divide the examples into two subsets, the first used for learning and the second for validation. As long as the error obtained on the second set decreases, learning can continue; otherwise, it should be stopped.

Backpropagation

Backpropagation consists in transmitting backward the error that a neuron "commits" to its synapses and to the neurons connected to them. For neural networks, we typically use backpropagation of the error gradient, which corrects mistakes according to the importance of the elements that participated in creating them: synaptic weights that contribute to generating a significant error are modified more significantly than weights that led to a small error.

The set of synaptic connection weights determines the functioning of the neural network. Patterns are presented to a subset of the neural network: the input layer. When a pattern is applied to a network, the network aims to reach a stable state; when it is reached, the activation values of the output neurons constitute the result. Neurons that are part of neither the input layer nor the output layer are called hidden neurons.

The types of neural networks differ in several parameters:

• the topology of connections between neurons;

• the aggregation function used (weighted sum, pseudo-Euclidean distance, etc.);

• the activation function used (sigmoid, step, linear function, Gaussian, etc.);

• the learning algorithm (gradient backpropagation, cascade correlation, etc.).

Many other parameters may be implemented as part of the learning of these neural networks, for example:

• the method of weight decay, which avoids side effects and neutralizes over-learning.

Since there is a whole chapter dedicated to neural networks, this presentation is general; all important details are given in chapter IV.

Examples of pattern recognition

For optical character recognition we have an image (e.g. a bitmap) containing printed (or handwritten) text. The image may come from scanning a page of a printed paper. We assume that each character has first been segmented (using specific image processing techniques), so that we have a set of binary objects. The image corresponding to each character can be codified by an m x n matrix of values belonging to {0,1}, where 0 encodes the absence of a black pixel and 1 its presence. Below are examples of the letters a, b, c and d represented in binary form.
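For instance, a coarse binary matrix for a letter might look as follows (a toy 5 x 4 encoding chosen for illustration, not the resolution used in the thesis):

import numpy as np

# toy 5 x 4 binary image of a letter "c": 1 = black pixel, 0 = white pixel
letter_c = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
])
print(letter_c.shape)  # (m, n) = (5, 4)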

Quantitative characteristics: electron microscopy images classification

The left image comes from electron microscopy of the serum of hepatitis B patients and highlights 3 types of particles: small spherical particles with a diameter of 22 nm, tubular forms with a thickness of 22 nm and a length of 20-250 nm, and Dane particles of circular form with a diameter of 42 nm. The hepatitis B virus is considered to be the viral particle called the Dane particle. Using specific image-processing segmentation techniques, the right-hand image was obtained, where the particles are represented as binary objects.

In order to detect the presence of the Dane particles, we represent the forms (particles) using two features associated with the objects' geometry: area and circularity. The area of an object is given by the number of pixels in the image that compose the object. Circularity is defined as the ratio of the area to the squared perimeter, C = 4πA/P², and is an indicator of the shape of the object: circularity is 1 for a circle and less than 1 for any other geometrical figure. This allows discrimination between the circular and tubular shapes, while the area distinguishes the two circular forms (diameters 22 and 42 nm respectively).

Each form (particle) is thus described by the pair (A, C), so the form space is R². Calculating the values of the two features for each particle in the image, we represent each form as a point in the 2D plane, with the features associated with the two axes. Similar representations can be made with 3 features (in 3D space).
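The circularity formula can be checked directly; the sketch below (illustrative only) compares a circle and a square of equal area:

import numpy as np

def circularity(area, perimeter):
    """C = 4*pi*A / P**2: equals 1 for a perfect circle, less than 1 for any other shape."""
    return 4 * np.pi * area / perimeter ** 2

r = 21.0                       # circle of radius r
s = np.sqrt(np.pi) * r         # square of side s with the same area as the circle
print(circularity(np.pi * r ** 2, 2 * np.pi * r))  # -> 1.0
print(circularity(s ** 2, 4 * s))                  # -> pi/4, about 0.785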


Figure: the particles represented as points in the (area, circularity) plane, grouped by type.

From this representation of the observed forms (particles) of the three types, we can distinguish:

Class 1: small circular corpuscles (small area and circularity approximately 1)

Class 2: Dane corpuscles (large area, circularity approximately 1)

Class 3: tube-shaped corpuscles (area varying due to their length of 20-250 nm, circularity approximately 0.5)

PATTERN RECOGNITION APPLICATIONS

• Biometrics: voice, iris, fingerprint, face, and gait recognition

• Lie detector

• Handwritten Zip code/digit/letter recognition

• Speech/voice recognition

• Smell recognition (e-nose, sensor networks)

• Defect detection in chip manufacturing

• Reading DNA sequences

• Medical diagnosis

Of all these, I decided to study the application of pattern recognition to optical (printed) character recognition (OCR); my approach is described in the next chapter, III.


III Optical character recognition (OCR)

it, and with the help of Harvey Cook, a friend, built "Gismo" in his attic during evenings and weekends

Shepard then founded Intelligent Machines Research Corporation (IMR), which delivered the first OCR systems in the world to be operated by private companies. The first private system was installed at Reader's Digest in 1955 and, many years later, was donated by Reader's Digest to the Smithsonian, where it was put on display. Other systems sold by IMR during the late 1950s included a billing slip reader for the Ohio Bell Telephone Company and a scanner for the U.S. Air Force for reading and transmitting typed messages by telex. IBM and others later also used the Shepard patents. Since 1965, the United States Postal Service has used OCR machines, whose principle of operation was designed by Jacob Rabinow, a prolific inventor, to sort mail. Canada Post has used OCR systems since 1971. OCR systems read the name and address at the first automated sorting center and print a barcode on the envelope. The letters can then be sorted in the following centers by less expensive sorters which only need to read the barcodes. To avoid any interference with the address, which can be anywhere on the letter, a special ink is used that is clearly visible under UV light; this ink appears orange in normal lighting conditions.

TECHNIQUE

The following steps are generally followed in a character recognition system:

1 Encoding: the data must be represented in a form the computer can process; the image is acquired with, e.g., a camera or a scanner.

2 Pre-processing: noise cancellation, data size reduction, normalization, recovery of slanted or distorted images, contrast correction, the switch to two colors (black and white, or rather paper and ink), contour detection, etc.

3 Segmentation, i.e. isolating the lines of text in the image and the characters within the lines. It is also possible to detect underlined text, frames, and images.


4 Learning is building up a classifying model and assigning a class to each element in the training set

5 Analysis and decision consist in attributing a previously unknown object to one of the previously determined classes.

Recognition methods

1 Classification by features: a shape to recognize is represented by a vector of numerical values, the features, calculated from this shape. The number of features is typically around 100 to 300. If the features are well chosen, a character class is represented by a "cloud" of adjacent points in the feature vector space. The role of the classifier is to determine to which cloud (i.e. which class of characters) the shape to recognize most likely belongs. This class of methods includes the various types of artificial neural networks trained on databases of possible forms.

2 Metric methods: these consist in directly comparing the shape to recognize, using distance algorithms, with a series of learned models. This kind of method is rarely used and little valued by researchers, because it gives poor results.

3 Statistical methods: methods such as Bayesian networks and Markov chains are often used, especially in the field of handwriting recognition.

6 Post-processing means the possible validation of the recognition decision, for example the use of linguistic and contextual rules to reduce the number of recognition errors: dictionaries of words, syllables, ideograms. In industrial systems, specialized techniques for certain areas of text (names, addresses) can use databases to eliminate incorrect solutions.

OCR used to work only with computer-generated writing: it could only recognize typed letters. Recently, however, handwritten text recognition has also improved enormously, mainly due to the demand for so-called "online" recognition, in which the user writes by hand directly into the device (mobile phone, tablet PC, etc.). The other, "classical" recognition, working with already complete documents (images), is called "offline" recognition; this is the part of OCR studied in this paper. Another element to consider is that OCR needs to scan the text, and for that the text needs to be of a reasonable size. If the font is too small, OCR will not be able to determine the separation between letters and will only see a blob of shapes.


This depends largely on the quality of the scanner, as higher-quality scanners tend to produce considerably sharper images that make optical character recognition easier.

Key elements for a successful OCR system

1 It takes a complementary merging of the input document stream with the processing requirements of the particular application, within a total system concept that provides for convenient entry of exception-type items and an output that provides cost-effective entry to complete the system. To show a successful example, let's review the early credit card OCR applications.

1 Input was a carbon-imprinted document. However, if the carbon was wrinkled, the imprinter was misaligned, or any one of a variety of other problems existed, the imprinted characters were impossible to read accurately.

2 To compensate for this problem, the processing system permitted direct key entry of the failed-to-read items at a fairly high speed. Directly keyed items from the misread document were under intelligent computer control, which placed the proper data in the right location in the data record. Important considerations in designing the system encouraged the use of modulus-controlled check digits for the embossed credit card account number. This, coupled with tight monetary controls by batch totals, reduced the chance of read substitutions.

3 The output of these early systems provided a "country club" type of billing; that is, each of the credit card sales slips was returned to the original purchaser. This provided the credit card customer with the opportunity to review his own purchases to ensure the final accuracy of billing. This has been a very successful operation through the years. Today's systems improve the process by increasing the amount of data to be read, either directly or through reproduction of details on the sales draft. This provides customers with a "descriptive" billing statement which itemizes each transaction. Attention to the details of each application step is a requirement for successful OCR systems.

Hand-printed recognition systems

Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (at better than 98%) recognize all handwritten cursive script.

This has been a subject of intensive research in the last 10 years, and significant improvements in the performance of recognition systems have been achieved. Current systems are capable of transcribing handwriting with average recognition rates of 50-99 percent, depending on the constraints imposed (e.g. size of vocabulary, writer dependence, writing style, etc.) and also on the experimental conditions. The improvements in performance have been achieved by different means. Some researchers have combined different feature sets or used optimized feature sets. Better modeling of reference patterns and adaptation have also contributed to improving the performance. However, one of the most successful approaches to achieving better performance is the combination of classifiers. This approach has been used especially in application domains where the size of the lexicon is small. Combination of classifiers relies on the assumption that different classification approaches have different strengths and weaknesses which can compensate for each other through the combination. Verification can be considered as a particular case of combination of classifiers. The term verification is encountered in other contexts, but there is no consensus about its meaning. Oliveira defines verification as the postprocessing of the results produced by recognizers.

Madhvanath defines verification as the task of deciding whether a pattern belongs to a given class. Cordella defines verification as a specialized type of classification devoted to ascertaining in a dependable manner whether an input sample belongs to a given category. Cho defines verification as the validation of hypotheses generated by recognizers during the recognition process. In spite of the different definitions, some common points can be identified, and a broader definition of verification could be: a postprocessing procedure that takes as input hypotheses produced by a classifier or recognizer and provides as output a single reliable hypothesis or a rejection of the input pattern. In this paper, the term verification is used to refer to the postprocessing of the output of a handwriting recognition system resulting in rescored word hypotheses.

In handwriting recognition, Takahashi and Griffin were among the earliest to mention the concept of verification; the goal was to enhance the recognition rate of an OCR algorithm. They designed a character recognition system based on a multilayer perceptron (MLP) which achieves a recognition rate of 94.6% for uppercase characters of the NIST database. Based on an error analysis, verification by linear tournament with one-to-one verifiers between two categories was proposed, and such a verification scheme increased the recognition rate by 1.2 percent. Britto used a verification stage to enhance the recognition of an HMM-based handwritten numeral string system. The verification stage, composed of 20 numeral HMMs, improved the recognition rate for strings of different lengths by about 10% (from 81.65% to 91.57 percent). Powalka proposed a hybrid recognition system for online handwritten word recognition where letter verification is introduced to improve disambiguation among word hypotheses.

A multiple interactive segmentation process identifies parts of the input data which can potentially be letters. Each potential letter is recognized and further concatenated to form strings. The letter verification procedure produces a list of words constrained to be the same words provided by a holistic word recognizer. Scores produced by the word recognizer and by the letter verifier are integrated into a single score using a weighted arithmetic average. Improvements of between 5% and 12% in the recognition rate are reported. Madhvanath describes a system for rapid verification of unconstrained offline handwritten phrases using perceptual holistic features. Given a binary image and a verification lexicon containing ASCII strings, holistic features are predicted from the verification ASCII strings and matched with the feature candidates extracted from the binary image. The system rejects errors with 98% accuracy at a 30% acceptance level. Guillevic and Suen presented a verification scheme at the character level for handwritten words from a restricted lexicon of legal amounts on bank checks. Characters are verified using two k-NN classifiers. The results of the character recognition are integrated with a word recognition module to shift word hypotheses up and down to enhance the word recognition rate.

Some works give a different meaning to verification and attempt to improve reliability. Recognition rate is a valid measure to characterize the performance of a recognition system, but, in real-life applications, systems are required to have high reliability. Reliability is related to the capability of a recognition system not to accept false word hypotheses and not to reject true word hypotheses. Therefore, the question is not only to find a word hypothesis, but also to find out the trustworthiness of the hypothesis provided by a handwriting recognition system. This problem may be regarded as being as difficult as the recognition itself. It is often desirable to accept only word hypotheses that have been decoded with sufficient confidence. This implies the existence of a hypothesis verification procedure which is usually applied after the classification.

Verification strategies whose only goal is to improve reliability usually employ mechanisms that reject word hypotheses according to established thresholds. Pitrelli and Perrone compare several confidence scores for the verification of the output of an HMM-based online handwriting recognizer; the best rejection performance is achieved by an MLP classifier that combines seven different confidence measures. Gorski presents several confidence measures and a neural network to either accept or reject word hypothesis lists. Such a rejection mechanism is applied to the recognition of courtesy check amounts to find suitable error/rejection trade-offs. Gloger presented two different rejection mechanisms, one based on the relative frequencies of reject feature values and another based on a statistical model of normal distributions, to find the best trade-off between rejection and error rate for a handwritten word recognition system.
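The simplest rejection mechanism of this kind is a single confidence threshold. The sketch below, written under the assumption that a validation set provides (confidence, is_correct) pairs for the decoded words, sweeps the threshold and reports the resulting error/reject trade-off:

def error_reject_curve(results, thresholds):
    # results: list of (confidence, is_correct) pairs from a validation set.
    # Returns (threshold, reject_rate, error_rate_on_accepted) triples.
    curve = []
    for t in thresholds:
        accepted = [ok for conf, ok in results if conf >= t]
        reject_rate = 1.0 - len(accepted) / len(results)
        error_rate = (accepted.count(False) / len(accepted)) if accepted else 0.0
        curve.append((t, reject_rate, error_rate))
    return curve

# Example: higher thresholds reject more words but make fewer substitution errors.
validation = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.40, False)]
for t, rej, err in error_reject_curve(validation, [0.5, 0.8]):
    print(f"threshold={t:.1f}  reject={rej:.0%}  error={err:.0%}")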

IBM printed character recognition system

The recognition system consists of two main processing units: a character separator and an isolated character classifier. Character separation (frequently called segmentation) can work in two modes:

- fixed (constrained) spacing mode (where character size is known in advance and therefore segmentation can be very robust)

- variable (arbitrary) spacing (where no a priori information can be assumed)

Hence, our demo consists of three parts: recognition of isolated characters, work with constrained segmentation, and work with unconstrained segmentation. The recognition module receives as input an extracted and size-normalized image representing a character to be recognized. The module produces as output an ordered list of a few of the most probable classification candidates, together with their confidence values. The task is performed by matching the raster sample with template masks representing different characters. The masks are prepared in an offline training phase. A mask can be considered as a raster image containing three types of pixels: black, white, and undefined (gray). Initially, template masks are built per font. In a single-font set of masks, every character is represented by exactly one mask. In practice, the font of a character to be recognized is often unknown a priori. Hence, templates representing the most prevalent fonts are prepared and combined together, so an omnifont recognizer is used. An input image is correlated with all the masks stored in the recognizer. The mask which has the highest correlation score is taken to be the primary result of the recognition.
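A minimal sketch of such mask matching is given below; it assumes the normalized sample is a binary numpy array and that each mask marks pixels as black (1), white (0) or undefined (-1). The agreement count used here is only a stand-in for the actual correlation measure of the IBM system.

import numpy as np

def mask_score(sample, mask):
    # Fraction of defined (non-gray) mask pixels that agree with the binary sample.
    defined = mask != -1                      # gray pixels do not vote
    agree = (sample == mask) & defined
    return agree.sum() / max(defined.sum(), 1)

def classify(sample, masks):
    # masks: dict mapping a character label to its mask array.
    # Returns classification candidates sorted by confidence.
    scores = {label: mask_score(sample, m) for label, m in masks.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)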

Constrained Printing Recognition


In this case, character spacing is fixed. Hence, segmentation is possible even when the fields are distorted, as with credit card numbers impressed through a copy paper onto transaction vouchers.
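Under the fixed-pitch assumption, segmentation reduces to cutting the field image into equal-width cells; the sketch below takes the character count as given, which is exactly what the constrained case allows:

import numpy as np

def segment_fixed_pitch(field_image, n_chars):
    # Split a field image (2-D numpy array) into n_chars equal-width character cells.
    # No content analysis is needed, so this remains robust even for distorted glyphs.
    height, width = field_image.shape
    pitch = width // n_chars
    return [field_image[:, i * pitch:(i + 1) * pitch] for i in range(n_chars)]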

Unconstrained Character Recognition

In the following example, we can see the main steps of the recognition process. The input image used in the example was extracted from a fax cover sheet.

● Possible slant is estimated and compensated (in order to cope with italics and backslanted fonts)

● Top and bottom base lines are detected (The base lines are shown in red and blue colors on the following picture.)

● The whole image is divided into horizontally separated "words." (Two such words are detected in our example. They are separated by a vertical green line.)

● Each word is processed separately, and decomposed into connected components

● The connected components undergo further analysis. Some of them are decomposed into smaller parts (we call them atoms)

Thus, the problem of character separation is reduced to a problem of the correct partition of an ordered sequence of atoms. In other words, we need to combine the atoms into molecules. Of course, this can be done in a variety of ways. The choice is performed by using the recognition confidence values produced by the character classification kernel described above. All the molecules are recognized separately. The average value of the recognition probabilities obtained for the corresponding molecules provides an estimate of the confidence of the entire word. For this example:

Segmentation #1 - confidence = 0.83

Segmentation #2 - confidence = 0.79

Segmentation #3 - confidence = 0.71

Correct segmentation - confidence = 0.95

This process enables the successful recognition of broken and connected characters and of dot-matrix printing.
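The following toy sketch illustrates the molecule-building step: it enumerates every way of grouping consecutive atoms, recognizes each group with an assumed classify(image) function returning (label, confidence), and keeps the partition with the best average confidence. The real system prunes this search rather than enumerating it exhaustively.

def best_segmentation(atoms, classify, merge):
    # atoms: ordered list of atom images; merge(group) assembles a candidate character image;
    # classify(image) is an assumed function returning (label, confidence).
    best_labels, best_conf = [], 0.0

    def explore(start, labels, confidences):
        nonlocal best_labels, best_conf
        if start == len(atoms):
            avg = sum(confidences) / len(confidences) if confidences else 0.0
            if avg > best_conf:
                best_labels, best_conf = labels[:], avg
            return
        # Try every possible next molecule made of consecutive atoms.
        for end in range(start + 1, len(atoms) + 1):
            label, conf = classify(merge(atoms[start:end]))
            explore(end, labels + [label], confidences + [conf])

    explore(0, [], [])
    return best_labels, best_conf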

Advances are being made to recognize characters based on the context of the word in which they appear, as with the Predictive Optical Word Recognition algorithm from ScanSoft Inc. The next step for developers is document recognition, in which the software will use knowledge of the parts of speech and grammar to recognize individual characters.

Determining what the text in an image is can be a difficult task. Consider the process used in a language-independent OCR system described by researchers at BBN Technologies Inc. and GTE Internetworking. The top half of the diagram shows elements used in setting up and training the system and in using scanned data, as well as rules specific to the language and its orthography (the alphabet or other symbols).


OCR has been extensively used by libraries and governments to transfer information previously available only in classical media to the more practical, handy, easier-to-process electronic format.

With the advent of higher computer speeds and desktop personal computers, scanners are being developed for desk use for such things as high-speed entry of articles or other information through imaging devices. Not all of these utilize OCR recognition, but as the logic to handle intermixed fonts is developed, the number will increase. Over time, OCR will become more powerful and less expensive. Recognition will be done from captured images rather than from the actual item. There will continue to be a need for improvement in handwritten character recognition and for reducing the fairly stringent document requirements of today's systems.

While many applications today use direct data entry via keyboard, more and more of these will return to automated data entry. The reasons for this include the increased incidence of operator wrist problems from constant keying and the potential hazards of video display terminal emissions. Therefore, any application imaginable is a candidate for OCR.

Many applications exist where it would be desirable to read handwritten entries. Reading handwriting is a very difficult task considering the diversity that exists in ordinary penmanship. However, progress is being made. Early devices, using non-reading inks to define specifically sized character boxes, read constrained handwritten entries. This resulted in the development of a standard encouraging a certain style of handwriting. The best example of unconstrained handwriting reading was the IBM 3895. This device read the convenience amount entries from checks and then encoded the amount on the check in magnetic E13B characters. It is difficult to design a system to take care of misread characters. The 3895 also read the entries from deposit listings to confirm or to prevent substitutions. With the advent of image processing systems, this type of recognition is once again being developed. Restrictions on character size and the ability to provide target areas outlined in non-read inks will assist the accuracy of recognition.


OCR has never achieved a read rate that is 100% perfect. Because of this, a system which permits rapid and accurate correction of rejects is a major requirement. Exception item processing is always a problem because it delays the completion of the job entry, particularly the balancing function.

Of even greater concern is the problem of misreading a character (substitutions). In particular, if the system does not accurately balance dollar data, customer dissatisfaction will occur.

Through the years, the desire has been:

1. to increase the accuracy of reading, that is, to reduce rejects and substitutions;

2. to reduce the sensitivity of scanning to read less-controlled input;

3. to eliminate the need for specially designed fonts (characters); and

4. to read handwritten characters.

However, today's systems, while much more forgiving of printing quality and more accurate than earlier equipment, still work best when specially designed characters are used and attention to printing quality is maintained. Nevertheless, these limits are not objectionable for most applications, and the number of dedicated users of OCR systems grows each year. It should also be noted that the ability to read characters is not, by itself, sufficient to create a successful system.

Developing a robust OCR system is a complicated task and requires a lot of effort. Such systems are usually very complex and can hide a lot of logic behind the code. The use of an artificial neural network in OCR applications can dramatically simplify the code and improve the quality of recognition while achieving good performance. Another benefit of using a neural network in OCR is the extensibility of the system: the ability to recognize more character sets than initially defined. A task such as working with tens of thousands of Chinese characters, for example, is not as easy as working with the traditional Latin-based character sets.


IV Neural networks

Among the many pattern recognition methods, I chose to study the neural network approach.

A neural network is a wonderful tool that can help to solve OCR-type problems.

Of course, the selection of appropriate classifiers is essential. A neural network is a processing paradigm inspired by the way the human brain processes information. Neural networks are collections of mathematical models that represent some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning. The key element of a neural network is its topology. Unlike the original Perceptron model, shown by Minsky and Papert to have limited computational capability, the neural networks of today consist of a large number of highly interconnected processing elements (nodes) that are tied together with weighted connections (links). Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons; this is true for neural networks as well. Learning typically occurs by example through training, or exposure to a set of input/output data (patterns), where the training algorithm adjusts the link weights. The link weights store the knowledge necessary to solve specific problems. In recent years neural computing has emerged as a practical technology, with successful applications in many fields. The majority of these applications are concerned with problems in pattern recognition and make use of feed-forward network architectures such as the multi-layer perceptron and the radial basis function network. It has also become widely acknowledged that successful applications of neural computing require a principled, rather than ad hoc, approach. From the perspective of pattern recognition, neural networks can be regarded as an extension of the many conventional techniques which have been developed over several decades. Artificial neural networks (as opposed to the biological neural networks after which they were modeled) are made up of interconnected artificial neurons (programming constructs that mimic the properties of biological neurons, i.e. simple processing units). Artificial neural networks are used for solving artificial intelligence problems without necessarily creating a model of a real biological system; they just emulate the capacity of living systems to LEARN and ADAPT. Their main feature is their ability to learn from examples.
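As a minimal illustration of learning by example through adjustment of the link weights, the sketch below trains a single artificial neuron on the logical AND function with the classical perceptron (delta) rule; the data and learning rate are chosen purely for the example.

import numpy as np

# One artificial neuron learning the logical AND function: the weights (the "knowledge")
# are nudged after every example in proportion to the error.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 0, 0, 1], dtype=float)

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.1

for epoch in range(50):
    for x, t in zip(inputs, targets):
        output = 1.0 if x @ weights + bias > 0 else 0.0   # threshold activation
        error = t - output
        weights += learning_rate * error * x               # adjust link weights
        bias += learning_rate * error

print(weights, bias)   # after training, the neuron separates AND correctly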

In a network, each subgroup of neurons is treated independently of the others and transmits the results of its analysis to the following subgroups. The information spreads through the network layer by layer, from the input layer to the output layer, passing through one or more intermediate layers (called hidden layers). It should be noted that, depending on the learning algorithm, it is also possible to have a spread of information backwards (backpropagation). Usually (except for the input and output layers), each neuron in a layer is connected to all neurons in the previous layer and in the next layer.
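The layer-by-layer spread of information described above can be written down in a few lines. A sketch of a fully connected feed-forward pass (input, one hidden layer, output), with randomly initialized weights standing in for whatever the training algorithm would have produced:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # layers: list of (weight_matrix, bias_vector); each neuron of a layer is
    # connected to all neurons of the previous layer, as in the text.
    activation = x
    for W, b in layers:
        activation = sigmoid(W @ activation + b)   # propagate to the next layer
    return activation

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 64, 32, 26         # e.g. an 8x8 glyph and 26 letter classes
layers = [
    (rng.standard_normal((n_hidden, n_inputs)), np.zeros(n_hidden)),
    (rng.standard_normal((n_outputs, n_hidden)), np.zeros(n_outputs)),
]
print(forward(rng.random(n_inputs), layers).shape)  # (26,) class activations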

An ANN has the ability to store empirical knowledge and make it available for use. The processing skills (and thus the knowledge) of the network are stored in the synaptic weights, obtained by processes of adaptation or learning. In this sense, the ANN resembles the brain: not only is the knowledge acquired through learning, but, moreover, this knowledge is stored in the connections between the units, that is, in the synaptic weights.

HISTORY

The neurophysiologist Warren McCulloch and the logician Walter Pitts led the early work on neural networks. They formulated a simplified model of the biological neuron, called the formal neuron. They also showed theoretically that networks of formal neurons can perform simple logic, arithmetic and symbolic functions.

The function of formal neural networks, like that of the biological model, is to solve problems. Unlike with traditional computational methods, we do not need to build a program step by step. The most important parameters of this model are the synaptic coefficients: they are what build the resolution model from the information given to the network. We must therefore find a mechanism to compute them from the data we can acquire for the problem at hand. This is the fundamental principle of learning: in a model of formal neural networks, learning consists in first computing the values of the synaptic coefficients from the available examples.

The work of McCulloch and Pitts gave no indication of a method for adapting the synaptic coefficients. This issue, which is at the heart of the thinking about learning, received an initial answer in 1949 through the work of the Canadian physiologist Donald Hebb, described in his book The Organization of Behavior. Hebb proposed a simple rule that allows the value of the synaptic coefficients to be changed depending on the activity of the units they connect. This rule, now known as the "Hebb rule", is almost always present, in one form or another, in current models, even the most sophisticated ones.

From this work, the idea was sown over time in people's minds, and it germinated in the mind of Frank Rosenblatt in 1957 with the model of the perceptron. This was the first artificial system capable of learning from experience, even when its teacher makes a few errors.

In 1969, a serious blow was dealt to the scientific community around neural networks: Marvin Minsky and Seymour Papert published a book highlighting some theoretical limitations of the perceptron, notably the inability to handle problems that are nonlinear or related to connectedness. They extended these limitations implicitly to all models of artificial neural networks. Research on neural networks then appeared to be at an impasse and lost much of its public funding, and industry turned away from it as well. The funds for artificial intelligence were redirected instead to formal logic, and research stalled for about ten years. However, the solid qualities of certain neural networks in adaptive matters (e.g. Adaline), which allow them to model phenomena that are themselves evolving, led them to be integrated, in more or less explicit forms, into the corpus of adaptive systems used in telecommunications or in industrial process control.

In 1982, John Joseph Hopfield, a recognized physicist, gave new impetus to neural networks with an article introducing a new model of neural network (fully recurrent). This article was a success for several reasons, the main one being that it imbued the theory of neural networks with the rigor characteristic of physicists. Neural networks again became an acceptable subject of study, although the Hopfield model suffered from the major limitations of the models of the 1960s, notably the inability to treat nonlinear problems.

At the same time, algorithmic approaches to artificial intelligence were the subject of disillusionment, as their applications did not meet expectations. This disillusionment motivated a reorientation of research in artificial intelligence towards neural networks (although these relate to artificial perception rather than to artificial intelligence strictly speaking). The research was restarted and industry took some interest in neural networks (especially for applications such as cruise missile guidance). In 1984, the concept of backpropagation of the error gradient was introduced.

A revolution then occurred in the field of artificial neural networks: a new generation of networks, capable of successfully treating nonlinear phenomena, appeared. The multilayer perceptron does not suffer from the defects highlighted by Marvin Minsky. First proposed by Werbos, the multi-layer perceptron appeared in 1986, introduced by Rumelhart and, simultaneously, in a closely related form, by Yann Le Cun. These systems rely on backpropagating the gradient of the error through several layers.

Neural networks subsequently grew significantly and were among the first systems to benefit from the theory of statistical regularization introduced by Vladimir Vapnik in the Soviet Union and popularized in the West after the fall of the Berlin Wall. This theory, one of the most important in the field of statistics, makes it possible to anticipate, regulate and investigate the phenomena related to overfitting. It allows a learning system to be regulated so as to arbitrate between a model that is too poor (e.g. the mean) and a model that is too rich, which would be unrealistically optimized on too small a number of examples and would perform poorly on examples not yet learned, even on examples close to those already learned. Overfitting is a challenge faced by every system that learns by example, whether it uses direct optimization methods (e.g. linear regression), iterative methods (e.g. gradient descent) or semi-direct iterative methods (conjugate gradient, expectation-maximization), and whether it is applied to conventional statistical models, hidden Markov models or networks of formal neurons.

Summary

As we have seen, the starting point in the history of neural network science can be placed in the early 1940s, when Warren McCulloch and Walter Pitts proposed the first formal model of the neuron (1943), emphasizing its computing capabilities and the possibility of imitating its operating mode with electronic circuits.

In 1949, Hebb, based on Pavlov's research, stated the synaptic permeability adaptation principle, according to which every time a synaptic connection is used, its permeability increases. Upon this principle is founded the adaptation of the synaptic weights.

In 1957, Rosenblatt developed a hardware-implemented network, called the perceptron, to recognize printed characters.

1950-1960 – Widrow and Hoff developed algorithms based on minimizing the error over the training set for one-level networks.

The year 1969 is considered "the beginning of the end" for neural networks: Minsky and Papert published "Perceptrons", in which they exposed the limitations of one-level networks. Combined with the limitations of the technology available at the time, this caused the interest in, and the activity of, scientists involved in neural networks to decrease dramatically.
