
Artificial intelligence and machine learning for edge computing

"Artificial Intelligence and Machine Learning for Predictive and Analytical Rendering in Edge Computing focuses on the role of AI and machine learning as it impacts and works alongside edge computing. Sections cover the growing number of devices and applications in diversified domains of industry, including gaming, speech recognition, medical diagnostics, robotics, and computer vision, and how they are being driven by Big Data, artificial intelligence, machine learning, and distributed computing, whether cloud computing or the evolving fog and edge computing paradigms. Challenges covered include remote storage and computing; bandwidth overload due to transportation of data from end nodes to the cloud, leading to latency issues; security issues in transporting sensitive medical and financial information across larger gaps between points of data generation and computing; and design features of edge nodes to store and run AI/ML algorithms for effective rendering."

Table of Contents

Cover
Title page
Copyright
Contributors
Preface

Part I: AI and machine learning

Chapter 1: Supervised learning
Abstract
1: Introduction
2: Perceptron
3: Linear regression
4: Logistic regression
5: Multilayer perceptron
6: KL divergence
7: Generalized linear models
8: Kernel method
9: Nonlinear SVM classifier
10: Tree ensembles

Chapter 2: Supervised learning: From theory to applications
Abstract

Chapter 3: Unsupervised learning
Abstract
1: Introduction
2: k-means clustering
3: k-means++ clustering
4: Sequential leader clustering
5: EM algorithm
6: Gaussian mixture model
7: Autoencoders
8: Principal component analysis
9: Linear discriminant analysis
10: Independent component analysis
References

Chapter 4: Regression analysis
Abstract
1: Introduction
2: Linear regression
3: Cost functions
4: Gradient descent
5: Polynomial regression
6: Regularization
7: Evaluating a machine learning model
References

Chapter 5: The integrity of machine learning algorithms against software defect prediction
Abstract
1: Introduction
2: Related works
3: Proposed method
4: Experiment
5: Results
6: Threats to validity
7: Conclusions
References

Chapter 6: Learning in sequential decision-making under uncertainty
Abstract
Acknowledgments
1: Introduction
2: Multiarmed bandit problem
3: Markov decision process planning problem
4: Reinforcement learning
5: Summary
References

Chapter 7: Geospatial crime analysis and forecasting with machine learning techniques
Abstract
1: Introduction
2: Related work
3: Methodology
4: Results and discussion
5: Conclusions

4: Truth content discovery algorithm
5: Trustworthy and scalable service providers algorithm
6: Efficient feature extraction and classification (EFEC) algorithm
7: QUERY retrieval time (QRT)
8: Trust content discovery and trustworthy and scalable service providers algorithm
9: Efficient feature extraction and classification (EFEC) algorithm and customer review datasets
10: Summary
11: Conclusions
12: Future enhancements
References

Chapter 9: Reliable diabetes mellitus forecasting using artificial neural network multilayer perceptron
Abstract
1: Introduction
2: Related works
3: Methodology
4: Building the diabetic diagnostic criteria
5: Evaluating the diabetes outcomes using classification algorithms
6: Conclusions

Chapter 10: A study of deep learning approach for the classification of electroencephalogram (EEG) brain signals
Abstract
1: Introduction
2: Methods
3: Results
4: Discussion
5: Conclusions
References

Chapter 11: Integrating AI in e-procurement of hospitality industry in the UAE
Abstract
1: Introduction
2: Problem statement
3: Authors' contributions
4: Significance of the study
5: Theoretical framework
6: Research aims and objectives
7: Literature review
8: Major findings
9: Discussions
10: Major gaps in the study
11: Conclusions

Chapter 12: Application of artificial intelligence and machine learning in blockchain technology
Abstract
Acknowledgment
1: Introduction
2: Applications of artificial intelligence, machine learning, and blockchain technology
3: It takes two to tango: Future of artificial intelligence and machine learning in blockchain technology
4: Edge computing: A potential use case of blockchain
5: Conclusions
References

Part II: Data science and predictive analysis

Chapter 13: Implementing convolutional neural network model for prediction in medical imaging
Abstract
1: Introduction
2: Convolutional neural networks
3: Implementing CNN for biomedical imaging and analysis
4: Architecture models for different image type
5: Conclusion
6: Future scope
References

Chapter 14: Fuzzy-machine learning models for the prediction of fire outbreaks: A comparative analysis
Abstract
1: Introduction
2: Related literature
3: Research methodology
4: Machine learning algorithms for fire outbreak prediction
5: Result and discussion
6: Conclusions
References

Chapter 15: Vehicle telematics: An Internet of Things and Big Data approach
Abstract
1: Introduction

3: Contribution of intelligent e-learning system using BN model
4: Learner assessment model
5: Results and discussions
6: Conclusions and future work
References

Chapter 17: Ensemble method for multiclassification of COVID-19 virus using spatial and frequency domain features over X-ray images
Abstract
1: Introduction
2: Literature review
3: Proposed methodology
4: Result analysis
5: Discussion and conclusions
References

Chapter 18: Chronological text similarity with pretrained embedding and edit distance
Abstract
1: Introduction
2: Literature review
3: Theoretical background
4: Modeling
5: Experimental settings
6: Results and discussion
7: Conclusions

5: Conclusions
References

Chapter 20: A real-time performance monitoring model for processing of IoT and big data using machine learning
Abstract
1: Introduction
2: Experimental study
3: Major findings
4: Conclusions
References

Chapter 21: COVID-19 prediction from chest X-ray images using deep convolutional neural network
Abstract
1: Introduction
2: Methodology
3: Results and discussions
4: Conclusions
References
Further reading

Chapter 22: Hybrid deep learning neuro-fuzzy networks for industrial parameters estimation
Abstract
1: Introduction
2: Preliminaries
3: Methodology
4: Results and discussion
5: Validation of model
6: Discussions on performance evaluation
7: Conclusions
8: Future scope
References

Chapter 23: An intelligent framework to assess core competency using the level prediction model (LPM)
Abstract
1: Introduction
2: Related work
3: Existing applications
4: Proposed system
5: Experimental
6: Conclusions
References

Part III: Edge computing

Chapter 24: Edge computing: A soul to Internet of things (IoT) data
Abstract
1: Introduction
2: Edge computing characteristics
3: New challenges in Internet of technology (IoT): Edge computing
4: Edge computing support to IoT functionality
5: IoT applications: Cloud or edge computing?
6: Benefits and potential of edge computing for IoT
7: Use case: Edge computing in IoT
8: Pertinent open issues which require additional investigations for edge computing
9: Conclusions

Chapter 25: 5G: The next-generation technology for edge communication
Abstract
1: Introduction
2: History
3: 5G technology
4: 5G cellular network
5: Components used in 5G technology/network
6: Differences from 4G architecture
7: Security of 5G architecture
8: 5G time period
9: Case study on 5G technology
10: 5G advancement
11: Advantage and disadvantage of 5G technology
12: Challenges
13: Future scope
14: Conclusions
References

Chapter 26: Challenges and opportunities in edge computing architecture using machine learning approaches
Abstract
1: Introduction
2: Overview of edge computing
3: Security and privacy in edge computing
4: Intersection of machine learning and edge using enabling technologies
5: Machine learning and edge bringing AI to IoT

2: Hybrid software-defined networks
3: Security challenges in hybrid software-defined networks
4: Solutions for hybrid software-defined networks
5: Learning techniques for hybrid software-defined networks
6: Discussion and implementation
7: Conclusions
References
Further reading

Chapter 28: Moving to the cloud, fog, and edge computing paradigms: Convergences and future research direction
Abstract
1: Introduction
2: Features and differences between cloud, fog, and edge computing
3: Framework and programming models: Architecture of fog computing
4: Moving cloud to edge computing
5: Case study: Edge computing for intelligent aquaculture
6: Conclusions

Chapter 30: AI cardiologist at the edge: A use case of a dew computing heart monitoring solution
Abstract
1: Introduction
2: Related work
3: Architectural approach
4: ECGalert use case
5: Discussion
6: Conclusions
References

Part I

AI and machine learning

Chapter 1: Supervised learning

Kanishka Tyagi (a); Chinmay Rane (b); Michael Manry (c)

(a) Aptiv Advanced Research Center, Agoura Hills,
(c) Department of Electrical Engineering, The University of Texas at Arlington, Arlington, TX, United States

Machine learning models learn different tasks with different paradigms that effectively aim to make the models better through training. Supervised learning is a common machine learning training paradigm that has been used successfully in real-world machine learning applications. Typical supervised learning involves two phases. In phase 1 (commonly called training), we provide inputs and expected outputs (also known as ground truth) and train a model with respect to a metric using an optimization algorithm. In phase 2 (commonly called testing), we deploy the model on unseen data and expect it to either classify or approximate the outputs. Although supervised learning is covered in almost all machine learning textbooks, we will introduce and explain supervised learning from an application point of view and its relationship to edge computing. All the algorithms will be explained from a mathematical and theoretical point of view and from a programmer's perspective. This will help readers gain hands-on experience in implementing algorithms for a variety of problems.

Linear regression; Logistic regression; Steepest descent; Conjugate gradient; Multilayer perceptron; Second-order algorithms; KL divergence; Generalized linear models; Kernel machines; Bootstrapping

1: Introduction

This chapter discusses the supervised learning paradigms used to train and deploy machine learning models. While the theory of supervised learning is dominated mainly by the computer vision and natural language processing communities, much research is still required on its application to edge computing. With the recent advances in machine learning algorithms, combined with increasing computational power, edge computing is another application area where machine learning will improve current technology. According to a white paper from Cisco [1], 50 billion Internet of Things (IoT) devices will be connected to the internet by the end of 2020, and although an estimated 850 ZB of data will be generated per year outside the cloud by 2021, global data center traffic is only approximately 21 ZB. This means that a transformation from big cloud data centers to a wide range of edge devices is happening.

We plan to cover the basics of traditional supervised learning algorithms and the tricks of the trade for implementing them in various real-life applications. The assumptions and programming decisions made while implementing them are also discussed. We start with generalized linear models along with the optimization algorithms (gradient descent, Newton's method) and metrics (mean square error [MSE], cross-entropy) used in training the model. Then, Naive Bayes and kernel methods are discussed along with the decision tree and its variants. We provide pseudocode for all the supervised learning algorithms that we discuss to give readers an idea of how the algorithms are implemented from theory to practice. In addition, supervised learning algorithms that capture the variation of the input data distribution, algorithms that project data into a higher-dimensional subspace, and first- and second-order learning paradigms are the subject matter of this chapter. The chapter covers the learning paradigms in sync with their modern-day application to edge computing from an implementation perspective.

Edge computing and machine learning will enable devices to have a more pervasive and fine-grained intelligence. Supervised learning is a common way of training machine learning algorithms that can later be used in the inference stage on IoT and other edge devices.

This chapter covers several supervised learning algorithms that are commonly used in various real-life applications. Section 2 covers the basic perceptron model, which helps in understanding the learning algorithm. In Section 3, the linear regression algorithm is discussed; it extends the idea of the perceptron algorithm to solve supervised regression tasks. We discuss commonly used gradient-based learning algorithms along with their pseudocode. Section 4 discusses logistic regression, which is a very commonly used classifier due to its simple structure and generalizability. In Section 5, we bring the concepts from the previous sections together to discuss the multilayer perceptron (MLP) network, a widely used neural network in real-life applications. We explain its structure and initialization strategy and detail various learning algorithms that are used to train an MLP. Section 6 discusses the Kullback-Leibler (KL) divergence, a measure of how two probability distributions differ from each other. Section 7 discusses generalized linear models, which extend the linear models with a nonlinear activation on the output nodes. Section 8 covers kernel methods, leading to Section 9, which explains nonlinear support vector machine (SVM) classifiers. We conclude this chapter in Section 10 by discussing various tree ensemble algorithms.

2: Perceptron

Perceptron algorithms emerged in the early 1960s; however, Minsky and Papert [2] showed the restrictions they have. Rosenblatt [3] described an alpha perceptron, which is an example of a statistical pattern recognition (SPR) system. In a typical SPR system, the features are obtained from the raw input using no learning mechanism but instead a common-sense rule. Therefore, a human decides what a good feature is and sees if it works; if it does not work, the human tries another. We learn the weight associated with each feature activation to get a single scalar quantity and then decide the output based on whether this quantity is above or below a threshold. In a typical perceptron model, the input vector x is used to compute a weighted sum over all the neurons, to which a bias vector is added. This bias vector is also known as the threshold vector in the literature.

(1)

The linear perceptron will output y depending on the following rule, as shown in Fig. 1.

  (2)

The perceptron algorithm is fast and straightforward, and if the dataset is linearly separable, it is guaranteed to converge.

FIG 1 Perceptron activation functions.
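The perceptron rule described in this section can be sketched in a few lines. This is a minimal illustration rather than the chapter's own listing: the 0/1 output coding, the strict `> 0` threshold, the learning rate `lr`, and the function names `perceptron_output` and `train_perceptron` are all assumptions made for the example.

```python
import numpy as np

def perceptron_output(w, b, x):
    """Threshold the weighted sum w.x + b (assumed 0/1 output coding)."""
    return 1 if np.dot(w, x) + b > 0 else 0

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Classic perceptron rule; stops early once every pattern is classified correctly."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, ti in zip(X, y):
            err = ti - perceptron_output(w, b, xi)
            if err != 0:
                w += lr * err * xi   # move the weights toward the misclassified pattern
                b += lr * err        # the bias (threshold) is updated the same way
                errors += 1
        if errors == 0:
            break
    return w, b

# Linearly separable toy data: logical AND
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
predictions = [perceptron_output(w, b, xi) for xi in X]
```

Because the AND data are linearly separable, the loop terminates with all four patterns classified correctly, which is the convergence guarantee mentioned above.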

3: Linear regression

In this section, we discuss the structure and notation of linear regression. As shown in Fig. 2, a linear regression is a weight matrix W that transforms an input vector x into a discriminant vector y [4]. The weight w(m, n) connects the nth input to the mth output. The training dataset {xp, tp} consists of N-dimensional input vectors xp and M-dimensional desired output vectors tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training patterns. The threshold is handled by augmenting x with an extra element equal to 1, as $\mathbf{x}_a = [1 : \mathbf{x}^T]^T$. So xa contains Nu basis functions, where Nu = N + 1. For the pth training pattern, the network output vector yp can be written as

$$\mathbf{y}_p = \mathbf{W}\,\mathbf{x}_{ap}$$

where xap denotes xa for the pth pattern.

FIG 2 Linear regression.

3.1: Training a linear regression

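To train the linear regression, we minimize the error function E, which is a surrogate for a nonsmooth classification error. As in Ref. [5], from a Bayesian point of view, we consider maximizing the likelihood function or minimizing the MSE in the least-squares sense, where the MSE between the desired and actual outputs is

$$E = \frac{1}{N_v}\sum_{p=1}^{N_v}\sum_{m=1}^{M}\left[t_p(m) - y_p(m)\right]^2$$

Here, tp(m) denotes the target output and yp(m) the actual output for the mth output of the pth pattern, and M denotes the total number of outputs. We minimize the error function from Eq. (9) with respect to W by solving the M sets of N + 1 linear equations given by

$$\mathbf{C} = \mathbf{R}\,\mathbf{W}^T \quad (5)$$

where the cross-correlation matrix C and the auto-correlation matrix R are, respectively,

$$\mathbf{C} = \frac{1}{N_v}\sum_{p=1}^{N_v}\mathbf{x}_{ap}\,\mathbf{t}_p^T, \qquad \mathbf{R} = \frac{1}{N_v}\sum_{p=1}^{N_v}\mathbf{x}_{ap}\,\mathbf{x}_{ap}^T$$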

Since R in Eq. (35) is often ill-conditioned, it is unsafe to use Gauss-Jordan elimination. Eq. (35) is solved using orthogonal least squares (OLS) [6]. In Ref. [7], OLS is used to solve for radial basis function network parameters. OLS is useful for practical applications for two primary reasons. First, the training is fast since solving linear equations is straightforward. Second, it helps us to avoid some local minima [8]. In terms of optimization theory, solving Eq. (35) for W is merely Newton's algorithm for the output weights [9].

Given

and the error function

  (15)
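The training procedure of this section, building the auto-correlation and cross-correlation matrices and solving the resulting linear equations for W, can be sketched as follows. This is a hedged illustration: the toy data, the names `W_true`, `Xa`, and `T`, and the use of NumPy's SVD-based `np.linalg.lstsq` as a stand-in for orthogonal least squares are assumptions, not the chapter's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
Nv, N, M = 200, 3, 2                      # patterns, inputs, outputs (toy sizes)
X = rng.normal(size=(Nv, N))
W_true = rng.normal(size=(M, N + 1))      # hypothetical weights, column 0 is the threshold

Xa = np.hstack([np.ones((Nv, 1)), X])     # augmented inputs x_a = [1 : x^T]^T, one per row
T = Xa @ W_true.T                         # noise-free desired outputs t_p

R = Xa.T @ Xa / Nv                        # auto-correlation matrix, (N+1) x (N+1)
C = Xa.T @ T / Nv                         # cross-correlation matrix, (N+1) x M

# Solve R W^T = C. lstsq (assumed substitute for OLS) tolerates an ill-conditioned R.
W_T, *_ = np.linalg.lstsq(R, C, rcond=None)
W = W_T.T

mse = np.mean(np.sum((T - Xa @ W.T) ** 2, axis=1))
```

With noise-free targets the recovered `W` matches `W_true` up to rounding, which is a convenient sanity check before moving to real data.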


3.2: Steepest descent

The steepest descent gradient algorithm is summarized in Algorithm 1.

Algorithm 1 Steepest descent gradient algorithm.

1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:     Calculate g from Eq. (12)
4:     Compute B2 from Eq. (15)
5:     Update w as w ← w + B2 ⋅ g
6:     it ← it + 1
7: end while

Fig. 3 illustrates gradient descent using a two-dimensional (2D) contour plot. The X-axis denotes the first weight w1 and the Y-axis the second weight w2. The arrows [g0, g1, g2, g3, g4, g5] in Fig. 3 show the directions of the negative gradients followed by the weights to reach a minimum. The learning factor derived in Eq. (15) controls the step size. We can observe that the gradients get smaller as they approach the minimum.

FIG 3 Steepest descent 2D contour plot.

From a programmer's perspective, some of the debugging steps are:

1. E is nonincreasing
2. Eg approaches 0
3. B2 ≥ 0
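The loop of Algorithm 1 and the debugging checks listed above can be sketched for a single-output linear model. This is an illustrative sketch under stated assumptions: the learning factor is taken to be the exact line-search step for a quadratic error, B2 = gᵀg / (gᵀHg), which may differ from the chapter's Eq. (15), and the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
Nv, N = 100, 2
Xa = np.hstack([np.ones((Nv, 1)), rng.normal(size=(Nv, N))])
t = Xa @ np.array([0.5, -1.0, 2.0]) + 0.01 * rng.normal(size=Nv)

def error(w):
    """Mean squared error E for weight vector w."""
    return np.mean((t - Xa @ w) ** 2)

H = 2.0 * Xa.T @ Xa / Nv                 # Hessian of E (constant for a linear model)
w = np.zeros(N + 1)
errors, steps = [error(w)], []
for it in range(50):
    g = 2.0 * Xa.T @ (t - Xa @ w) / Nv   # negative gradient of E
    denom = g @ H @ g
    if denom <= 1e-20:                   # gradient has vanished, nothing left to do
        break
    B2 = (g @ g) / denom                 # assumed learning factor: exact line search
    w = w + B2 * g
    steps.append(B2)
    errors.append(error(w))

grad_energy = float(g @ g)               # Eg, the gradient energy at the last iteration
```

The recorded `errors`, `steps`, and `grad_energy` correspond directly to the three checks: E should never increase, B2 should stay nonnegative, and Eg should shrink toward 0.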


3.3: Conjugate gradient

As we saw in the previous section, the weights are updated in the negative gradient direction in a basic gradient algorithm. Although the error function decreases most rapidly along the negative gradient direction, this does not necessarily produce fast convergence. The conjugate gradient (CG) algorithm [10] performs a line search in the conjugate direction and converges faster than the backpropagation (BP) algorithm. Although CG is a general unconstrained optimization technique, its use in efficiently training an MLP is well documented in Ref. [11].

To train an MLP using the CG algorithm, we use a direction vector that is obtained from the gradient as

$$\mathbf{p} \leftarrow \mathbf{g} + B_1\,\mathbf{p}$$

Here, p = vec(P, Poh, Poi), and P, Poi, and Poh are the direction vectors. B1 is the ratio of the gradient energies from two consecutive iterations. This direction vector, in turn, updates all the weights simultaneously as follows:

$$\mathbf{w} \leftarrow \mathbf{w} + z\,\mathbf{p}$$

Algorithm 2 Conjugate gradient algorithm.

1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:     Calculate p from g
4:     Compute z from Eq. (15)
5:     Update w as w ← w + z ⋅ p
6:     it ← it + 1
7: end while

Fig. 4 illustrates CG using a 2D contour plot. The X-axis denotes the first weight w1 and the Y-axis the second weight w2. From the plot, we can observe that CG needs N + 1 steps to reach the minimum.

FIG 4 Conjugate gradient 2D contour plot.
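The conjugate gradient iteration can be sketched on a quadratic error surface, where it reaches the minimum in at most as many steps as there are weights. The sketch below assumes the Fletcher-Reeves form of B1 (ratio of consecutive gradient energies) and an exact line search for z; the data and names are hypothetical and the chapter's own expressions may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(2)
Nv, N = 100, 3
Xa = np.hstack([np.ones((Nv, 1)), rng.normal(size=(Nv, N))])
w_target = np.array([1.0, -2.0, 0.5, 3.0])     # hypothetical weights that generated t
t = Xa @ w_target

H = 2.0 * Xa.T @ Xa / Nv                       # Hessian of the MSE

def neg_gradient(w):
    """Negative gradient g of the mean squared error."""
    return 2.0 * Xa.T @ (t - Xa @ w) / Nv

w = np.zeros(N + 1)
g = neg_gradient(w)
p = g.copy()                                   # first direction is the negative gradient
for it in range(N + 1):                        # N + 1 weights, so N + 1 steps
    z = (g @ p) / (p @ H @ p)                  # exact line search along p
    w = w + z * p
    g_new = neg_gradient(w)
    B1 = (g_new @ g_new) / (g @ g)             # ratio of consecutive gradient energies
    p = g_new + B1 * p                         # new conjugate direction
    g = g_new

final_error = np.mean((t - Xa @ w) ** 2)
```

After the N + 1 iterations, `w` agrees with `w_target` to within rounding, consistent with the step count read off the contour plot.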

From a programmer’s perspective, some of the debugging steps are as follows:

4: Logistic regression

For a two-class classification problem, the classes C1 and C2 can be denoted as y ∈ {0, 1}. Linear regression does not normally work in classification since, even if y = 0 or 1, the predicted output can be greater than 1 or less than 0. Therefore, we use logistic regression, in which the predicted output $\sigma(\mathbf{w}^T\mathbf{x})$ lies between 0 and 1. It should be emphasized that this is still a classification method rather than a regression method. Here, σ(⋅) is the logistic function defined as

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

When $\mathbf{w}^T\mathbf{x}$ tends to ∞, $\sigma(\mathbf{w}^T\mathbf{x})$ tends to 1, and when $\mathbf{w}^T\mathbf{x}$ tends to −∞, $\sigma(\mathbf{w}^T\mathbf{x})$ tends to 0.

The decision boundary for logistic regression is a property of the parameters and not of the training set. So, if there is a nonlinear decision boundary, it is mapped according to the nonlinear activation function used in logistic regression.

Since σ(⋅) is nonlinear (sigmoid), the decision boundary can be learned as manifold learning of nonlinear surfaces. In order to train the logistic regression, the cost function J(w) should preferably be convex so that only one global minimum is present. We would not use the MSE as in linear regression since, with a sigmoid output, it is a nonconvex function of the weights with multiple local minima. However, there are ways to train a logistic regression with the MSE [12, 13].

The logistic regression cost function will be as follows:


Algorithm 3 Gradient descent algorithm.

1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:     Calculate g
4:     Update w as w ← w + z ⋅ g
5:     it ← it + 1
6: end while
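Algorithm 3 applied to logistic regression can be sketched as below. The sketch assumes the standard cross-entropy cost for J(w), a fixed learning factor `z = 0.5`, and synthetic two-class data; these choices and the helper names `sigmoid` and `cross_entropy` are illustrative assumptions rather than the chapter's exact formulation.

```python
import numpy as np

def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Xa, y):
    """Assumed cost J(w): mean negative log likelihood of Bernoulli targets."""
    p = np.clip(sigmoid(Xa @ w), 1e-12, 1 - 1e-12)   # clip to keep log() finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(3)
Nv = 200
X = rng.normal(size=(Nv, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)            # two classes coded as 0 and 1
Xa = np.hstack([np.ones((Nv, 1)), X])                # augmented inputs

w = np.zeros(3)
z = 0.5                                              # fixed learning factor (assumption)
costs = [cross_entropy(w, Xa, y)]
for it in range(200):
    g = Xa.T @ (y - sigmoid(Xa @ w)) / Nv            # negative gradient of J
    w = w + z * g
    costs.append(cross_entropy(w, Xa, y))

accuracy = np.mean((sigmoid(Xa @ w) > 0.5) == (y == 1))
```

Because the cross-entropy cost is convex in w, the recorded `costs` decrease steadily for a sufficiently small learning factor, which is the motivation given above for preferring it over the MSE.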


5: Multilayer perceptron

In this section, we describe the multilayer perceptron (MLP), a nonlinear signal processor with good approximation and classification properties. The MLP has basis functions that can adapt during the training process by utilizing example inputs and desired outputs. An MLP will minimize an error criterion and closely mimic an optimal processor in which the computational burden in processing an input vector is controlled by slowly varying the number of coefficients [16, 17]. We review the first- and second-order training algorithms for the MLP, followed by a classifier design of the MLP through regression.

5.1: Structure and notation

Fig. 5 illustrates a fully connected MLP with a single hidden layer. The input weights w(k, n) connect the nth input to the kth hidden unit. Output weights woh(m, k) connect the kth hidden unit's nonlinear activation Op(k) to the mth output yp(m), which has a linear activation. The bypass weights woi(m, n) connect the nth input to the mth output. The training data, described by the set of independent, identically distributed input-output pairs {xp, tp}, consist of N-dimensional input vectors xp and M-dimensional desired output vectors tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training vectors present in the dataset. Let Nh denote the number of hidden units. In order to handle the thresholds in the input layer, the input vector is augmented by an extra element xp(N + 1), where xp(N + 1) = 1. For each training pattern p, the hidden layer net function vector np can be written as

$$\mathbf{n}_p = \mathbf{W}\,\mathbf{x}_p$$

The kth element of the hidden unit activation vector Op is calculated as Op(k) = f(np(k)), where f(⋅) denotes the sigmoid activation function. The network output vector yp can be written as

$$\mathbf{y}_p = \mathbf{W}_{oi}\,\mathbf{x}_p + \mathbf{W}_{oh}\,\mathbf{O}_p$$

The expression for the actual outputs given in Eq. (32) can be rewritten as

$$\mathbf{y}_p = \mathbf{W}_o\,\mathbf{X}_a$$

where $\mathbf{X}_a = [\mathbf{x}_p^T : \mathbf{O}_p^T]^T$ is the augmented input column vector with Nu basis functions, where Nu = 1 + N + Nh. The total number of weights is Nw = Nh ⋅ (1 + N) + M ⋅ Nu. Similarly, Wo is the M by Nu augmented weight matrix defined as $\mathbf{W}_o = [\mathbf{W}_{oi} : \mathbf{W}_{oh}]$.


FIG 5 Fully connected MLP.
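The forward pass through the fully connected MLP, including the bypass weights, can be sketched as follows. The function name `mlp_forward`, the random weights, and the toy sizes are assumptions for illustration; the ordering of the augmented vectors follows the notation of this section, with the threshold input as the last element of the augmented input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W, Woh, Woi):
    """x: (N,) input. W: (Nh, N+1), Woh: (M, Nh), Woi: (M, N+1)."""
    xa = np.append(x, 1.0)          # augmented input, x_p(N+1) = 1
    n = W @ xa                      # hidden net function n_p
    O = sigmoid(n)                  # hidden activations O_p(k) = f(n_p(k))
    y = Woi @ xa + Woh @ O          # linear outputs, bypass plus hidden contribution
    return y, O, xa

rng = np.random.default_rng(4)
N, Nh, M = 3, 4, 2
W = rng.normal(size=(Nh, N + 1))
Woh = rng.normal(size=(M, Nh))
Woi = rng.normal(size=(M, N + 1))
x = rng.normal(size=N)

y, O, xa = mlp_forward(x, W, Woh, Woi)

# The same output through the augmented form y_p = W_o X_a
Xa = np.concatenate([xa, O])        # Nu = 1 + N + Nh basis functions
Wo = np.hstack([Woi, Woh])          # column blocks ordered to match Xa
y_aug = Wo @ Xa
```

Both routes give the same output vector, which makes the augmented form convenient when solving for all output weights at once.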

To train an MLP, we recast the MLP learning problem as an optimization problem and use a structural risk minimization framework to design the learning algorithm [5, 18]. Essentially, this framework is used to minimize the error function E, as in Eq. (9), which is a surrogate for a nonsmooth classification error. As in Ref. [5], from a Bayesian point of view, we consider maximizing the likelihood function or minimizing the MSE in a least-squares sense. Therefore, the MSE between the desired and actual outputs is defined as

(34)

Here, λ is an L2 regularization parameter used to avoid memorization and overfitting. The nonlinearity in yp causes the error E to be nonconvex, and so in practice, local minima of the error function may be found. In the earlier discussion, we have assumed that tp has a Gaussian distribution given the input xp. If the conditional distribution of the targets given the input is instead Bernoulli, the error function, which is given by the negative log likelihood, is then a cross-entropy error function [5].

In Ref. [5], it is concluded that using a cross-entropy error function instead of the MSE for a classification problem leads to faster training as well as improved generalization. Apart from the cross-entropy and L2 error forms, we also have an L1 error measure. Golik et al. [19] and Simard et al. [20] provide a good comparison between the L2 and cross-entropy error functions and suggest using the cross-entropy error function for classification in order to have faster training and improved generalization. Our goal is to obtain optimal values of the weights in an MLP. In order to achieve this, we use the empirical risk minimization framework [17] to design the learning algorithms. An essential benefit of converting the training of an MLP into an optimization problem is that we can now use various optimization algorithms to optimize the learning of an MLP.

5.2: Initialization

5.2.1: Input means and standard deviations

If some inputs have much larger standard deviations than others, they can dominate the training, even if they are relatively useless. Inputs are therefore normalized to zero mean and unit standard deviation.
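The input normalization step can be sketched as below. The helper name `normalize_inputs`, the guard against constant inputs, and the synthetic data with deliberately mismatched scales are assumptions made for the example.

```python
import numpy as np

def normalize_inputs(X, eps=1e-12):
    """Shift each input (column) to zero mean and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std < eps, 1.0, std)   # leave constant inputs unscaled
    return (X - mean) / std, mean, std

rng = np.random.default_rng(5)
scales = np.array([1.0, 100.0, 0.01])     # inputs with very different spreads
offsets = np.array([0.0, 50.0, -3.0])
X = rng.normal(size=(500, 3)) * scales + offsets

Xn, mean, std = normalize_inputs(X)
```

The returned `mean` and `std` should be stored and reused at inference time so that test inputs are transformed exactly as the training inputs were.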

5.2.2: Randomizing the input weights

Following Manry [16], the input weight matrix W is initialized randomly from a zero-mean Gaussian random number generator. The training of the input weights strongly depends on the gradient of the hidden units' activation functions with respect to the inputs. Training of input weights will cease if the hidden units they feed into have an activation function derivative of zero for all patterns. In order to remove the dominance of large-variance inputs, we divide the input weights by the input's standard deviation. Therefore, we adjust the mean and standard deviation of all the hidden units' net functions. This is called net control, as in Ref. [21]. At this point, we have determined the initial input weights, and we are now ready to initialize the output weights.

To solve for the weights connected to the output of the network, we use a technique called output weight optimization (OWO) [22, 23]. OWO minimizes the error function from Eq. (9) with respect to Wo by solving the M sets of Nu equations given by

$$\mathbf{C} = \mathbf{R}\,\mathbf{W}_o^T$$

Here, the cross-correlation matrix C and the auto-correlation matrix R are

$$\mathbf{C} = \frac{1}{N_v}\sum_{p=1}^{N_v}\mathbf{X}_{ap}\,\mathbf{t}_p^T, \qquad \mathbf{R} = \frac{1}{N_v}\sum_{p=1}^{N_v}\mathbf{X}_{ap}\,\mathbf{X}_{ap}^T$$

where Xap denotes Xa for the pth pattern.

In order to incorporate the regularization, we modify the R matrix elements except the threshold

where r is a vector containing the diagonal elements of R and diag() is an operator that creates a diagonal matrix from the vector.
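Output weight optimization can be sketched as a single linear solve once the input weights are fixed. The sketch below is an assumption-laden illustration: it omits the regularization of R, uses NumPy's `np.linalg.lstsq` in place of orthogonal least squares, and the function name `owo` and the synthetic data are hypothetical.

```python
import numpy as np

def owo(X, T, W):
    """Solve C = R Wo^T for the output weights, given fixed input weights W."""
    Nv = X.shape[0]
    xa = np.hstack([X, np.ones((Nv, 1))])        # augmented inputs, threshold last
    O = 1.0 / (1.0 + np.exp(-(xa @ W.T)))        # hidden activations, Nv x Nh
    Xa = np.hstack([xa, O])                      # Nv x Nu basis functions
    R = Xa.T @ Xa / Nv                           # auto-correlation matrix
    C = Xa.T @ T / Nv                            # cross-correlation matrix
    Wo_T, *_ = np.linalg.lstsq(R, C, rcond=None) # no regularization in this sketch
    return Wo_T.T, Xa, R, C

rng = np.random.default_rng(8)
Nv, N, Nh, M = 300, 2, 5, 1
X = rng.normal(size=(Nv, N))
T = np.tanh(X[:, :1] - X[:, 1:2])                # hypothetical nonlinear target
W = rng.normal(size=(Nh, N + 1))                 # random initial input weights

Wo, Xa, R, C = owo(X, T, W)
mse = np.mean(np.sum((T - Xa @ Wo.T) ** 2, axis=1))
baseline = np.mean(np.sum((T - T.mean(axis=0)) ** 2, axis=1))
```

Since the basis functions include a constant, the resulting `mse` cannot exceed the `baseline` error of predicting the mean target, even before any input weight is trained.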

The MLP network is now initialized and ready to be trained with first- or second-order algorithms. Training an MLP can be seen as an unconstrained optimization problem, for which the most popular learning algorithms are first-order gradient methods such as BP and CG and second-order methods such as Levenberg-Marquardt (LM) and Newton's method. Training algorithms can be classified as

1. One-stage algorithms, in which all the weights of the network are updated simultaneously.
2. Two-stage algorithms, in which input and output weights are trained alternately.

Fig. 6 shows a flowchart that summarizes all the training algorithms that will be described in the subsequent sections.

FIG 6 Typical training algorithms for training an MLP.


5.3: First-order learning algorithms

The first-order learning algorithms update the weights of the MLP based on gradient matrices, that is, the first-order information, hence the name. We start by discussing the training of an MLP with a one-stage algorithm, in which we train both the output and input weights simultaneously using either the BP or CG algorithm. We then describe a two-stage algorithm called OWO-hidden weight optimization.

5.3.1: Backpropagation algorithm

The BP algorithm is a greedy line search algorithm that uses a step size chosen to achieve the maximum decrease of the objective function at each step [5]. BP is a computationally efficient method, used in conjunction with gradient-based algorithms, that is widely used to train an MLP [24]. However, due to the nonconvexity of the error function (Eq. 9) in neural networks, BP is not guaranteed to find global minima but rather only local minima. Although this is considered a major drawback, Ref. [25] discusses why local minima are still useful in many practical problems. In each training epoch, we update all the weights of the network in the BP algorithm as follows:

$$\mathbf{w} \leftarrow \mathbf{w} + z\,\mathbf{g}$$

Here, w is a vector of network weights, w = vec(W, Woh, Woi), and g is a vector of network gradients, g = vec(G, Goh, Goi). The gradient matrices are the negative partial derivatives of E with respect to the weights, $\mathbf{G} = -\partial E/\partial \mathbf{W}$, $\mathbf{G}_{oh} = -\partial E/\partial \mathbf{W}_{oh}$, and $\mathbf{G}_{oi} = -\partial E/\partial \mathbf{W}_{oi}$. The vec() operator performs a lexicographic ordering of a matrix into a vector. z is the optimal learning factor, which is derived using a Taylor series expansion of the MSE E, expressed in terms of z, as [26]

The BP algorithm is summarized in Algorithm 4.

Algorithm 4 Backpropagation algorithm.

1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:     Calculate g
4:     Compute z from Eq. (39)
5:     Update w as w ← w + z ⋅ g
6:     it ← it + 1
7: end while
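Algorithm 4 can be sketched for the single-hidden-layer MLP with bypass weights. This is a simplified illustration: it uses a small fixed learning factor `z` instead of the optimal learning factor the chapter derives, an unregularized MSE, and hypothetical data and names such as `error_and_neg_gradients`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error_and_neg_gradients(X, T, W, Woh, Woi):
    """Return the MSE E and the negative gradient matrices G, Goh, Goi."""
    Nv = X.shape[0]
    xa = np.hstack([X, np.ones((Nv, 1))])   # augmented inputs, x_p(N+1) = 1
    O = sigmoid(xa @ W.T)                   # hidden activations, Nv x Nh
    Y = xa @ Woi.T + O @ Woh.T              # linear outputs, Nv x M
    err = T - Y
    E = np.sum(err ** 2) / Nv
    Goh = 2.0 * err.T @ O / Nv              # -dE/dWoh
    Goi = 2.0 * err.T @ xa / Nv             # -dE/dWoi
    delta = (err @ Woh) * O * (1.0 - O)     # output error backpropagated through f'
    G = 2.0 * delta.T @ xa / Nv             # -dE/dW
    return E, G, Goh, Goi

rng = np.random.default_rng(6)
Nv, N, Nh, M = 100, 2, 4, 1
X = rng.normal(size=(Nv, N))
T = np.sin(X[:, :1]) + 0.5 * X[:, 1:2] ** 2  # hypothetical nonlinear target
W = 0.5 * rng.normal(size=(Nh, N + 1))
Woh = 0.5 * rng.normal(size=(M, Nh))
Woi = 0.5 * rng.normal(size=(M, N + 1))

z = 0.05       # fixed learning factor (assumption; the chapter uses an optimal z)
Nit = 300
E_history = []
for it in range(Nit):
    E, G, Goh, Goi = error_and_neg_gradients(X, T, W, Woh, Woi)
    E_history.append(E)
    W = W + z * G                            # all weights are updated simultaneously
    Woh = Woh + z * Goh
    Woi = Woi + z * Goi
```

Comparing an analytic gradient entry with a finite-difference estimate of E is a cheap way to validate the backpropagated terms before running long training jobs.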

As in Ref. [5], the BP algorithm has two major criticisms. First, it does not scale well in the number of operations for sufficiently large Nw. Second, being a simple gradient descent procedure, it is unduly slow in the presence of flat error surfaces and is not a very reliable learning paradigm.

5.3.2: Training lemmas


Lemma 1

For the kth hidden unit, if f′(np(k)) = 0 for all patterns, then the weights feeding into the unit will not change during BP training.

Observe the partial derivative formula below. The partial derivative of E with respect to w(k, n) is 0 under the conditions of the lemma.

2. During net control, it seems that the standard deviations should be less than 2 and the mean should be close to 0.

Lemma 2

Let the kth hidden unit have any set of weights. Let units k + j for j between 1 and (K − 1) have identically valued input and output weights as the kth unit. In other words, w(k + j, n) is the same for j between 0 and (K − 1), and woh(i, k + j) is the same for j between 0 and (K − 1). After training with BP, all K of these hidden units still have identically valued weights and can be replaced by a single hidden unit.


Note that as mn increases,   becomes dominated by the first term, which has no

information related to changes in the nth input.


3. Lemma 5 shows that if a multioutput MLP is trained using the same hidden units for each output, it is not optimal.

4. Lemma 3 shows that BP lacks affine invariance [27].

5.4: Second-order learning algorithms

The basic idea behind using a second-order method is to improve on the first-order algorithms by using the second derivative along with the first derivative [5]. We present two one-stage algorithms, namely Newton's method and LM, and then a two-stage algorithm called OWO-multiple optimal learning factor [28–30].

5.4.1: Newton’s method

For Newton's method, given a starting point, we construct a quadratic approximation to a twice-differentiable error function that matches the first- and second-order derivative values at that point. We then minimize this quadratic function instead of the original error function by expanding the Taylor series of E′ about the point wk, as is clear from the equation below:

We calculate the second-order direction, d, by solving the set of linear equations with OLS

$$\mathbf{H}\,\mathbf{d} = \mathbf{g} \quad (48)$$

Assuming a quadratic error function in Eq. (9) and H to be positive definite, applying the first-order necessary condition (FONC) [31] to all the weights in an MLP, we update the weights as

$$\mathbf{w} \leftarrow \mathbf{w} + \mathbf{d} \quad (49)$$

Newton's algorithm is summarized in Algorithm 5.

Algorithm 5 Newton's algorithm.

1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:     Calculate g and H from Eqs. (46), (47)
4:     Compute d from Eq. (48)
5:     Update w as w ← w + d
6:     it ← it + 1
7: end while

Newton’s method is quadratically convergent and affine invariant [16]. Since it converges fast, we would like to use it to train an MLP, but generally, the Hessian H is singular [32].

If the error function is quadratic, then the approximation is exactly a one-step solution; otherwise, the approximation provides only an estimate of the exact solution. In the case of a nonquadratic error measure, we require a line search, and w is updated as

5.4.2: LM algorithm

The LM algorithm is a compromise between Newton’s method, which converges rapidly near local or global minima but may diverge, and gradient descent, which has assured convergence through a proper selection of the step size parameter but converges slowly. Following Eq (45), the LM algorithm is a suboptimal method. Since H is usually singular in Newton’s method, an alternative is to modify the Hessian matrix as in the LM algorithm [33] or to use a two-step method such as layer-by-layer training [34]. In LM, we modify the Hessian as


After obtaining HLM, weights of the model are updated using Eq (49).

The regularizing parameter λ plays a crucial role in the way the LM algorithm functions. If we set λ equal to 0, then Eq (52) reduces to Newton’s method (Eq 49). On the other hand, if we assign a large value to λ such that λ ⋅ I overpowers the Hessian H, the LM algorithm effectively behaves as a gradient descent algorithm. Press et al. [35] recommend an excellent Marquardt recipe for the selection of λ.

From a practical perspective, the computational complexity of obtaining HLM can be demanding, mainly when the dimensionality of the weight vector w is high. Therefore, due to scalability constraints, LM is particularly suitable for a small network.

The LM algorithm can be summarized in Algorithm 6.

Algorithm 6 LM algorithm.

1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:     Present all patterns to the network to compute the error Eold from Eq (9)
4:     Calculate g and H from Eqs (46), (47)
5:     Obtain HLM from Eq (51)
6:     Compute d from Eq (52)
7:     Update w as w ← w + d
8:     Recompute the error Enew by using the updated weights.
9:     if Enew < Eold then
10:       Reduce the value of λ
11:       goto Step 3
12:     else
13:       Increase the value of λ
14:     end if
15:     it ← it + 1
16: end while
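A compact sketch of the damping logic follows. It assumes the standard LM form HLM = H + λ ⋅ I, which is consistent with the description of λ above, and it uses hypothetical callables `error_fn`, `grad_fn`, and `hess_fn`. Two choices here are assumptions rather than part of Algorithm 6: a rejected step is discarded instead of kept, and λ is scaled by a factor of 10 in either direction.

```python
import numpy as np

def lm_minimize(error_fn, grad_fn, hess_fn, w0, lam=1e-2, n_it=50):
    """Levenberg-Marquardt style iterations with a damped Hessian H + lam * I."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_it):
        e_old = error_fn(w)
        g = grad_fn(w)                       # gradient (not the negative gradient)
        H_lm = hess_fn(w) + lam * np.eye(w.size)
        d = np.linalg.solve(H_lm, -g)        # damped second-order direction
        w_new = w + d
        if error_fn(w_new) < e_old:
            w = w_new                        # accept the step, trust the Hessian more
            lam /= 10.0
        else:
            lam *= 10.0                      # reject the step, move toward gradient descent
    return w

# Made-up error with a singular Hessian at the starting point:
# E(w) = (w0 - 1)^4 + (w1 + 2)^2, so H = diag(12 (w0 - 1)^2, 2).
error_fn = lambda w: (w[0] - 1.0) ** 4 + (w[1] + 2.0) ** 2
grad_fn = lambda w: np.array([4.0 * (w[0] - 1.0) ** 3, 2.0 * (w[1] + 2.0)])
hess_fn = lambda w: np.diag([12.0 * (w[0] - 1.0) ** 2, 2.0])

w_lm = lm_minimize(error_fn, grad_fn, hess_fn, w0=[1.0, 0.0])
print(w_lm)
```

At the starting point the undamped Hessian is singular, so the pure Newton system has no unique solution there, while adding λ ⋅ I keeps the linear system solvable.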

6: KL divergence

From an information theory point of view, Kullback-Leibler (KL) divergence is a relative entropy measure that estimates how closely an approximate distribution q(x) models the unknown distribution p(x). To arrive at the expression for KL divergence, we follow the probabilistic viewpoint. The following explanation is suitable for both continuous and discrete distributions. To quantitatively determine how good the approximate model q(x) is compared to p(x), we calculate the log-likelihood ratio for an entire dataset as

Eq (53) indicates, for any random sample x, how much more likely the data point is to occur under p(x) than under q(x). The intuition behind Eq (53) is that if p(x) fits the data more closely, it gives a value greater than 0, whereas a value less than 0 indicates that q(x) fits the data better. If the models fit the data equally well, Eq (53) gives a value of 0. So with Eq (53), we can quantify how much better one model is than the other given the dataset. For a large dataset with N sampled points that are independent and identically distributed (iid), we calculate an analytical term called the average predictive power.

The probabilistic interpretation is connected to neural network design through the iid assumption. A common misconception about KL divergence is to think of it as a distance between p(x) and q(x). However, this is wrong, since KL divergence is not symmetric (it does not follow the commutative property), that is, KL(p||q) ≠ KL(q||p).
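The asymmetry is easy to see numerically. The sketch below uses the standard discrete definition KL(p||q) = Σ p(x) log(p(x)/q(x)), that is, the expectation under p of the log-likelihood ratio; the function name `kl_divergence` and the two example distributions are made up for the illustration, and q(x) is assumed to be positive wherever p(x) is positive.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p || q) in nats; terms with p(x) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.4, 0.1]   # stands in for the unknown distribution
q = [0.3, 0.3, 0.4]   # stands in for the approximate model

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
print(kl_pq, kl_qp, kl_pp)
```

Swapping the arguments changes the value, while the divergence of a distribution from itself is zero, so KL behaves like a directed measure of mismatch rather than a distance.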

7: Generalized linear models

From Section 3 on linear regression, we add a nonlinear activation to the output neurons and create a class of models known as generalized linear models. Platt [36] has shown the utility of using nonlinear output activations such as sigmoids. We describe an algorithm that uses sigmoidal output activation to bound network outputs, as shown in Fig 7. The error of the resulting generalized linear classifier (GLC) is

where Op(i) is defined as f(yp(i)) and f(⋅) denotes the sigmoid activation function.
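As a rough illustration of Op(i) = f(yp(i)), the sketch below passes linear outputs through a sigmoid and evaluates an error. The error expression is not available here, so a mean squared error between the desired outputs and the sigmoidal outputs is assumed; the function `glc_error`, the weight matrix `W`, and the data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glc_error(W, X, T):
    """Assumed error: mean over patterns of the squared distance between T and f(y)."""
    Y = X @ W.T            # linear outputs y_p(i), shape (Nv, M)
    O = sigmoid(Y)         # bounded outputs O_p(i) = f(y_p(i))
    return float(np.mean(np.sum((T - O) ** 2, axis=1))), O

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                    # 20 synthetic patterns, 3 inputs
T = np.eye(2)[rng.integers(0, 2, size=20)]      # one-hot desired outputs, 2 classes
W = rng.normal(size=(2, 3))

E_G, O = glc_error(W, X, T)
print(E_G, O.min(), O.max())
```

The sigmoid keeps every output between 0 and 1, so outputs that are already far on the correct side of a target are not penalized the way unbounded linear outputs would be.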


FIG 7 Generalized linear model.

In order to mitigate this increase in EG, we need to map the linear output activations yp to sigmoidal outputs Op as

where a and b are scalars that are found such that EG is minimized.

8: Kernel method

A second-order feed forward training algorithm for a radial basis function (RBF) classifier is

patterns, the RBF basis functions are given as

Equating the gradient of Erbf(Wo) to 0, we solve for the output weight matrix Wo using OWO [37].

Let mk, βk, and c(n) denote the kth center vector, the kth spread parameter, and the weighted Euclidean distance measure weight for the nth input. These parameters are optimized using an efficient Newton’s algorithm [37].

9: Nonlinear SVM classifier

A nonlinear support vector machine (SVM) follows a structure similar to that of the RBF classifier. In a nonlinear SVM, K(Xp, Xq) ≡ ϕ(Xp)Tϕ(Xq) is called the kernel function and ϕ(Xq) is a feature function. For our comparisons  , γ > 0. For a given two-class classification problem, a nonlinear SVM solves the following convex optimization problem:


subject to  , ξi ≥ 0. Here, the weight vector Wo is Nsv by 1 and b is a bias. The basis vector Xp is Nsv by 1, and 1 ≤ p ≤ Nsv. C is a user-specified positive parameter, and the ξp are called the slack variables. We need to find the optimum values of the weight vector Wo, bias b, and the slack variables ξp to minimize Esvm. The coordinate descent algorithm [39] is used for training. For multiclass problems, a one-versus-the-rest strategy is implemented.

10: Tree ensembles

10.1: Decision trees

A decision tree [40] is an upside-down tree-like structure generated from training data, with the main root at the top, as shown in Fig 8. Each node is further split into subsequent nodes, called branches or subtrees, depending on the decisions made using the training data. The ends of the tree, where it does not split further, are the leaf (terminal) nodes that indicate the final decision. Decision trees are used for both approximation and classification data.

FIG 8 Decision tree.

A decision tree is learned by splitting the branches into subbranches depending on the lowest error values. The split decision for approximation data is made based on the variance of the node.
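A variance-based split for a single input feature can be sketched as follows. This is a minimal illustration, assuming the split score is the size-weighted variance of the desired outputs in the two child nodes and that candidate thresholds are midpoints between consecutive sorted feature values; the function name `best_variance_split` and the data are made up.

```python
import numpy as np

def best_variance_split(x, t):
    """Return the threshold on x that minimizes the weighted child variance of t."""
    order = np.argsort(x)
    x, t = x[order], t[order]
    best_thr, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                       # no threshold separates equal values
        left, right = t[:i], t[i:]
        score = (len(left) * left.var() + len(right) * right.var()) / len(t)
        if score < best_score:
            best_thr, best_score = 0.5 * (x[i - 1] + x[i]), score
    return best_thr, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])   # one input feature
t = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])      # desired outputs
thr, score = best_variance_split(x, t)
print(thr, score)
```

The chosen threshold falls in the gap between the two groups of outputs, since that split leaves each child node with nearly constant targets. A full tree would apply the same search recursively over all features and child nodes.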

Date posted: 02/08/2024, 17:08
