"Artificial Intelligence and Machine Learning for Predictive and Analytical Rendering in Edge Computing focuses on the role of AI and machine learning as they impact and work alongside edge computing. Sections cover the growing number of devices and applications in diversified domains of industry, including gaming, speech recognition, medical diagnostics, robotics, and computer vision, and how they are being driven by Big Data, artificial intelligence, machine learning, and distributed computing, be it cloud computing or the evolving fog and edge computing paradigms. Challenges covered include remote storage and computing; bandwidth overload due to transporting data from end nodes to the cloud, leading to latency issues; security issues in transporting sensitive medical and financial information across larger gaps between the points of data generation and computing; and the design features of edge nodes that store and run AI/ML algorithms for effective rendering."
Table of Contents
Cover
Title page
Copyright
Contributors
Preface
Part I: AI and machine learning
Chapter 1: Supervised learning
Abstract
1: Introduction
2: Perceptron
3: Linear regression
4: Logistic regression
5: Multilayer perceptron
6: KL divergence
7: Generalized linear models
8: Kernel method
9: Nonlinear SVM classifier
10: Tree ensembles
Chapter 2: Supervised learning: From theory to applications
Abstract
Chapter 3: Unsupervised learning
Abstract
1: Introduction
2: k-means clustering
3: k-means++ clustering
4: Sequential leader clustering
5: EM algorithm
6: Gaussian mixture model
7: Autoencoders
8: Principal component analysis
9: Linear discriminant analysis
10: Independent component analysis
References
Chapter 4: Regression analysis
Abstract
1: Introduction
2: Linear regression
3: Cost functions
4: Gradient descent
5: Polynomial regression
6: Regularization
7: Evaluating a machine learning model
References
Chapter 5: The integrity of machine learning algorithms against software defect prediction
Abstract
1: Introduction
2: Related works
3: Proposed method
4: Experiment
5: Results
6: Threats to validity
7: Conclusions
References
Chapter 6: Learning in sequential decision-making under uncertainty
Abstract
Acknowledgments
1: Introduction
2: Multiarmed bandit problem
3: Markov decision process planning problem
4: Reinforcement learning
5: Summary
References
Chapter 7: Geospatial crime analysis and forecasting with machine learning techniques
Abstract
1: Introduction
2: Related work
3: Methodology
4: Results and discussion
5: Conclusions
4: Truth content discovery algorithm
5: Trustworthy and scalable service providers algorithm
6: Efficient feature extraction and classification (EFEC) algorithm
7: Query retrieval time (QRT)
8: Trust content discovery and trustworthy and scalable service providers algorithm
9: Efficient feature extraction and classification (EFEC) algorithm and customer review datasets
10: Summary
11: Conclusions
12: Future enhancements
References
Chapter 9: Reliable diabetes mellitus forecasting using artificial neural network multilayer perceptron
Abstract
1: Introduction
2: Related works
3: Methodology
4: Building the diabetic diagnostic criteria
5: Evaluating the diabetes outcomes using classification algorithms
6: Conclusions
Chapter 10: A study of deep learning approach for the classification of electroencephalogram (EEG) brain signals
Abstract
1: Introduction
2: Methods
3: Results
4: Discussion
5: Conclusions
References
Chapter 11: Integrating AI in e-procurement of hospitality industry in the UAE
Abstract
1: Introduction
2: Problem statement
3: Authors’ contributions
4: Significance of the study
5: Theoretical framework
6: Research aims and objectives
7: Literature review
8: Major findings
9: Discussions
10: Major gaps in the study
11: Conclusions
Chapter 12: Application of artificial intelligence and machine learning in blockchain technology
Abstract
Acknowledgment
1: Introduction
2: Applications of artificial intelligence, machine learning, and blockchain technology
3: It takes two to tango: Future of artificial intelligence and machine learning in blockchain technology
4: Edge computing: A potential use case of blockchain
5: Conclusions
References
Part II: Data science and predictive analysis
Chapter 13: Implementing convolutional neural network model for prediction in medical imaging
Abstract
1: Introduction
2: Convolutional neural networks
3: Implementing CNN for biomedical imaging and analysis
4: Architecture models for different image type
5: Conclusion
6: Future scope
References
Chapter 14: Fuzzy-machine learning models for the prediction of fire outbreaks: A comparative analysis
Abstract
1: Introduction
2: Related literature
3: Research methodology
4: Machine learning algorithms for fire outbreak prediction
5: Result and discussion
6: Conclusions
References
Chapter 15: Vehicle telematics: An Internet of Things and Big Data approach
Abstract
1: Introduction
3: Contribution of intelligent e-learning system using BN model
4: Learner assessment model
5: Results and discussions
6: Conclusions and future work
References
Chapter 17: Ensemble method for multiclassification of COVID-19 virus using spatial and frequency domain features over X-ray images
Abstract
1: Introduction
2: Literature review
3: Proposed methodology
4: Result analysis
5: Discussion and conclusions
References
Chapter 18: Chronological text similarity with pretrained embedding and edit distance
Abstract
1: Introduction
2: Literature review
3: Theoretical background
4: Modeling
5: Experimental settings
6: Results and discussion
7: Conclusions
5: Conclusions
References
Chapter 20: A real-time performance monitoring model for processing of IoT and big data using machine learning
Abstract
1: Introduction
2: Experimental study
3: Major findings
4: Conclusions
References
Chapter 21: COVID-19 prediction from chest X-ray images using deep convolutional neural network
Abstract
1: Introduction
2: Methodology
3: Results and discussions
4: Conclusions
References
Further reading
Chapter 22: Hybrid deep learning neuro-fuzzy networks for industrial parameters estimation
Abstract
1: Introduction
2: Preliminaries
3: Methodology
4: Results and discussion
5: Validation of model
6: Discussions on performance evaluation
7: Conclusions
8: Future scope
References
Chapter 23: An intelligent framework to assess core competency using the level prediction model (LPM)
Abstract
1: Introduction
2: Related work
3: Existing applications
4: Proposed system
5: Experimental
6: Conclusions
References
Part III: Edge computing
Chapter 24: Edge computing: A soul to Internet of things (IoT) data
Abstract
1: Introduction
2: Edge computing characteristics
3: New challenges in Internet of technology (IoT): Edge computing
4: Edge computing support to IoT functionality
5: IoT applications: Cloud or edge computing?
6: Benefits and potential of edge computing for IoT
7: Use case: Edge computing in IoT
8: Pertinent open issues which require additional investigations for edge computing
9: Conclusions
Chapter 25: 5G: The next-generation technology for edge communication
Abstract
1: Introduction
2: History
3: 5G technology
4: 5G cellular network
5: Components used in 5G technology/network
6: Differences from 4G architecture
7: Security of 5G architecture
8: 5G time period
9: Case study on 5G technology
10: 5G advancement
11: Advantage and disadvantage of 5G technology
12: Challenges
13: Future scope
14: Conclusions
References
Chapter 26: Challenges and opportunities in edge computing architecture using machine learning approaches
Abstract
1: Introduction
2: Overview of edge computing
3: Security and privacy in edge computing
4: Intersection of machine learning and edge using enabling technologies
5: Machine learning and edge bringing AI to IoT
2: Hybrid software-defined networks
3: Security challenges in hybrid software-defined networks
4: Solutions for hybrid software-defined networks
5: Learning techniques for hybrid software-defined networks
6: Discussion and implementation
7: Conclusions
References
Further reading
Chapter 28: Moving to the cloud, fog, and edge computing paradigms: Convergences and future research direction
Abstract
1: Introduction
2: Features and differences between cloud, fog, and edge computing
3: Framework and programming models: Architecture of fog computing
4: Moving cloud to edge computing
5: Case study: Edge computing for intelligent aquaculture
6: Conclusions
Chapter 30: AI cardiologist at the edge: A use case of a dew computing heart monitoring solution
Abstract
1: Introduction
2: Related work
3: Architectural approach
4: ECGalert use case
5: Discussion
6: Conclusions
References
PART I
AI and machine learning
Chapter 1: Supervised learning
Kanishka Tyagi a; Chinmay Rane b; Michael Manry c
a Aptiv Advanced Research Center, Agoura Hills,
c Department of Electrical Engineering, The University of Texas at Arlington, Arlington, TX, United States
Machine learning models learn different tasks with different paradigms that effectively aim to improve the models through training. Supervised learning is a common machine learning training paradigm that has been used successfully in real-world machine learning applications. Typical supervised learning involves two phases. In phase 1 (commonly called training), we provide inputs and expected outputs (also known as ground truth) and train a model with respect to a metric using an optimization algorithm. In phase 2 (commonly called testing), we deploy the model on unseen data and expect it to either classify the inputs or approximate the outputs. Although supervised learning is covered in almost all machine learning textbooks, we will introduce and explain supervised learning from an application point of view and its relationship to edge computing. All the algorithms will be explained from a mathematical and theoretical point of view as well as from a programmer’s perspective. This will help in gaining hands-on experience in implementing algorithms for a variety of problems.
Linear regression; Logistic regression; Steepest descent; Conjugate gradient; Multilayer perceptron; Second-order algorithms; KL divergence; Generalized linear models; Kernel machines; Bootstrapping
1: Introduction
This chapter discusses the various supervised learning paradigms that are used to train and deploy machine learning models. While the theory of supervised learning is driven mainly by the computer vision and natural language processing communities, much research is required in its application to edge computing. With recent advances in machine learning algorithms, combined with increasing computational power, edge computing is another area of application where machine learning will improve the current technology. According to a white paper from Cisco [1], 50 billion internet of things (IoT) devices will be connected to the internet by the end of 2020, and even though it is estimated that nearly 850 ZB of data will be generated per year outside the cloud by 2021, the global data center traffic is only approximately 21 ZB. This means that a transformation from big cloud data centers to a wide range of edge devices is happening.
We plan to cover the basics of traditional supervised learning algorithms and the tricks of the trade used when implementing them for various real-life applications. The assumptions and programming decisions made while implementing them are also discussed. We start with generalized linear models along with the optimization algorithms (gradient descent, Newton’s method) and metrics (mean square error [MSE], cross-entropy) used in training the model. Then, Naïve Bayes and kernel methods are discussed, along with the decision tree and its variants. We provide pseudocode for all the supervised learning algorithms that we discuss to give the readers an idea of how the algorithms are implemented from theory to practice. In addition, supervised learning algorithms that capture the variation of the input data distribution, algorithms that project data into higher subspaces, and first- and second-order learning paradigms are the subject matter of this chapter. The chapter covers these learning paradigms in sync with their modern-day application to edge computing from an implementation perspective.
Edge computing and machine learning will enable devices to have a more pervasive and fine-grained intelligence. Supervised learning is a common way of training machine learning algorithms that can later be used in the inference stage on IoT and other edge devices.

This chapter covers several supervised learning algorithms that are commonly used in various real-life applications. Section 2 covers the basic perceptron model that helps in understanding the learning algorithm. In Section 3, the linear regression algorithm is discussed, which extends the idea of the perceptron algorithm to solve supervised regression tasks. We discuss commonly used gradient-based learning algorithms along with their pseudocodes. Section 4 discusses logistic regression, a very commonly used classifier due to its simplicity in structure and generalizability. In Section 5, we bring together the concepts from the previous sections to discuss the multilayer perceptron (MLP) network, a widely used neural network in real-life applications. We explain the structure and initialization strategy, and detail various learning algorithms that are used to train an MLP. Section 6 discusses the Kullback-Leibler (KL) divergence measure used to calculate how different two probability distributions are from each other. Section 7 discusses the generalized linear models that extend the linear models with a nonlinear activation on the output nodes. Section 8 talks about the kernel methods, leading to Section 9, which explains nonlinear support vector machine (SVM) classifiers. We conclude this chapter in Section 10 by discussing various tree ensemble algorithms.
2: Perceptron
Perceptron algorithms came about in the early 1960s; however, Minsky and Papert [2] showed the restrictions they have. Rosenblatt [3] described an alpha perceptron that is an example of a statistical pattern recognition (SPR) system. In a typical SPR system, the features are obtained from the raw input using no learning mechanism but instead using a common-sense rule. Therefore, a human decides what is a good feature and sees if it works, and if it does not work, tries another. We learn the weight associated with each feature activation to obtain a single scalar quantity and then decide based on whether this quantity is above or below a threshold. In a typical perceptron model, the input vector x is used to compute a weighted sum over all the neurons, to which a bias vector is added. This bias vector is also known as the threshold vector in the literature:

n = wTx + b  (1)

The linear perceptron will output y depending on the following rule, as shown in Fig 1:

y = 1 if n ≥ 0, and y = 0 otherwise  (2)
The perceptron algorithm is fast and straightforward, and if the dataset is linearly separable, it is guaranteed to converge.
FIG 1 Perceptron activation functions.
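The perceptron rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's reference implementation: the AND-function dataset, learning rate, and epoch cap are our own illustrative choices.

```python
import numpy as np

def train_perceptron(X, t, epochs=100, lr=1.0):
    """Classic perceptron rule on a linearly separable dataset.

    X: (Nv, N) inputs, t: (Nv,) targets in {0, 1}.
    The threshold is absorbed by augmenting each input with a 1.
    """
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])  # augmented inputs
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        errors = 0
        for xa, target in zip(Xa, t):
            y = 1 if w @ xa >= 0 else 0        # Eq (2): threshold rule
            if y != target:
                w += lr * (target - y) * xa    # nudge toward the misclassified input
                errors += 1
        if errors == 0:                        # converged: all patterns correct
            break
    return w

# AND function: linearly separable, so convergence is guaranteed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
w = train_perceptron(X, t)
pred = (np.hstack([np.ones((4, 1)), X]) @ w >= 0).astype(int)
```

On a linearly separable problem such as AND, the loop exits early once a full pass produces no errors, matching the convergence guarantee stated above.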
3: Linear regression
In this section, we discuss the structure and notation of linear regression. As shown in Fig 2, a linear regression is a weight matrix W that transforms an input vector x into a discriminant vector y [4]. The weight w(m, n) connects the nth input to the mth output. The training dataset (xp, tp) consists of N-dimensional input vectors xp and M-dimensional desired output vectors tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training patterns. The threshold is handled by augmenting x with an extra element, which is equal to 1, as xa = [1 : xT]T. So xa contains Nu basis functions, where Nu = N + 1. For the pth training pattern, the network output vector yp can be written as

yp = W xap

where xap denotes xa for the pth pattern.
FIG 2 Linear regression.
3.1: Training a linear regression
To train the linear regression, we minimize an error function E that is a surrogate for a nonsmooth classification error. As in Ref [5], from a Bayesian point of view, we consider maximizing the likelihood function or minimizing the MSE in the least-square sense, where the MSE between the inputs and the outputs is

E = (1/Nv) Σp Σm [tp(m) − yp(m)]²
Here, tp(m) denotes the target output for the mth output, and M denotes the total number of outputs. We minimize the error function from Eq (9) with respect to W by solving the M sets of N + 1 linear equations given by

R WT = C  (5)

where the cross-correlation matrix C and the auto-correlation matrix R are, respectively,

C = (1/Nv) Σp xap tpT,  R = (1/Nv) Σp xap xapT
Since R in Eq (35) is often ill-conditioned, it is unsafe to use Gauss-Jordan elimination. Eq (35) is instead solved using orthogonal least squares (OLS) [6]. In Ref [7], OLS is used to solve for radial basis function network parameters. OLS is useful for practical applications for two primary reasons. First, the training is fast since solving linear equations is straightforward. Second, it helps us to avoid some local minima [8]. In terms of optimization theory, solving Eq (35) for W is merely Newton’s algorithm for the output weights [9]. Given the negative gradient

g = −∂E/∂w  (12)

and the error function E expressed in terms of the learning factor B2, a Taylor series of E yields the optimal learning factor

B2 = −(∂E/∂B2) / (∂²E/∂B2²)  (15)
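The linear-equation solution of Section 3.1 (form R and C, then solve for W) can be sketched as follows. NumPy's `lstsq` stands in here for the OLS solver of Ref [6] — both handle an ill-conditioned R more gracefully than Gauss-Jordan elimination — and the synthetic dataset is our own illustrative choice.

```python
import numpy as np

def solve_output_weights(X, T):
    """Train a linear regression by solving R @ W.T = C.

    X: (Nv, N) inputs, T: (Nv, M) desired outputs. Inputs are augmented
    with a leading 1 to absorb the threshold, giving Nu = N + 1 basis
    functions, as in the text.
    """
    Nv = X.shape[0]
    Xa = np.hstack([np.ones((Nv, 1)), X])       # (Nv, Nu) augmented inputs
    R = Xa.T @ Xa / Nv                           # auto-correlation matrix (Nu, Nu)
    C = Xa.T @ T / Nv                            # cross-correlation matrix (Nu, M)
    W = np.linalg.lstsq(R, C, rcond=None)[0].T   # least-squares solve, (M, Nu)
    return W

# Recover a known linear map t = 1 + 2*x1 - 3*x2 from noiseless data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
T = X @ np.array([[2.0], [-3.0]]) + 1.0
W = solve_output_weights(X, T)    # first column is the threshold weight
```

Because the data are noiseless and R is well-conditioned here, the recovered weight row matches the generating map [1, 2, −3] to machine precision.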
3.2: Steepest descent
The steepest descent gradient algorithm can be summarized in Algorithm 1.

Algorithm 1 Steepest descent gradient algorithm.
1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:   Calculate g from Eq (12)
4:   Compute B2 from Eq (15)
5:   Update w as w ← w + B2 ⋅ g
6:   it ← it + 1
7: end while
Fig 3 illustrates gradient descent using a two-dimensional (2D) contour plot. The X-axis denotes the first weight w1 and the Y-axis the second weight w2. The arrows [g0, g1, g2, g3, g4, g5] in Fig 3 indicate the directions of the negative gradients for each weight update on the way to a minimum. The learning factor controls the step size derived in Eq (15). We can observe that the gradients get smaller as they approach the minimum point.
FIG 3 Steepest descent 2D contour plot.
From a programmer’s perspective, some of the debugging steps are:
1. E is nonincreasing
2. Eg approaches 0
3. B2 ≥ 0
4.
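Algorithm 1 can be sketched in a few lines. In this illustration a fixed learning factor stands in for the optimal factor B2 of Eq (15), and the quadratic error surface is our own choice; the function names are not from the text.

```python
import numpy as np

def steepest_descent(neg_grad, w0, lr=0.1, n_it=100):
    """Steepest descent in the chapter's sign convention.

    neg_grad returns g = -dE/dw, so the update w <- w + lr * g
    moves downhill, matching Algorithm 1's update rule.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(n_it):
        g = neg_grad(w)        # negative gradient direction
        w = w + lr * g         # step along the descent direction
    return w

# Quadratic bowl E(w) = (w1 - 3)^2 + (w2 + 1)^2, minimum at (3, -1)
neg_grad = lambda w: -2.0 * (w - np.array([3.0, -1.0]))
w_min = steepest_descent(neg_grad, [0.0, 0.0])
```

On this bowl the distance to the minimum shrinks by a constant factor per step, consistent with debugging check 1 above: E is nonincreasing throughout the run.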
3.3: Conjugate gradient
As we saw in the previous section, in a basic gradient algorithm the weights are updated in the negative gradient direction. Although the error function decreases most rapidly along the negative direction of the gradient, this does not necessarily produce fast convergence. The conjugate gradient (CG) algorithm [10] performs a line search in the conjugate direction and has faster convergence than the backpropagation (BP) algorithm. Although CG is a general unconstrained optimization technique, its use in efficiently training an MLP is well documented in Ref [11].
To train an MLP using the CG algorithm, we use a direction vector that is obtained from the gradient. Here, p = vec(P, Poh, Poi), where P, Poi, and Poh are the direction vectors. B1 is the ratio of the gradient energy from two consecutive iterations. This direction vector, in turn, updates all the weights simultaneously. The CG algorithm can be summarized in Algorithm 2.

Algorithm 2 Conjugate gradient algorithm.
1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:   Calculate p from g
4:   Compute z from Eq (15)
5:   Update w as w ← w + z ⋅ p
6:   it ← it + 1
7: end while
Fig 4 illustrates CG using a 2D contour plot. The X-axis denotes the first weight w1 and the Y-axis the second weight w2. From the plot, we can observe that CG needs N + 1 steps to reach the minimum.
FIG 4 Conjugate gradient 2D contour plot.
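The behavior in Fig 4 — reaching the minimum of an (N + 1)-dimensional quadratic in at most N + 1 steps — can be reproduced with linear CG. The sketch below minimizes an illustrative quadratic E(w) = ½ wTAw − bTw with an exact line search; the matrix A and vector b are our own choices, not from the text.

```python
import numpy as np

def conjugate_gradient(A, b, n_it=None):
    """Linear CG for minimizing E(w) = 0.5 w^T A w - b^T w (A symmetric PD).

    For an n-dimensional quadratic, CG reaches the exact minimum in at
    most n steps, which is the behavior Fig 4 illustrates.
    """
    n = len(b)
    n_it = n_it or n
    w = np.zeros(n)
    g = b - A @ w                  # negative gradient (residual)
    p = g.copy()                   # first direction = steepest descent
    for _ in range(n_it):
        gg = g @ g
        if gg < 1e-30:             # already at the minimum
            break
        Ap = A @ p
        z = gg / (p @ Ap)          # exact line search along p
        w = w + z * p
        g_new = g - z * Ap
        B1 = (g_new @ g_new) / gg  # ratio of gradient energies
        p = g_new + B1 * p         # conjugate direction update
        g = g_new
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
b = np.array([1.0, 2.0])
w = conjugate_gradient(A, b)             # exact minimum after 2 steps
```

Note the B1 update: it is exactly the "ratio of the gradient energy from two consecutive iterations" described in the text.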
From a programmer’s perspective, some of the debugging steps are as follows:
4: Logistic regression

Since it is a two-class classification, the classes C1 and C2 can be denoted as y ∈ {0, 1}. Linear regression does not normally work for classification since, even if y = 0 or 1, the predicted value can be greater than 1 or less than 0. Therefore, we use logistic regression, in which y ∈ [0, 1]. It should be emphasized that this is still classification rather than regression. Here, σ() is a logistic function defined as

σ(a) = 1 / (1 + e−a)
When wTx tends to ∞, σ(wTx) tends to 1, and when wTx tends to −∞, σ(wTx) tends to 0.
The decision boundary for logistic regression is a property of the parameters and not of the training set. So, if there is a nonlinear decision boundary, it is mapped according to the nonlinear activation function used in logistic regression.
Since σ() is nonlinear (sigmoid), the decision boundary can be learned as manifold learning of nonlinear surfaces. In order to train the logistic regression, the cost function J(w) should preferably be convex so that only one global minimum is present. We do not use the MSE as in linear regression, since it would be a nonconvex function with multiple local minima. However, there are ways to train a logistic regression with the MSE [12, 13].

The logistic regression cost function is the cross-entropy

J(w) = −(1/Nv) Σp [tp log(yp) + (1 − tp) log(1 − yp)]
Algorithm 3 Gradient descent algorithm.
1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:   Calculate g
4:   Update w as w ← w + z ⋅ g
5:   it ← it + 1
6: end while
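Algorithm 3, applied to the cross-entropy cost J(w), can be sketched as follows. The learning factor, iteration count, and OR-function dataset are illustrative choices of ours, not values from the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(X, t, lr=1.0, n_it=5000):
    """Gradient descent on the convex cross-entropy cost J(w).

    The negative gradient of J takes the simple form
    (1/Nv) * Xa^T (t - y), with y = sigmoid(Xa @ w).
    """
    Nv = X.shape[0]
    Xa = np.hstack([np.ones((Nv, 1)), X])  # augmented inputs (threshold)
    w = np.zeros(Xa.shape[1])
    for _ in range(n_it):
        y = sigmoid(Xa @ w)
        g = Xa.T @ (t - y) / Nv            # negative gradient of J
        w = w + lr * g                     # Algorithm 3 update rule
    return w

# OR function: a linearly separable two-class problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0.0, 1.0, 1.0, 1.0])
w = train_logistic(X, t)
pred = (sigmoid(np.hstack([np.ones((4, 1)), X]) @ w) >= 0.5).astype(int)
```

Because J(w) is convex, the same minimum is reached from any initialization, in contrast to the MSE objective discussed above.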
5: Multilayer perceptron
In this section, we start by describing the multilayer perceptron (MLP), a nonlinear signal processor with good approximation and classification properties. The MLP has basis functions that can adapt during the training process by utilizing example inputs and desired outputs. An MLP will minimize an error criterion and closely mimic an optimal processor in which the computational burden in processing an input vector is controlled by slowly varying the number of coefficients [16, 17]. We review the first- and second-order training algorithms for the MLP, followed by a classifier design for the MLP through regression.
5.1: Structure and notation
Fig 5 illustrates a single-layer fully connected MLP. The input weights w(k, n) connect the nth input to the kth hidden unit. Output weights woh(m, k) connect the kth hidden unit’s nonlinear activation Op(k) to the mth output yp(m), which has a linear activation. The bypass weights woi(m, n) connect the nth input to the mth output. The training data, described by a set of independent identically distributed input-output pairs, consist of N-dimensional input vectors xp and M-dimensional desired output vectors tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training vectors present in the dataset. Let Nh denote the number of hidden units. In order to handle the thresholds in the input layer, the input is augmented by an extra element xp(N + 1), where xp(N + 1) = 1. For each training pattern p, the hidden layer net function vector np can be written as

np = W xp

The kth element of the hidden unit activation vector Op is calculated as Op(k) = f(np(k)), where f(⋅) denotes the sigmoid activation function. The network output vector yp can be written as

yp = Woh Op + Woi xp  (32)

The expression for the actual outputs given in Eq (32) can be rewritten as

yp = Wo Xa

where Xa = [xpT : OpT]T is the augmented input column vector with Nu basis functions, where Nu = 1 + N + Nh. The total number of weights is Nw = Nh ⋅ (1 + N) + M ⋅ Nh. Similarly, Wo is the M by Nu-dimensional augmented weight matrix defined as Wo = [Woh : Woi].
FIG 5 Fully connected MLP.
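The structure of Fig 5 can be sketched numerically. This is a minimal forward pass assuming sigmoid hidden units and the augmented-input convention of the text; the dimensions N, Nh, M and the random weights are purely illustrative.

```python
import numpy as np

def mlp_forward(xp, W, Woh, Woi):
    """Forward pass through the fully connected MLP of Fig 5.

    xp:  (N,) input pattern, augmented internally with a trailing 1.
    W:   (Nh, N+1) input weights; Woh: (M, Nh) output weights;
    Woi: (M, N+1) bypass weights connecting inputs directly to outputs.
    """
    xa = np.append(xp, 1.0)            # xp(N+1) = 1 handles the thresholds
    n_p = W @ xa                       # hidden-layer net function vector np
    O_p = 1.0 / (1.0 + np.exp(-n_p))   # sigmoid activations Op(k) = f(np(k))
    yp = Woh @ O_p + Woi @ xa          # linear output plus bypass, Eq (32)
    return yp

N, Nh, M = 3, 4, 2
rng = np.random.default_rng(1)
W = rng.normal(size=(Nh, N + 1))
Woh = rng.normal(size=(M, Nh))
Woi = rng.normal(size=(M, N + 1))
yp = mlp_forward(rng.normal(size=N), W, Woh, Woi)   # M-dimensional output
```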
To train an MLP, we recast the MLP learning problem as an optimization problem and use a structural risk minimization framework to design the learning algorithm [5, 18]. Essentially, this framework is used to minimize the error function E as in Eq (9), which is a surrogate for a nonsmooth classification error. As in Ref [5], from a Bayesian point of view, we consider maximizing the likelihood function or minimizing the MSE in a least-square sense. Therefore, the MSE between the inputs and the outputs is defined as

E = (1/Nv) Σp Σm [tp(m) − yp(m)]² + λ‖w‖²  (34)

Here, λ is an L2 regularization parameter used to avoid memorization and overfitting. The nonlinearity in yp causes the error E to be nonconvex, and so in practice, local minima of the error function may be found. In the earlier discussion, we have assumed that tp has a Gaussian distribution given the input xp. In case the conditional distribution of targets given the input has a Bernoulli distribution, the error function, which is given by the negative log likelihood, is then a cross-entropy error function [5].

In Ref [5], it is concluded that using a cross-entropy error function instead of the MSE for a classification problem leads to faster training as well as improved generalization. Apart from the cross-entropy and L2 error forms, we also have an L1 error measure. Golik et al. [19] and Simard et al. [20] present a good comparison between the L2 and cross-entropy errors and suggest using the cross-entropy error function for classification in order to have faster training and improved generalization. Our goal is to obtain an optimal value of the weights of an MLP. In order to achieve this, we use the empirical risk minimization framework [17] to design the learning algorithms. An essential benefit of converting the training of an MLP into an optimization problem is that we can now use various optimization algorithms to optimize the learning of an MLP.
5.2: Initialization
5.2.1: Input means and standard deviations
If some inputs have much larger standard deviations than others, they can dominate the training, even if they are relatively useless. Inputs are therefore normalized to zero mean and unit standard deviation.
5.2.2: Randomizing the input weights
As in Manry [16], the input weight matrix W is initialized randomly from a zero-mean Gaussian random number generator. The training of the input weights strongly depends on the gradients of the hidden units’ activation functions with respect to the inputs. Training of an input weight will cease if the hidden unit it feeds into has an activation function derivative of zero for all patterns. In order to remove the dominance of large-variance inputs, we divide the input weights by the input’s standard deviation. We then adjust the mean and standard deviation of all the hidden units’ net functions. This is called net control, as in Ref [21]. At this point, we have determined the initial input weights, and we are now ready to initialize the output weights. To solve for the weights connected to the output of the network, we use a technique called output weight optimization (OWO) [22, 23]. OWO minimizes the error function from Eq (9) with respect to Wo by solving the M sets of Nu equations given by

R WoT = C

Here, the cross-correlation matrix C and the auto-correlation matrix R are defined as before. In order to incorporate the regularization, we modify the R matrix elements, except the threshold element, as

R ← R + λ ⋅ diag(r)

where r is a vector containing the diagonal elements of R and diag() is an operator that creates a diagonal matrix from the vector.
The MLP network is now initialized and ready to be trained with first- or second-order algorithms. Training an MLP can be seen as an unconstrained optimization problem that usually involves first-order gradient methods such as BP and CG, and second-order methods such as Levenberg-Marquardt (LM) and Newton’s method, as the most popular learning algorithms. Training algorithms can be classified as

1. One-stage algorithms, in which all the weights of the network are updated simultaneously.
2. Two-stage algorithms, in which input and output weights are trained alternately.

Fig 6 shows a flowchart that summarizes all the training algorithms that will be described in the subsequent sections.
FIG 6 Typical training algorithms for training an MLP.
5.3: First-order learning algorithms
The first-order learning algorithms update the weights of the MLP based on gradient matrices, that is, first-order information, hence the name. We start by discussing the training of an MLP with a one-stage algorithm, in which we train both the output and input weights simultaneously using either the BP or the CG algorithm. We then describe a two-stage algorithm called OWO-hidden weight optimization.
5.3.1: Backpropagation algorithm
The BP algorithm is a greedy line search algorithm that uses a step size to achieve the maximum amount of decrease of the objective function at each step [5]. BP is a computationally efficient method, used in conjunction with gradient-based algorithms, that is widely applied to train MLPs [24]. However, due to the nonconvexity of the error function (Eq 9) in neural networks, BP is not guaranteed to find global minima but rather only local minima. Although this is considered a major drawback, Ref [25] recently discussed why local minima are still useful in many practical problems. In each training epoch, we update all the weights of the network in a BP algorithm as follows:

w ← w + z ⋅ g

Here, w is a vector of network weights, w = vec(W, Woh, Woi), and g is a vector of network gradients, g = vec(G, Goh, Goi). The gradient matrices are the negative partials of E w.r.t. the weights: G = −∂E/∂W, Goh = −∂E/∂Woh, and Goi = −∂E/∂Woi. The vec() operator performs a lexicographic ordering of a matrix into a vector. z is the optimal learning factor that is derived using a Taylor series expansion of the MSE E, expressed in terms of z, as [26]

z = −(∂E/∂z) / (∂²E/∂z²)  (39)

The BP algorithm can be summarized in Algorithm 4.

Algorithm 4 Backpropagation algorithm.
1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:   Calculate g
4:   Compute z from Eq (39)
5:   Update w as w ← w + z ⋅ g
6:   it ← it + 1
7: end while
As in Ref [5], the BP algorithm faces two major criticisms. First, it does not scale well, that is, the number of operations grows quickly for sufficiently large Nw; second, being a simple gradient descent procedure, it is unduly slow in the presence of flat error surfaces and is not a very reliable learning paradigm.
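One BP epoch for the network of Fig 5 can be sketched as follows. This is a simplified, assumption-laden illustration: it uses a fixed learning factor z instead of the optimal learning factor of Eq (39), the plain MSE without regularization, and matrix-form gradients; the function names, dataset, and network sizes are ours, not the chapter's.

```python
import numpy as np

def bp_step(X, T, W, Woh, Woi, z=0.05):
    """One BP epoch for the Fig 5 MLP under the MSE criterion.

    Gradients are the negative partials of E w.r.t. each weight matrix,
    and every weight is updated simultaneously as w <- w + z * g.
    """
    Nv = X.shape[0]
    Xa = np.hstack([X, np.ones((Nv, 1))])     # augmented inputs, xp(N+1) = 1
    net = Xa @ W.T                            # hidden net functions
    Op = 1.0 / (1.0 + np.exp(-net))           # sigmoid activations
    Y = Op @ Woh.T + Xa @ Woi.T               # linear outputs with bypass
    D = (T - Y) * 2.0 / Nv                    # output deltas (negative dE/dY)
    Goh = D.T @ Op                            # negative gradient w.r.t. Woh
    Goi = D.T @ Xa                            # negative gradient w.r.t. Woi
    Dh = (D @ Woh) * Op * (1.0 - Op)          # backpropagated hidden deltas
    G = Dh.T @ Xa                             # negative gradient w.r.t. W
    E = np.mean((T - Y) ** 2)
    return W + z * G, Woh + z * Goh, Woi + z * Goi, E

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 2))
T = np.sin(X[:, :1]) + X[:, 1:]               # smooth target function
W, Woh, Woi = (rng.normal(scale=0.5, size=s) for s in [(8, 3), (1, 8), (1, 3)])
errors = []
for _ in range(200):
    W, Woh, Woi, e = bp_step(X, T, W, Woh, Woi)
    errors.append(e)
```

Tracking the error across epochs mirrors debugging check 1 from Section 3.2: E should be nonincreasing when the step size is small enough.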
5.3.2: Training lemmas
Lemma 1
For the kth hidden unit, if f′(np(k)) = 0 for all patterns, then the weights feeding into that unit will not change during BP training.

Observe the partial derivative formula below: the partial of E with respect to w(k, n) is 0 under the conditions of the lemma.

2. During net control, it seems that standard deviations should be less than 2 and the mean should be close to 0.
Lemma 2

Let the kth hidden unit have any set of weights. Let units k + j, for j between 1 and (K − 1), have identically valued input and output weights as the kth unit. In other words, w(k + j, n) is the same for j between 0 and (K − 1), and woh(i, k + j) is the same for j between 0 and (K − 1). After training with BP, all K of these hidden units still have identically valued weights and can be replaced by a single hidden unit.
Note that as mn increases, the partial derivative becomes dominated by the first term, which has no information related to changes in the nth input.
3. Lemma 5 shows that if a multioutput MLP is trained using the same hidden units for each output, it is not optimal.
4. Lemma 3 shows that BP lacks affine invariance [27].
5.4: Second-order learning algorithms
The basic idea behind using a second-order method is to improve on the first-order algorithms by using the second derivative along with the first derivative [5]. We present two one-stage algorithms, namely Newton’s method and LM, and then a two-stage algorithm called OWO-multiple optimal learning factor [28–30].
5.4.1: Newton’s method
For Newton’s method, given a starting point, we construct a quadratic approximation to a twice-differentiable error function that matches the first- and second-order derivative values at that point. We then minimize this quadratic function instead of the original error function by expanding the Taylor series of E about the point wk.

We calculate the second-order direction d by solving, with OLS, the set of linear equations

H d = g  (48)

Assuming a quadratic error function in Eq (9) and H to be positive definite, applying the first-order necessary condition (FONC) [31] on all the weights in an MLP, we update the weights as

w ← w + d  (49)
Newton’s algorithm can be summarized in Algorithm 5.

Algorithm 5 Newton’s algorithm.
1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:   Calculate g and H from Eqs (46), (47)
4:   Compute d from Eq (48)
5:   Update w as w ← w + d
6:   it ← it + 1
7: end while
Newton’s method is quadratically convergent and affine invariant [16]. Since it converges fast, we would like to use it to train an MLP, but generally, the Hessian H is singular [32].

If the error function is quadratic, then the approximation yields an exact one-step solution; otherwise, the approximation provides only an estimate of the exact solution. In the case of a nonquadratic error measure, we will require a line search, and w is updated as

w ← w + z ⋅ d
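Newton's method, with the chapter's convention that g is the negative gradient and d solves Hd = g, can be sketched as follows. The nonquadratic test function and its derivatives are illustrative choices of ours, and `lstsq` stands in for the OLS solver so that a singular Hessian does not stop the iteration.

```python
import numpy as np

def newton(neg_grad, hess, w0, n_it=20):
    """Newton's method: solve H d = g (Eq 48), then w <- w + d (Eq 49)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_it):
        g = neg_grad(w)
        H = hess(w)
        d = np.linalg.lstsq(H, g, rcond=None)[0]  # second-order direction
        w = w + d
    return w

# Nonquadratic error E(w) = (1 - w1)^2 + 5*(w2 - w1^2)^2, minimum at (1, 1)
def neg_grad(w):
    w1, w2 = w
    return -np.array([-2*(1 - w1) - 20*w1*(w2 - w1**2),
                      10*(w2 - w1**2)])

def hess(w):
    w1, w2 = w
    return np.array([[2 - 20*w2 + 60*w1**2, -20*w1],
                     [-20*w1, 10.0]])

w_min = newton(neg_grad, hess, [0.0, 0.0])
```

Consistent with the discussion above, the intermediate error is not monotone (Newton's method may overshoot on a nonquadratic surface), yet the iterates still land on the minimum in very few steps.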
5.4.2: LM algorithm
The LM algorithm is a compromise between Newton’s method, which converges rapidly near local or global minima but may diverge, and gradient descent, which has assured convergence through a proper selection of the step size parameter but converges slowly. Following Eq. (45), the LM algorithm is a suboptimal method. Since H is usually singular in Newton’s method, an alternative is to modify the Hessian matrix as in the LM [33] algorithm, or to use a two-step method such as layer-by-layer training [34]. In LM, we modify the Hessian as

HLM = H + λ ⋅ I (51)
After obtaining HLM, the weights of the model are updated using Eq. (49).
The regularizing parameter λ plays a crucial role in how the LM algorithm functions. If we set λ equal to 0, then Eq. (52) reduces to Newton’s method (Eq. 49). On the other hand, if we assign a value to λ so large that λ ⋅ I overpowers the Hessian H, the LM algorithm behaves effectively as a gradient descent algorithm. Press et al. [35] recommend an excellent Marquardt recipe for the selection of λ.
From a practical perspective, the computational complexity of obtaining HLM can be demanding, mainly when the dimensionality of the weight vector w is high. Therefore, due to scalability constraints, LM is particularly suitable for small networks.

The LM algorithm can be summarized in Algorithm 6.

Algorithm 6 LM algorithm.
1: Initialize w, Nit, it ← 0
2: while it < Nit do
3:   Present all patterns to the network to compute the error Eold from Eq. (9)
4:   Calculate g and H from Eqs. (46), (47)
5:   Obtain HLM from Eq. (51)
6:   Compute d from Eq. (52)
7:   Update w as w ← w + d
8:   Recompute the error Enew by using the updated weights
9:   if Enew < Eold then
10:    Reduce the value of λ
11:    go to Step 3
12:  else
13:    Increase the value of λ
14:  end if
15:  it ← it + 1
16: end while
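The λ-update loop of Algorithm 6 can be sketched on a one-parameter curve-fitting problem. The exponential model, the toy data, and the Gauss-Newton approximation of H used here are illustrative assumptions, not the chapter’s MLP setup:

```python
import numpy as np

# Fit y = exp(a * x) to data via LM; the scalar 'a' plays the role of w.
x = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * x)                       # data generated with a = 0.7

def error(a):
    r = y - np.exp(a * x)
    return float(np.sum(r * r))           # squared error, analog of Eq. (9)

a, lam = 0.0, 1e-2                        # initial weight and regularizer λ
for it in range(50):
    r = y - np.exp(a * x)                 # residuals
    J = -x * np.exp(a * x)                # Jacobian dr/da
    g = 2.0 * np.sum(J * r)               # gradient
    H = 2.0 * np.sum(J * J)               # Gauss-Newton Hessian approximation
    H_lm = H + lam                        # H_LM = H + λI, Eq. (51)
    d = -g / H_lm                         # solve H_LM d = -g
    if error(a + d) < error(a):
        a, lam = a + d, lam * 0.5         # step succeeded: accept, reduce λ
    else:
        lam = lam * 10.0                  # step failed: reject, increase λ

print(round(a, 3))  # converges to ~0.7, the generating parameter
```

Note how a reduced error shrinks λ toward Newton-like steps, while an increased error grows λ toward cautious gradient-descent-like steps, exactly the trade-off described above.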
6: KL divergence
From an information theory point of view, Kullback-Leibler (KL) divergence is a relative entropy measure that estimates how closely we have modeled an approximate distribution q(x) with respect to the unknown distribution p(x). In order to arrive at the expression for KL divergence, we follow the probabilistic viewpoint. The following explanation is suitable for both continuous and discrete distributions. In order to quantitatively determine how good the approximate model q(x) is compared to p(x), we calculate the log-likelihood ratio as

log (p(x)/q(x)) (53)
Eq. (53) expresses, for any random sample x, how much more likely the data point is to occur under p(x) than under q(x). The intuition behind Eq. (53) is that if p(x) fits the data more closely, it will give a value greater than 0, whereas a value less than 0 indicates that q(x) fits the data better. If the models fit the data equally well, Eq. (53) gives a value of 0. So with Eq. (53), we can quantify how much better one model is than the other, given the dataset. For a large dataset with N sampled points that are independent and identically distributed (iid), we calculate an analytical term called the average predictive power:

(1/N) Σi log (p(xi)/q(xi))
The way the probabilistic interpretation connects to neural network design is through the assumption of iid data. A common misconception about KL divergence is to think of it as a distance between p(x) and q(x). However, this is wrong, since KL divergence does not follow the commutative property, that is, KL(p||q) ≠ KL(q||p).
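This asymmetry is easy to check numerically; the two discrete distributions below are made-up examples:

```python
import numpy as np

# Two discrete distributions over the same three outcomes.
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    # KL(p || q) = sum over x of p(x) * log(p(x) / q(x))
    return float(np.sum(p * np.log(p / q)))

print(kl(p, q))   # positive whenever p != q
print(kl(q, p))   # differs from kl(p, q): KL is not symmetric
```

Both values are nonnegative, but they are not equal, which is why KL divergence is not a true distance metric.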
7: Generalized linear models
Starting from Section 3 on linear regression, we add a nonlinear activation to the output neurons and create a class of models known as generalized linear models. Platt [36] has shown the utility of using nonlinear output activations such as sigmoids. We describe an algorithm that uses a sigmoidal output activation to bound the network outputs, as shown in Fig. 7. The error of the resulting generalized linear classifier (GLC) is

EG = (1/Nv) Σp Σi (tp(i) − Op(i))²

where Op(i) is defined as f(yp(i)) and f(⋅) denotes the sigmoid activation function.
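A small sketch of this GLC error computation, assuming a mean-squared form for EG (the equation itself is not reproduced in this excerpt); the data shapes and values here are arbitrary:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# Illustrative data: Nv = 5 patterns, 3 inputs, 2 outputs (all made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # input patterns x_p
W = rng.normal(size=(3, 2))          # linear output weights
T = rng.integers(0, 2, size=(5, 2))  # 0/1 targets t_p(i)

Y = X @ W                            # linear activations y_p(i)
O = sigmoid(Y)                       # bounded outputs O_p(i) = f(y_p(i))

# Mean-squared GLC error over all patterns and outputs
E_G = float(np.mean((T - O) ** 2))
print(0.0 <= E_G <= 1.0)  # True: with 0/1 targets and sigmoid outputs,
                          # each squared difference is at most 1
```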
FIG 7 Generalized linear model.
In order to mitigate this increase in EG, we need to map the linear output activations yp to sigmoidal outputs Op as

Op(i) = f(a ⋅ yp(i) + b)

where a and b are scalars that are found such that EG is minimized.

8: Kernel method
A second-order feedforward training algorithm can also be derived for a radial basis function (RBF) classifier. For the training patterns, the RBF basis functions are given as

ϕp(k) = exp(−βk Σn c(n)(xp(n) − mk(n))²)
Equating the gradient of Erbf(Wo) to 0, we solve for the output weight matrix Wo using OWO [37].
Let mk, βk, and c(n) denote the kth center vector, the kth spread parameter, and the weighted Euclidean distance measure weight for the nth input, respectively. These parameters are optimized using an efficient Newton’s algorithm [37].

9: Nonlinear SVM classifier
A nonlinear support vector machine (SVM) follows a structure similar to that of the RBF classifier. In a nonlinear SVM, K(Xp, Xq) ≡ ϕ(Xp)Tϕ(Xq) is called the kernel function and ϕ(Xq) is a feature function. For our comparisons, we use the Gaussian kernel K(Xp, Xq) = exp(−γ‖Xp − Xq‖²), γ > 0. For a given two-class classification problem, a nonlinear SVM solves the following convex optimization problem:

minimize Esvm = (1/2)WoTWo + C Σp ξp

subject to tp(WoTϕ(Xp) + b) ≥ 1 − ξp, ξp ≥ 0. Here, the weight vector Wo is Nsv by 1 and b is a bias. The basis vector Xp is Nsv by 1, and 1 ≤ p ≤ Nsv. C is a user-specified positive parameter, and the ξp are called the slack variables. We need to find the optimum values of the weight vector Wo, the bias b, and the slack variables ξp that minimize Esvm. The coordinate descent algorithm [39] is used for training. For multiclass problems, the one-versus-the-rest strategy is implemented.
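A hedged sketch of a nonlinear SVM with a Gaussian kernel; scikit-learn’s SVC (an SMO-based solver) stands in here for the coordinate descent trainer of [39], and the ring-shaped toy data are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Two classes that are not linearly separable: label points by their
# distance from the origin, so the true boundary is a circle.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# C and gamma correspond to the parameters C and γ in the text;
# kernel="rbf" selects K(Xp, Xq) = exp(-γ ||Xp - Xq||^2).
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)
clf.fit(X, t)

print(clf.score(X, t))  # training accuracy; the RBF kernel fits the
                        # circular boundary that a linear SVM cannot
```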
10: Tree ensembles

10.1: Decision trees
A decision tree [40] is an upside-down tree-like structure generated from training data, with the main root at the top, as shown in Fig. 8. Each node is split into subsequent nodes called branches/subtrees, depending on the decisions made using the training data. The points where the tree does not split further are the leaf/terminal nodes, which indicate the final decision. Decision trees are used for both approximation and classification problems.
FIG 8 Decision tree.
A decision tree is learned by splitting branches into subbranches depending on the lowest error values. The split decision for approximation data is made based on the variance of the node.
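The variance-based split rule for approximation data can be sketched as follows: among candidate thresholds, pick the one minimizing the size-weighted variance of the two child nodes. The single-feature toy data here are illustrative:

```python
import numpy as np

def split_cost(x, y, threshold):
    # Size-weighted variance of the two child nodes produced by x <= threshold.
    left, right = y[x <= threshold], y[x > threshold]
    cost = 0.0
    for child in (left, right):
        if child.size:
            cost += child.size / y.size * np.var(child)
    return cost

# Toy regression data: the target jumps at x = 0.5, so the best split
# should land there.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.9, 5.1])

thresholds = (x[:-1] + x[1:]) / 2          # midpoints between sorted samples
best = min(thresholds, key=lambda t: split_cost(x, y, t))
print(best)  # 0.5: the split yielding the lowest child-node variance
```

A full tree repeats this search recursively on each child node until a stopping criterion (depth, node size, or minimum variance reduction) is met.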