"Artificial Intelligence and Machine Learning for Predictive and Analytical Rendering in Edge Computing focuses on the role of AI and machine learning as it impacts and works alongside Edge Computing. Sections cover the growing number of devices and applications in diversified domains of industry, including gaming, speech recognition, medical diagnostics, robotics and computer vision and how they are being driven by Big Data, Artificial Intelligence, Machine Learning and distributed computing, may it be Cloud Computing or the evolving Fog and Edge Computing paradigms. Challenges covered include remote storage and computing, bandwidth overload due to transportation of data from End nodes to Cloud leading in latency issues, security issues in transporting sensitive medical and financial information across larger gaps in points of data generation and computing, as well as design features of Edge nodes to store and run AI/ML algorithms for effective rendering"
Part I: AI and machine learning
Chapter 1: Supervised learning
8: Principal component analysis
9: Linear discriminant analysis
10: Independent component analysis
References
Chapter 4: Regression analysis
Abstract
3: Markov decision process planning problem
4: Truth content discovery algorithm
5: Trustworthy and scalable service providers algorithm
6: Efficient feature extraction and classification (EFEC) algorithm
7: QUERY retrieval time (QRT)
8: Trust content discovery and trustworthy and scalable service providers algorithm
9: Efficient feature extraction and classification (EFEC) algorithm and customer review datasets
10: Summary
4: Building the diabetic diagnostic criteria
5: Evaluating the diabetes outcomes using classification algorithms
Chapter 11: Integrating AI in e-procurement of hospitality industry in the UAE
2: Applications of artificial intelligence, machine learning, and blockchain technology
3: It takes two to tango: Future of artificial intelligence and machine learning in blockchain technology
4: Edge computing: A potential use case of blockchain
5: Conclusions
References
Part II: Data science and predictive analysis
Chapter 13: Implementing convolutional neural network model for prediction in medical imaging
Abstract
1: Introduction
2: Convolutional neural networks
3: Implementing CNN for biomedical imaging and analysis
4: Architecture models for different image type
4: Machine learning algorithms for fire outbreak prediction
5: Result and discussion
2: Big Data
3: Big Data with cloud computing
4: Internet of Things (IoT)
3: Contribution of intelligent e-learning system using BN model
4: Learner assessment model
5: Results and discussions
6: Conclusions and future work
5: Discussion and conclusions
2: Theoretical background and related works
3: Neural hybrid recommendation (NHybF)
Chapter 23: An intelligent framework to assess core competency using the level prediction model (LPM)
Part III: Edge computing
Chapter 24: Edge computing: A soul to Internet of things (IoT) data
Abstract
1: Introduction
2: Edge computing characteristics
3: New challenges in Internet of technology (IoT): Edge computing
4: Edge computing support to IoT functionality
5: IoT applications: Cloud or edge computing?
6: Benefits and potential of edge computing for IoT
7: Use case: Edge computing in IoT
8: Pertinent open issues which require additional investigations for edge computing
9: Conclusions
References
Chapter 25: 5G: The next-generation technology for edge communication
5: Components used in 5G technology/network
6: Differences from 4G architecture
2: Overview of edge computing
3: Security and privacy in edge computing
4: Intersection of machine learning and edge using enabling technologies
5: Machine learning and edge bringing AI to IoT
2: Hybrid software-defined networks
3: Security challenges in hybrid software-defined networks
4: Solutions for hybrid software-defined networks
5: Learning techniques for hybrid software-defined networks
6: Discussion and implementation
2: Features and differences between cloud, fog, and edge computing
3: Framework and programming models: Architecture of fog computing
4: Moving cloud to edge computing
5: Case study: Edge computing for intelligent aquaculture
6: Conclusions
Chapter 29: A comparative study on IoT-aided smart grids using blockchain platform
Abstract
1: Introduction to smart grid, IoT role, and challenges of smart grid implementations
2: Secure smart grid using blockchain technology
3: Conclusions
References
Chapter 30: AI cardiologist at the edge: A use case of a dew computing heart monitoring solution
AI and machine learning
Chapter 1: Supervised learning
Kanishka Tyagi^a; Chinmay Rane^b; Michael Manry^c
a Aptiv Advanced Research Center, Agoura Hills,
b Quantiphi, Inc., Marlborough, MA, United States
c Department of Electrical Engineering, The University of Texas at Arlington, Arlington, TX, United States
Abstract
Machine learning models learn different tasks with different paradigms, all of which aim to improve the models through training. Supervised learning is a common machine learning training paradigm that has been used successfully in real-world machine learning applications. Typical supervised learning involves two phases. In phase 1 (commonly called training), we provide inputs and expected outputs (also known as ground truth) and train a model with respect to a metric using an optimization algorithm. In phase 2 (commonly called testing), we deploy the model on unseen data and expect it to either classify or approximate the outputs. Although supervised learning is covered in almost all machine learning textbooks, we introduce and explain supervised learning from an application point of view and in relation to edge computing. All the algorithms are explained from a mathematical and theoretical point of view and from a programmer's perspective. This should help readers gain hands-on experience in implementing algorithms for a variety of problems.
Keywords
Linear regression; Logistic regression; Steepest descent; Conjugate gradient; Multilayer perceptron; Second-order algorithms; KL divergence; Generalized linear models; Kernel machines; Bootstrapping
1: Introduction
This chapter discusses the various supervised learning paradigms used to train and deploy machine learning models. While the theory of supervised learning is dominated mainly by the computer vision and natural language processing communities, much research is still required on its application to edge computing. With the recent advances in machine learning algorithms, combined with increasing computational power, edge computing is another application area where machine learning will improve the current technology. According to a white paper from Cisco [1], 50 billion Internet of Things (IoT) devices will be connected to the internet by the end of 2020, and although an estimated 850 ZB of data will be generated per year outside the cloud by 2021, global data center traffic is only approximately 21 ZB. This means that a transformation from big cloud data centers to a wide range of edge devices is happening.
We plan to cover the basics of traditional supervised learning algorithms and the tricks of the trade used when implementing them for various real-life applications. The assumptions and programming decisions made while implementing them are also discussed. We start with generalized linear models along with the optimization algorithms (gradient descent, Newton's method) and metrics (mean square error [MSE], cross-entropy) used in training the model. Then, Naive Bayes and kernel methods are discussed along with the decision tree and its variants. We provide pseudocode for all the supervised learning algorithms that we discuss to give readers an idea of how the algorithms are implemented from theory to practice. In addition, supervised learning algorithms that capture the variation of the input data distribution, algorithms that project data into a higher-dimensional space, and first- and second-order learning paradigms are the subject matter of this chapter. The chapter covers the learning paradigms in sync with their modern-day application to edge computing from an implementation perspective.
Edge computing and machine learning will enable devices to have a more pervasive and fine-grained intelligence. Supervised learning is a common way of training machine learning algorithms that can later be used in the inference stage on IoT and other edge devices.
This chapter covers several supervised learning algorithms that are commonly used in various real-life applications. Section 2 covers the basic perceptron model, which helps in understanding the learning algorithm. In Section 3, the linear regression algorithm is discussed, which extends the idea of the perceptron algorithm to solve supervised regression tasks. We discuss commonly used gradient-based learning algorithms along with their pseudocode. Section 4 discusses logistic regression, which is a very commonly used classifier due to its simple structure and generalizability. In Section 5, we bring together the concepts from the previous sections to discuss the multilayer perceptron (MLP) network, a widely used neural network in real-life applications. We explain the structure and initialization strategy and detail the various learning algorithms that are used to train an MLP. Section 6 discusses the Kullback-Leibler (KL) divergence measure used to calculate how different two probability distributions are from each other. Section 7 discusses the generalized linear models, which extend the linear models with a nonlinear activation on the output nodes. Section 8 covers the kernel methods, leading to Section 9, which explains nonlinear support vector machine (SVM) classifiers. We conclude this chapter in Section 10 by discussing various tree ensemble algorithms.
2: Perceptron
Perceptron algorithms were introduced in the early 1960s; however, Minsky and Papert [2] showed their limitations. Rosenblatt [3] described an alpha perceptron, which is an example of a statistical pattern recognition (SPR) system. In a typical SPR system, the features are obtained from the raw input using no learning mechanism but instead a common-sense rule. A human therefore decides what a good feature is, checks whether it works, and tries another if it does not. We learn the weight associated with each feature activation to obtain a single scalar quantity and then decide the output based on whether this quantity is above or below a threshold. In a typical perceptron model, the input vector x is used to compute a weighted sum over all the neurons, to which a bias vector is added. This bias vector is also known as the threshold vector in the literature.
(1)
The linear perceptron will output y depending on the following rule, as shown in Fig. 1:
(2)
The perceptron algorithm is fast and straightforward, and if the dataset is linearly separable, it is guaranteed to converge.
FIG 1 Perceptron activation functions.
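As a minimal sketch of the weighted-sum-and-threshold rule described above, the following Python code trains a perceptron on a small linearly separable problem. The 0/1 output coding, the threshold at zero, the learning rate `lr`, and the names `predict` and `train_perceptron` are assumptions made for illustration; they are not taken from the chapter's Eqs. (1) and (2).

```python
def predict(w, b, x):
    """Threshold the weighted sum w.x + b: output 1 if it is non-negative, else 0."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if net >= 0 else 0


def train_perceptron(samples, n_inputs, lr=1.0, max_epochs=100):
    """Classic perceptron rule: on each mistake, move w and b toward the target."""
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in samples:
            err = t - predict(w, b, x)   # -1, 0, or +1
            if err != 0:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
                mistakes += 1
        if mistakes == 0:                # every sample is classified correctly
            break
    return w, b


# Logical AND is linearly separable, so the rule converges.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND_DATA, n_inputs=2)
```

The loop stops as soon as a full pass over the data produces no mistakes, which is the practical form of the convergence guarantee for separable data.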
3: Linear regression
In this section, we discuss the structure and notation of a linear regression. As shown in Fig. 2, a linear regression is a weight matrix W that transforms an input vector x into a discriminant vector y [4]. The weight w(m, n) connects the nth input to the mth output. The training dataset (x_p, t_p) consists of N-dimensional input vectors x_p and M-dimensional desired output vectors t_p. The pattern number p varies from 1 to N_v, where N_v denotes the number of training patterns. The threshold is handled by augmenting x with an extra element, which is equal to 1, as x_a = [1 : x^T]^T. So x_a contains N_u basis functions, where N_u = N + 1. For the pth training pattern, the network output vector y_p can be written as
$$ \mathbf{y}_p = \mathbf{W}\,\mathbf{x}_{ap} \tag{3} $$
where x_ap denotes x_a for the pth pattern.
FIG 2 Linear regression.
3.1: Training a linear regression
To train the linear regression, we minimize the error function E, which is a surrogate for a nonsmooth classification error. As in Ref. [5], from a Bayesian point of view, we consider maximizing the likelihood function or minimizing the MSE in the least-squares sense, where the MSE between the desired outputs and the actual outputs is
(4)
Here, the target output for the correct output. M denotes the total number of outputs. We minimize the error function from Eq. (9) with respect to W by solving the M sets of N + 1 linear equations.
Since R in Eq. (35) is often ill-conditioned, it is unsafe to use Gauss-Jordan elimination. Eq. (35) is solved using orthogonal least squares (OLS) [6]. In Ref. [7], OLS is used to solve for radial basis function network parameters. OLS is useful for practical applications for two primary reasons. First, the training is fast since solving linear equations is straightforward. Second, it helps us to avoid some local minima [8]. In terms of optimization theory, solving Eq. (35) for W is merely Newton's algorithm for the output weights [9].
Given
(8)
and the error function
(15)
3.2: Steepest descent
The steepest descent gradient algorithm can be summarized in Algorithm 1.
Algorithm 1 Steepest descent gradient algorithm
1: Initialize w, N_it, it ← 0
2: while it < N_it do
3:   Calculate g from Eq. (12)
4:   Compute B2 from Eq. (15)
5:   Update w as w ← w + B2 ⋅ g
6:   it ← it + 1
7: end while
Fig. 3 illustrates gradient descent using a two-dimensional (2D) contour plot. The X-axis denotes the first weight w1 and the Y-axis the second weight w2. The arrows [g0, g1, g2, g3, g4, g5] in Fig. 3 indicate the directions of the negative gradients that the weights follow to reach a minimum. The learning factor derived in Eq. (15) controls the step size. We can observe that the gradients get smaller as they approach the minimal point.
FIG 3 Steepest descent 2D contour plot.
From a programmer's perspective, some of the debugging steps are:
1. E is nonincreasing
2. E_g approaches 0
3. B2 ≥ 0
4.
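The loop of Algorithm 1 and the debugging checks can be sketched in Python for a single-output linear regression. This is an illustration under stated assumptions: the error is the plain MSE averaged over patterns, g is the negative gradient, and the step size `z` is obtained by exact minimization of the quadratic error along g, standing in for the learning factor B2 of Eq. (15), whose exact form is not reproduced here. The names `mse` and `steepest_descent` and the toy data are hypothetical.

```python
def mse(w, X, t):
    """Mean square error of the linear model y = w . x over all patterns."""
    return sum((tp - sum(wi * xi for wi, xi in zip(w, xp))) ** 2
               for xp, tp in zip(X, t)) / len(X)


def steepest_descent(X, t, n_it=200):
    """Steepest descent with an exact line search for a quadratic error."""
    n_v = len(X)
    w = [0.0] * len(X[0])
    errors = [mse(w, X, t)]
    for _ in range(n_it):
        # residuals e_p = t_p - y_p
        e = [tp - sum(wi * xi for wi, xi in zip(w, xp)) for xp, tp in zip(X, t)]
        # negative gradient of the MSE with respect to w
        g = [2.0 / n_v * sum(ep * xp[n] for ep, xp in zip(e, X))
             for n in range(len(w))]
        # u_p = g . x_p is the change in output p per unit step along g
        u = [sum(gi * xi for gi, xi in zip(g, xp)) for xp in X]
        denom = sum(up * up for up in u)
        if denom == 0.0:          # zero gradient: already at the minimum
            break
        z = sum(ep * up for ep, up in zip(e, u)) / denom   # step size, always >= 0
        w = [wi + z * gi for wi, gi in zip(w, g)]
        errors.append(mse(w, X, t))
    return w, errors


# Noise-free data from t = 1 + 2x; each input is augmented with a leading 1.
X = [[1.0, float(x)] for x in range(5)]
t = [1.0 + 2.0 * x for x in range(5)]
w, errors = steepest_descent(X, t)
```

Recording the error after every update makes the first debugging check (E is nonincreasing) a one-line assertion over `errors`.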
3.3: Conjugate gradient
As we saw in the previous section, the weights are updated in the negative gradient direction in a basic gradient algorithm. Although the error function decreases most rapidly along the negative direction of the gradient, this does not necessarily produce fast convergence. The conjugate gradient (CG) algorithm [10] performs a line search in the conjugate direction and converges faster than the backpropagation (BP) algorithm. Although CG is a general unconstrained optimization technique, its use in efficiently training an MLP is well documented in Ref. [11].
To train an MLP using CG algorithm, we use a direction vector that is obtained from the
gradient g as
(16)
Here, p = vec(P, P_oh, P_oi), and P, P_oi, and P_oh are the direction vectors. B1 is the ratio of the gradient energies from two consecutive iterations. This direction vector, in turn, updates all the weights simultaneously as follows:
The CG algorithm can be summarized in Algorithm 2.
Algorithm 2 Conjugate gradient algorithm
1: Initialize w, N_it, it ← 0
2: while it < N_it do
3:   Calculate p from g
4:   Compute z from Eq. (15)
5:   Update w as w ← w + z ⋅ p
6:   it ← it + 1
7: end while
Fig. 4 illustrates CG using a 2D contour plot. The X-axis denotes the first weight w1 and the Y-axis the second weight w2. From the plot, we can observe that CG needs N + 1 steps to reach the minimum.
FIG 4 Conjugate gradient 2D contour plot.
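A minimal Python sketch of Algorithm 2 for a single-output linear regression follows. It assumes a Fletcher-Reeves style direction update, p ← g + B1 ⋅ p with g the negative gradient, which matches the description of B1 as a ratio of gradient energies but is not copied from Eq. (16); the exact line search is likewise an assumption that holds for a quadratic error. The helper names and the toy data are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(X, t, n_it):
    """CG for the MSE of a linear model y = w . x (a quadratic error)."""
    n_v = len(X)
    w = [0.0] * len(X[0])
    p = [0.0] * len(w)        # direction vector
    g_energy_old = None       # gradient energy from the previous iteration
    for _ in range(n_it):
        e = [tp - dot(w, xp) for xp, tp in zip(X, t)]            # residuals
        g = [2.0 / n_v * sum(ep * xp[n] for ep, xp in zip(e, X))
             for n in range(len(w))]                             # negative gradient
        g_energy = dot(g, g)
        if g_energy == 0.0:   # already at the minimum
            break
        # B1: ratio of gradient energies from two consecutive iterations
        b1 = 0.0 if g_energy_old is None else g_energy / g_energy_old
        p = [gi + b1 * pi for gi, pi in zip(g, p)]
        # exact line search along p for a quadratic error
        u = [dot(p, xp) for xp in X]
        z = dot(e, u) / dot(u, u)
        w = [wi + z * pi for wi, pi in zip(w, p)]
        g_energy_old = g_energy
    return w


# One input plus the threshold gives N + 1 = 2 weights, so 2 iterations suffice.
X = [[1.0, float(x)] for x in range(5)]
t = [1.0 + 2.0 * x for x in range(5)]
w = conjugate_gradient(X, t, n_it=2)
```

With N = 1 input the model has N + 1 = 2 weights, and the sketch reaches the least-squares solution in two iterations, in line with the step count read off the contour plot.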
From a programmer’s perspective, some of the debugging steps are as follows:
Since it is a two-class classification, the classes C1 and C2 can be denoted as y ∈ {0, 1}. Linear regression does not normally work in classification since, even if y = 0 or 1, the model output can be greater than 1 or less than 0. Therefore, we use logistic regression, in which the output stays between 0 and 1. It should be emphasized that this is still a classification rather than a regression. Here, σ() is a logistic function defined as
$$ \sigma(a) = \frac{1}{1 + e^{-a}} \tag{21} $$
When w^T x tends to ∞, σ(w^T x) tends to 1, and when w^T x tends to −∞, σ(w^T x) tends to 0.
The decision boundary for logistic regression is a property of the parameters and not of the training set. So, if there is a nonlinear decision boundary, it is mapped according to the nonlinear activation function used in logistic regression.
Since the logistic function is nonlinear (sigmoid), the decision boundary can be learned as manifold learning of nonlinear surfaces. In order to train the logistic regression, the cost function J(w) should preferably be convex so that only one global minimum is present. We would not use the MSE as in linear regression since, for logistic regression, it is a nonconvex function with multiple local minima. However, there are ways to train a logistic regression with the MSE [12, 13].
The logistic regression cost function will be as follows:
(27)
Therefore,
(28)
In order to train, we minimize J(w) using the gradient descent algorithm, as in Algorithm 3. Algorithm 3 is a fundamental building block for more advanced algorithms like CG, BFGS, or L-BFGS.
Algorithm 3 Gradient descent algorithm
1: Initialize w, N_it, it ← 0
2: while it < N_it do
3:   Calculate g
4:   Update w as w ← w + z ⋅ g
5:   it ← it + 1
6: end while
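Algorithm 3 applied to logistic regression can be sketched as follows. The sketch assumes the standard cross-entropy (negative log likelihood) form for J(w), a fixed step size `z`, and g as the negative gradient; the chapter's own cost expression in Eqs. (27) and (28) is not reproduced here, so these choices and the names `sigmoid`, `cross_entropy`, and `train_logistic` are illustrative only.

```python
import math


def sigmoid(a):
    """Logistic function 1 / (1 + exp(-a)), written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    ea = math.exp(a)
    return ea / (1.0 + ea)


def cross_entropy(w, X, y):
    """Average negative log likelihood J(w) of the labels under the model."""
    eps = 1e-12   # keeps log() away from 0
    total = 0.0
    for xp, yp in zip(X, y):
        h = sigmoid(sum(wi * xi for wi, xi in zip(w, xp)))
        total -= yp * math.log(h + eps) + (1 - yp) * math.log(1 - h + eps)
    return total / len(X)


def train_logistic(X, y, z=0.5, n_it=2000):
    """Fixed-step gradient descent: w <- w + z * g, g = negative gradient of J."""
    w = [0.0] * len(X[0])
    for _ in range(n_it):
        h = [sigmoid(sum(wi * xi for wi, xi in zip(w, xp))) for xp in X]
        g = [sum((yp - hp) * xp[n] for yp, hp, xp in zip(y, h, X)) / len(X)
             for n in range(len(w))]
        w = [wi + z * gi for wi, gi in zip(w, g)]
    return w


# Toy two-class data: a leading 1 handles the threshold, the label is 1 when x > 0.
X = [[1.0, x] for x in (-3.0, -2.0, -1.0, 1.0, 2.0, 3.0)]
y = [0, 0, 0, 1, 1, 1]
w = train_logistic(X, y)
predictions = [1 if sigmoid(sum(wi * xi for wi, xi in zip(w, xp))) >= 0.5 else 0
               for xp in X]
```

Thresholding the sigmoid output at 0.5 turns the model into the two-class classifier discussed above.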
5: Multilayer perceptron
In this section, we start by describing the multilayer perceptron (MLP), a nonlinear signal processor with good approximation and classification properties. The MLP has basis functions that can adapt during the training process by utilizing example inputs and desired outputs. An MLP minimizes an error criterion and closely mimics an optimal processor in which the computational burden of processing an input vector is controlled by slowly varying the number of coefficients [16, 17]. We review the first- and second-order training algorithms for the MLP, followed by a classifier design of the MLP through regression.
5.1: Structure and notation
Fig. 5 illustrates a fully connected MLP with a single hidden layer. The input weights w(k, n) connect the nth input to the kth hidden unit. Output weights w_oh(m, k) connect the kth hidden unit's nonlinear activation O_p(k) to the mth output y_p(m), which has a linear activation. The bypass weights w_oi(m, n) connect the nth input to the mth output. The training data, described by the set of independent, identically distributed input-output pairs, consist of N-dimensional input vectors x_p and M-dimensional desired output vectors t_p. The pattern number p varies from 1 to N_v, where N_v denotes the number of training vectors present in the dataset. Let N_h denote the number of hidden units. In order to handle the thresholds in the input layer, the input vector is augmented by an extra element x_p(N + 1), where x_p(N + 1) = 1. For each training pattern p, the hidden layer net function vector n_p can be written as
$$ \mathbf{n}_p = \mathbf{W}\,\mathbf{x}_p \tag{31} $$
The kth element of the hidden unit activation vector O_p is calculated as O_p(k) = f(n_p(k)), where f(⋅) denotes the sigmoid activation function. The network output vector y_p can be written as
$$ \mathbf{y}_p = \mathbf{W}_{oi}\,\mathbf{x}_p + \mathbf{W}_{oh}\,\mathbf{O}_p \tag{32} $$
The expression for the actual outputs given in Eq (32) can be rewritten as
$$ \mathbf{y}_p = \mathbf{W}_o\,\mathbf{X}_a \tag{33} $$
where X_a = [x_p^T : O_p^T]^T is the augmented input column vector with N_u basis functions, where N_u = 1 + N + N_h. The total number of weights is N_w = N_h ⋅ (1 + N) + M ⋅ N_h. Similarly, W_o is the M by N_u augmented weight matrix defined as W_o = [W_oi : W_oh], ordered to match X_a.
FIG 5 Fully connected MLP.
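The forward computation described in this subsection can be sketched in Python as follows. The sketch assumes the net function is the input weight matrix times the augmented input, and that each linear output is the sum of a bypass term and a hidden-unit term; the function name `mlp_forward` and the small example weights are hypothetical.

```python
import math


def mlp_forward(x, W, W_oh, W_oi):
    """Forward pass of a single-hidden-layer MLP with bypass weights.

    x    : N inputs (the constant 1 is appended here)
    W    : N_h x (N + 1) input weights
    W_oh : M x N_h output weights from the hidden units
    W_oi : M x (N + 1) bypass weights from the inputs
    """
    xa = list(x) + [1.0]                                         # x_p(N + 1) = 1
    net = [sum(w * v for w, v in zip(row, xa)) for row in W]     # n_p
    O = [1.0 / (1.0 + math.exp(-n)) for n in net]                # O_p(k) = f(n_p(k))
    y = [sum(w * v for w, v in zip(r_oi, xa)) +
         sum(w * o for w, o in zip(r_oh, O))
         for r_oi, r_oh in zip(W_oi, W_oh)]                      # linear outputs
    return net, O, y


# N = 2 inputs, N_h = 2 hidden units, M = 1 output.
W = [[0.0, 0.0, 0.0], [1.0, -1.0, 0.0]]
W_oh = [[2.0, 0.0]]
W_oi = [[0.5, 0.0, 1.0]]
net, O, y = mlp_forward([1.0, 1.0], W, W_oh, W_oi)
```

Both net functions are zero for this input, so both hidden activations equal 0.5 and the single output is the bypass term 1.5 plus the hidden term 1.0.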
To train an MLP, we recast the MLP learning problem as an optimization problem and use a structural risk minimization framework to design the learning algorithm [5, 18]. Essentially, this framework is used to minimize the error function E as in Eq. (9), which is a surrogate for a nonsmooth classification error. As in Ref. [5], from a Bayesian point of view, we consider maximizing the likelihood function or minimizing the MSE in a least-squares sense. Therefore, the MSE between the desired outputs and the actual outputs is defined as
(34)
Here, λ is an L2 regularization parameter used to avoid memorization and overfitting. The nonlinearity in y_p causes the error E to be nonconvex, and so in practice, local minima of the error function may be found. In the earlier discussion, we have assumed that t_p has a Gaussian distribution given the input x_p. In case the conditional distribution of the targets given the input is a Bernoulli distribution, the error function, which is given by the negative log likelihood, is then a cross-entropy error function [5].
In Ref. [5], it is concluded that using a cross-entropy error function instead of the MSE for a classification problem leads to faster training as well as improved generalization. Apart from the cross-entropy and L2 error forms, we also have an L1 error measure. Golik et al. [19] and Simard et al. [20] provide a good comparison between the L2 and cross-entropy error measures and suggest using the cross-entropy error function for classification in order to have faster training and improved generalization. Our goal is to obtain an optimal value of the weights connected in an MLP. In order to achieve this, we use the empirical risk minimization [17] framework to design the learning algorithms. An essential benefit of converting the training of an MLP into an optimization problem is that we can now use various optimization algorithms to optimize the learning of an MLP.
5.2: Initialization
5.2.1: Input means and standard deviations
If some inputs have much larger standard deviations than others, they can dominate the training, even if they are relatively useless. Inputs are therefore normalized to zero mean and unit standard deviation.
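A minimal sketch of this normalization step in Python is shown below. The function name `normalize_inputs`, the use of the population standard deviation, and the choice to leave constant inputs unscaled are assumptions made for illustration.

```python
def normalize_inputs(X):
    """Shift and scale each input column to zero mean and unit standard deviation."""
    n_v, n = len(X), len(X[0])
    means = [sum(xp[j] for xp in X) / n_v for j in range(n)]
    stds = []
    for j in range(n):
        var = sum((xp[j] - means[j]) ** 2 for xp in X) / n_v
        stds.append(var ** 0.5 if var > 0 else 1.0)   # constant inputs are left unscaled
    Xn = [[(xp[j] - means[j]) / stds[j] for j in range(n)] for xp in X]
    return Xn, means, stds


# The second input is 1000 times larger than the first before normalization.
X = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
Xn, means, stds = normalize_inputs(X)
```

The returned `means` and `stds` should be stored so that the same transformation can be applied to unseen inputs at inference time.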
5.2.2: Randomizing the input weights
Following Manry [16], the input weight matrix W is initialized randomly from a zero-mean Gaussian random number generator. The training of the input weights strongly depends on the gradient of the hidden units' activation functions with respect to the inputs. Training of the input weights will cease if the hidden units they feed into have an activation function derivative of zero for all patterns. In order to remove the dominance of large-variance inputs, we divide the input weights by the input's standard deviation. Therefore, we adjust the mean and standard deviation of all the hidden units' net functions. This is called net control, as in Ref. [21]. At this point, we have determined the initial input weights, and we are now ready to initialize the output weights. To solve for the weights connected to the output of the network, we use a technique called output weight optimization (OWO) [22, 23]. OWO minimizes the error function from Eq. (9) with respect to W_o by solving the M sets of N_u equations given by
(35)
Here, the cross-correlation matrix C, auto-correlation matrix R
In order to incorporate the regularization, we modify the R matrix elements, except the threshold, as
(37)
where r is a vector containing the diagonal elements of R and diag() is an operator that creates a
diagonal matrix from the vector
The MLP network is now initialized and ready to be trained with first- or second-order algorithms. Training an MLP can be seen as an unconstrained optimization problem, for which the most popular learning algorithms are first-order gradient methods such as BP and CG and second-order methods such as Levenberg-Marquardt (LM) and Newton's method. Training algorithms can be classified as
5.3: First-order learning algorithms
The first-order learning algorithms update the weights of the MLP based on gradient matrices, that is, first-order information, hence the name. We start by discussing the training of an MLP with a one-stage algorithm, in which we train both the output and input weights simultaneously using either the BP or the CG algorithm. We then describe a two-stage algorithm called OWO-hidden weight optimization.
5.3.1: Backpropagation algorithm
The BP algorithm is a greedy line search algorithm that chooses a step size to achieve the maximum decrease of the objective function at each step [5]. BP is a computationally efficient method that is widely used in conjunction with gradient-based algorithms to train an MLP [24]. However, due to the nonconvexity of the error function (Eq. 9) in neural networks, BP is not guaranteed to find global minima but rather only local minima. Although this is considered a major drawback, Ref. [25] discusses why local minima are still useful in many practical problems. In each training epoch, we update all the weights of the network in a BP algorithm as follows:
$$ \mathbf{w} \leftarrow \mathbf{w} + z \cdot \mathbf{g} \tag{38} $$
Here, w is a vector of network weights, w = vec(W, W_oh, W_oi), and g is a vector of network gradients, g = vec(G, G_oh, G_oi). The gradient matrices are the negative partial derivatives of E with respect to the weights: G = −∂E/∂W, G_oh = −∂E/∂W_oh, and G_oi = −∂E/∂W_oi. The vec() operator performs a lexicographic ordering of a matrix into a vector. z is the optimal learning factor, which is derived using a Taylor series expansion of the MSE E, expressed in terms of z, as [26]
(39)
The BP algorithm can be summarized in Algorithm 4.
Algorithm 4 Backpropagation algorithm
1: Initialize w, N_it, it ← 0
2: while it < N_it do
3:   Calculate g
4:   Compute z from Eq. (39)
5:   Update w as w ← w + z ⋅ g
6:   it ← it + 1
7: end while
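One iteration of Algorithm 4 can be sketched in Python for a small MLP. To keep the example short, the sketch makes several simplifying assumptions that differ from the full network of Fig. 5: there are no bypass weights, the error is the unregularized MSE, and the learning factor `z` is a fixed constant rather than the optimal value of Eq. (39). All function names and the example weights are hypothetical.

```python
import math


def forward(x, W, W_oh):
    """Hidden sigmoid layer followed by a linear output layer (no bypass weights)."""
    xa = list(x) + [1.0]
    O = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, xa)))) for row in W]
    y = [sum(w * o for w, o in zip(row, O)) for row in W_oh]
    return xa, O, y


def error(data, W, W_oh):
    """MSE averaged over patterns and summed over outputs."""
    total = 0.0
    for x, t in data:
        _, _, y = forward(x, W, W_oh)
        total += sum((tm - ym) ** 2 for tm, ym in zip(t, y))
    return total / len(data)


def negative_gradients(data, W, W_oh):
    """Backpropagate output errors to get G = -dE/dW and G_oh = -dE/dW_oh."""
    G = [[0.0] * len(W[0]) for _ in W]
    G_oh = [[0.0] * len(W_oh[0]) for _ in W_oh]
    scale = 2.0 / len(data)
    for x, t in data:
        xa, O, y = forward(x, W, W_oh)
        delta_o = [scale * (tm - ym) for tm, ym in zip(t, y)]    # output deltas
        for m, dm in enumerate(delta_o):
            for k, ok in enumerate(O):
                G_oh[m][k] += dm * ok
        for k, ok in enumerate(O):
            # hidden delta: f'(net) times the back-propagated output deltas
            dk = ok * (1.0 - ok) * sum(dm * W_oh[m][k] for m, dm in enumerate(delta_o))
            for n, xn in enumerate(xa):
                G[k][n] += dk * xn
    return G, G_oh


def bp_step(data, W, W_oh, z):
    """One BP update w <- w + z * g with a fixed learning factor z."""
    G, G_oh = negative_gradients(data, W, W_oh)
    W = [[w + z * g for w, g in zip(rw, rg)] for rw, rg in zip(W, G)]
    W_oh = [[w + z * g for w, g in zip(rw, rg)] for rw, rg in zip(W_oh, G_oh)]
    return W, W_oh


# XOR patterns with small, arbitrary initial weights.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
W = [[0.5, -0.4, 0.1], [-0.3, 0.6, -0.2]]
W_oh = [[0.7, -0.5]]
W_new, W_oh_new = bp_step(data, W, W_oh, z=0.1)
```

A useful debugging habit is to compare one entry of `G` against a central finite difference of `error`, which catches most sign and indexing mistakes in the backward pass.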
As in Ref. [5], the BP algorithm has two major criticisms. First, it does not scale well, that is, it takes operations for sufficiently large N_w; second, being a simple gradient descent procedure, it is unduly slow in the presence of flat error surfaces and is not a very reliable learning paradigm.
5.3.2: Training lemmas
Lemma 1
For the kth hidden unit, if f ′(n p (k)) = 0 for all patterns, then weights feeding into the unit will not change during BP training.
Proof
Observe the partial derivative formula below. The partial of E with respect to w(k, n) is 0 under the conditions of the lemma.
Note that as m n increases, becomes dominated by the first term, which has no
information related to changes in the nth input.
2. Together, Lemmas 1 through 4 provide the motivation for zero-mean inputs, random initial weights, and net control with mean and standard deviation ≠ 0.
3. Lemma 5 shows that if a multioutput MLP is trained, using the same hidden units for each output, it is not optimal.
4. Lemma 3 shows that BP lacks affine invariance [27].
5.4: Second-order learning algorithms
The basic idea behind using a second-order method is to improve on the first-order algorithms by using the second derivative along with the first derivative [5]. We present two one-stage algorithms, namely Newton's method and LM, and then a two-stage algorithm called OWO-multiple optimal learning factor [28–30].
5.4.1: Newton’s method
For Newton's method, given a starting point, we construct a quadratic approximation to a twice-differentiable error function that matches the first- and second-order derivative values at that point. We then minimize this quadratic function instead of the original error function by expanding the Taylor series of E′ about the point w_k, as is clear from the equation below:
We calculate the second-order direction, d, by solving the set of linear equations with OLS
$$ \mathbf{H}\,\mathbf{d} = \mathbf{g} \tag{48} $$
Assuming a quadratic error function in Eq. (9) and H to be positive definite, and applying the first-order necessary condition (FONC) [31] on all the weights in an MLP, we update the weights as
$$ \mathbf{w} \leftarrow \mathbf{w} + \mathbf{d} \tag{49} $$
Newton's algorithm can be summarized in Algorithm 5.
Algorithm 5 Newton's algorithm
1: Initialize w, N_it, it ← 0
2: while it < N_it do
3:   Calculate g and H from Eqs. (46), (47)
4:   Compute d from Eq. (48)
5:   Update w as w ← w + d
6:   it ← it + 1
7: end while
Newton's method is quadratically convergent and affine invariant [16]. Since it converges fast, we would like to use it to train an MLP, but generally, the Hessian H is singular [32].
If the error function is quadratic, then the approximation is exact and yields a one-step solution; otherwise, the approximation provides only an estimate of the exact solution. In the case of a nonquadratic error measure, we require a line search, and w is updated as
(50)
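The one-step behavior on a quadratic error can be illustrated with a small Python sketch. It assumes g is the negative gradient and that the direction d solves H d = g; the two-weight quadratic error, the use of Cramer's rule in place of OLS, and the function names are all hypothetical choices made to keep the example short.

```python
def solve_2x2(H, g):
    """Solve H d = g for a 2 x 2 system by Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if det == 0.0:
        raise ValueError("Hessian is singular")
    return [(g[0] * H[1][1] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]


def newton_step(w, neg_grad, hessian):
    """One Newton update: solve H d = g, then w <- w + d."""
    d = solve_2x2(hessian(w), neg_grad(w))
    return [wi + di for wi, di in zip(w, d)]


# Hypothetical quadratic error E(w) = (w1 - 1)^2 + 2 (w2 + 3)^2 + w1 w2
def neg_grad(w):
    return [-(2.0 * (w[0] - 1.0) + w[1]), -(4.0 * (w[1] + 3.0) + w[0])]


def hessian(w):
    return [[2.0, 1.0], [1.0, 4.0]]   # constant for a quadratic error


w1 = newton_step([10.0, -7.0], neg_grad, hessian)
```

Setting the gradient of this error to zero gives w = (20/7, −26/7), which is where the single step lands regardless of the starting point.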
5.4.2: LM algorithm
The LM algorithm is a compromise between Newton's method, which converges rapidly near local or global minima but may diverge, and gradient descent, which has assured convergence through a proper selection of the step size parameter but converges slowly. Following Eq. (45), the LM algorithm is a suboptimal method. Since H is usually singular in Newton's method, an alternative is to modify the Hessian matrix as in the LM [33] algorithm or to use a two-step method such as layer-by-layer training [34]. In LM, we modify the Hessian as
After obtaining H_LM, the weights of the model are updated using Eq. (49).
The regularizing parameter λ plays a crucial role in the way the LM algorithm functions. If we set λ equal to 0, then Eq. (52) reduces to Newton's method (Eq. 49). On the other hand, if we assign a large value to λ such that λ ⋅ I overpowers the Hessian H, the LM algorithm effectively behaves as a gradient descent algorithm. Press et al. [35] recommend an excellent Marquardt recipe for the selection of λ.
From a practical perspective, the computational complexity of obtaining H_LM can be demanding, mainly when the dimensionality of the weight vector w is high. Therefore, due to scalability constraints, LM is particularly suitable for small networks.
The LM algorithm can be summarized in Algorithm 6.
Algorithm 6 LM algorithm
1: Initialize w, N_it, it ← 0
2: while it < N_it do
3:   Present all patterns to the network to compute the error E_old from Eq. (9)
4:   Calculate g and H from Eqs. (46), (47)
5:   Obtain H_LM from Eq. (51)
6:   Compute d from Eq. (52)
7:   Update w as w ← w + d
8:   Recompute the error E_new by using the updated weights
9:   if E_new < E_old then
10:    Reduce the value of λ
11:    goto Step 3
12:  else
13:    Increase the value of λ
14:  end if
15:  it ← it + 1
16: end while
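The λ adaptation in Algorithm 6 can be sketched in Python as follows. The sketch assumes H_LM = H + λ ⋅ I, consistent with the description of λ ⋅ I overpowering the Hessian, and it discards a step that fails to reduce the error, which is a common implementation choice rather than something stated in the algorithm. The factor of 10, the floor on λ, the two-weight error function, and all names are hypothetical.

```python
def solve_2x2(H, g):
    """Solve H d = g for a 2 x 2 system by Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(g[0] * H[1][1] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]


def levenberg_marquardt(error, neg_grad, hessian, w, lam=0.01, n_it=30):
    """LM iterations with H_LM = H + lam * I and a simple lam schedule."""
    errors = [error(w)]
    for _ in range(n_it):
        H = hessian(w)
        # adding lam to the diagonal keeps the system solvable when H is singular
        H_lm = [[H[0][0] + lam, H[0][1]], [H[1][0], H[1][1] + lam]]
        d = solve_2x2(H_lm, neg_grad(w))
        w_new = [wi + di for wi, di in zip(w, d)]
        e_new = error(w_new)
        if e_new < errors[-1]:
            w = w_new                       # success: accept the step
            lam = max(lam / 10.0, 1e-8)     # and lean toward Newton's method
            errors.append(e_new)
        else:
            lam *= 10.0                     # failure: lean toward gradient descent
    return w, errors


# Hypothetical nonquadratic error whose Hessian is singular at the minimum (2, 1).
def error(w):
    return (w[0] - 2.0) ** 4 + (w[0] - 2.0 * w[1]) ** 2


def neg_grad(w):
    return [-(4.0 * (w[0] - 2.0) ** 3 + 2.0 * (w[0] - 2.0 * w[1])),
            4.0 * (w[0] - 2.0 * w[1])]


def hessian(w):
    return [[12.0 * (w[0] - 2.0) ** 2 + 2.0, -4.0], [-4.0, 8.0]]


w_final, errors = levenberg_marquardt(error, neg_grad, hessian, [0.0, 3.0])
```

Because rejected steps are discarded, the recorded errors decrease strictly, and the example error function shows why the λ term matters: its Hessian is singular at the minimum, so a pure Newton step would break down there.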
6: KL divergence
From an information theory point of view, Kullback-Leibler (KL) divergence is a relative entropy measure that estimates how closely an approximate distribution q(x) models the unknown distribution p(x). In order to arrive at the expression for KL divergence, we follow the probabilistic viewpoint. The following explanation is suitable for both continuous and discrete distributions. In order to quantitatively determine how good the approximate model q(x) is compared to p(x), we calculate the log-likelihood ratio for an entire dataset as
(53)
Eq. (53) represents, for any random sample x, how much more likely the data point is to occur under p(x) than under q(x). The intuition behind Eq. (53) is that if p(x) fits the data more closely, then it will give a value greater than 0, whereas a value less than 0 indicates that q(x) fits the data better. In case the models fit the data equally well, Eq. (53) will give a value of 0. So with Eq. (53), we can quantify how much better one model is than the other given the dataset. For a large dataset with N sampled points that are independent and identically distributed (iid), we calculate an analytical term called the average predictive power.
The probabilistic interpretation is connected to neural network design through the assumption of iid data. A common misconception about KL divergence is to think of it as a distance between p(x) and q(x). However, this is wrong since KL divergence does not follow the commutative property (it is not symmetric), that is, KL(p||q) ≠ KL(q||p).
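The asymmetry can be checked numerically with a short Python sketch. It uses the standard discrete form KL(p||q) = Σ p_i log(p_i / q_i) and assumes q_i > 0 wherever p_i > 0; the two example distributions are arbitrary.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
kl_pq = kl_divergence(p, q)   # divergence of q from p
kl_qp = kl_divergence(q, p)   # divergence of p from q
```

The two values differ for these distributions, while the divergence of a distribution from itself is zero, so KL divergence behaves like a directed measure rather than a distance.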
7: Generalized linear models
Building on Section 3 on linear regression, we add a nonlinear activation to the output neurons and create a class of models known as generalized linear models. Platt [36] has shown the utility of using nonlinear output activations such as sigmoids. We describe an algorithm that uses sigmoidal output activation to bound the network outputs, as shown in Fig. 7. The error of the resulting generalized linear classifier (GLC) is
(56)
where O p (i) is defined as f(y p (i)) and f(⋅) denotes the sigmoid activation function.
FIG 7 Generalized linear model.
In order to mitigate this increase in E G, we need to map the linear output activations yp to
sigmoidal outputs Op as
(57)
where a and b are scalars that are found such that E G is minimized
8: Kernel method
A second-order feed forward training algorithm for a radial basis function (RBF) classifier is
patterns, the RBF basis functions are given as
Equating the gradient of E_rbf(W_o) to 0, we solve for the output weight matrix W_o using OWO [37]. Let m_k, β_k, and c(n) denote the kth center vector, the kth spread parameter, and the weighted Euclidean distance measure weight for the nth input. These parameters are optimized using an efficient Newton's algorithm [37].
9: Nonlinear SVM classifier
A nonlinear support vector machine (SVM) follows a structure similar to that of the RBF classifier. In a nonlinear SVM, K(X_p, X_q) ≡ ϕ(X_p)^T ϕ(X_q) is called the kernel function and ϕ(X_q) is a feature function. For our comparisons , γ > 0. For a given two-class classification problem, a nonlinear SVM solves the following convex optimization problem:
subject to , ξ_i ≥ 0. Here, the weight vector W_o is N_sv by 1 and b is a bias. The basis vector X_p is N_sv by 1, and 1 ≤ p ≤ N_sv. C is a user-specified positive parameter, and the ξ_p are called the slack variables. We need to find the optimum values of the weight vector W_o, bias b, and the slack variables ξ_p to minimize E_svm. The coordinate descent algorithm [39] is used for training. For multiclass problems, a one-versus-the-rest strategy is implemented.
10: Tree ensembles
10.1: Decision trees
A decision tree [40] is an upside-down tree-like structure generated from training data, with the main root at the top, as shown in Fig. 8. Each node is further split into subsequent nodes, called branches or subtrees, depending on the decisions made using the training data. The ends of the tree, where it does not split further, are the leaf (terminal) nodes that indicate the final decision. Decision trees are used for both approximation and classification data.
FIG 8 Decision tree.
A decision tree is learned by splitting the branches into subbranches depending on the lowest error values. The split decision for approximation data is made based on the variance of the node