
Supervised Machine Learning: Lecture Notes for the Statistical Machine Learning Course




DOCUMENT INFORMATION

Basic information

Title Supervised Machine Learning: Lecture Notes for the Statistical Machine Learning Course
Authors Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, Thomas B. Schön
Institution Uppsala University
Subject Statistical Machine Learning
Document type Lecture Notes
Year 2019
City Uppsala
Format
Pages 112
Size 1.96 MB

Structure

  • 0.1 About these lecture notes
  • 1.1 What is machine learning all about?
  • 1.2 Regression and classification
  • 1.3 Overview of these lecture notes
  • 1.4 Further reading
  • 2.1 The regression problem
  • 2.2 The linear regression model
    • 2.2.1 Describe relationships — classical statistics
    • 2.2.2 Predicting future outputs — machine learning
  • 2.3 Learning the model from training data
    • 2.3.1 Maximum likelihood
    • 2.3.2 Least squares and the normal equations
  • 2.4 Nonlinear transformations of the inputs – creating more features
  • 2.5 Qualitative input variables
  • 2.6 Regularization
    • 2.6.1 Ridge regression
    • 2.6.2 LASSO
    • 2.6.3 General cost function regularization
  • 2.7 Further reading
  • 3.1 The classification problem
  • 3.2 Logistic regression
    • 3.2.1 Learning the logistic regression model from training data
    • 3.2.2 Decision boundaries for logistic regression
    • 3.2.3 Logistic regression for more than two classes
  • 3.3 Linear and quadratic discriminant analysis (LDA & QDA)
    • 3.3.1 Using Gaussian approximations in Bayes’ theorem
    • 3.3.2 Using LDA and QDA in practice
  • 3.4 Bayes’ classifier — a theoretical justification for turning p(y | x) into ŷ
    • 3.4.1 Bayes’ classifier
    • 3.4.2 Optimality of Bayes’ classifier
    • 3.4.3 Bayes’ classifier in practice: useless, but a source of inspiration
    • 3.4.4 Is it always good to predict according to Bayes’ classifier?
  • 3.5 More on classification and classifiers
    • 3.5.1 Linear and nonlinear classifiers
    • 3.5.2 Regularization
    • 3.5.3 Evaluating binary classifiers
  • 4.1 k-NN
    • 4.1.1 Decision boundaries for k-NN
    • 4.1.2 Choosing k
    • 4.1.3 Normalization
  • 4.2 Trees
    • 4.2.1 Basics
    • 4.2.2 Training a classification tree
    • 4.2.3 Other splitting criteria
    • 4.2.4 Regression trees
  • 5.1 Expected new data error E_new: performance in production
  • 5.2 Estimating E_new
    • 5.2.1 E_train ≉ E_new: We cannot estimate E_new from training data
    • 5.2.2 E_test ≈ E_new: We can estimate E_new from test data
    • 5.2.3 Cross-validation: E_val ≈ E_new without setting aside test data
  • 5.3 Understanding E_new
    • 5.3.1 E_new = E_train + generalization error
    • 5.3.2 E_new = bias² + variance + irreducible error
  • 6.1 Bagging
    • 6.1.1 Variance reduction by averaging
    • 6.1.2 The bootstrap
  • 6.2 Random forests
  • 6.3 Boosting
    • 6.3.1 The conceptual idea
    • 6.3.2 Binary classification, margins, and exponential loss
    • 6.3.3 AdaBoost
    • 6.3.4 Boosting vs. bagging: base models and ensemble size
    • 6.3.5 Robust loss functions and gradient boosting
  • 7.1 Neural networks for regression
    • 7.1.1 Generalized linear regression
    • 7.1.2 Two-layer neural network
    • 7.1.3 Matrix notation
    • 7.1.4 Deep neural network
    • 7.1.5 Learning the network from data
  • 7.2 Neural networks for classification
    • 7.2.1 Learning classification networks from data
  • 7.3 Convolutional neural networks
    • 7.3.1 Data representation of an image
    • 7.3.2 The convolutional layer
    • 7.3.3 Condensing information with strides
    • 7.3.4 Multiple channels
    • 7.3.5 Full CNN architecture
  • 7.4 Training a neural network
    • 7.4.1 Initialization
    • 7.4.2 Stochastic gradient descent
    • 7.4.3 Learning rate
    • 7.4.4 Dropout
  • 7.5 Perspective and further reading
  • A.1 Random variables
    • A.1.1 Marginalization
    • A.1.2 Conditioning
  • A.2 Approximating an integral with a sum
  • B.1 A general iterative solution
  • B.2 Commonly used search directions
    • B.2.1 Steepest descent direction
    • B.2.2 Newton direction
    • B.2.3 Quasi-Newton
  • B.3 Further reading

Content

Supervised Machine Learning: Lecture notes for the Statistical Machine Learning course. Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, Thomas B. Schön. Version March 12, 2019. Department of Information Technology, Uppsala University.

What is machine learning all about?

Machine learning enables computers to learn from data without explicit programming, by using mathematical models whose unknown variables are identified from the data. A simple example is fitting a straight line to data, but machine learning often employs more complex models. The learned model allows conclusions to be drawn about new, unseen data: for instance, a model trained on 1,000 puppy images can determine whether a new image depicts a puppy, a capability known as generalization.

The science of machine learning is about learning models that generalize well.

Supervised learning analyzes labeled data, represented as pairs {x_i, y_i}, where x_i denotes inputs and y_i denotes outputs. This allows us to learn the relationship between inputs and outputs, as in the medical diagnosis of heart disease, where electrocardiogram (ECG) readings serve as inputs x and the corresponding diagnoses as outputs y. Given a substantial dataset of ECG readings and their associated diagnoses, we can train a supervised machine learning model to predict diagnoses \( \hat{y} \) for new ECG readings. A well-trained model makes predictions that closely align with the true diagnoses, demonstrating effective generalization beyond the training data.

Supervised learning faces a significant challenge in its reliance on labeled data, i.e., on both inputs and outputs {x_i, y_i}. Labeling data can be costly, difficult, or even impossible, as it often requires human interpretation, and many advanced methods additionally demand large datasets for good performance. This has motivated unsupervised learning, which uses only unlabeled input data {x_i}; a key example is clustering, where data is automatically grouped by similarity. Semi-supervised learning bridges the gap between the two by leveraging both labeled and unlabeled data.

A typical such setting is access to a vast amount of unlabeled data combined with a small but valuable set of labeled data; using the small labeled set together with the larger unlabeled one can yield significant improvements in performance.

In reinforcement learning, a key branch of machine learning, the focus extends beyond merely analyzing measured data for predictions or understanding specific situations; it emphasizes learning through interactions and feedback to optimize decision-making processes.

1 Some common synonyms used for the input variable include feature, predictor, regressor, covariate, explanatory variable, controlled variable and independent variable.

2 Some common synonyms used for the output variable include response, regressand, label, explained variable, predicted variable and dependent variable.

To build a system capable of taking actions in the real world, the most common approach is to maximize a reward that encourages the desired state of the environment; this is closely related to reinforcement learning and control theory. The emerging field of causal learning addresses the harder question of cause and effect, in contrast to traditional machine learning, which primarily identifies correlations in data.

Regression and classification

Supervised machine learning methods can be categorized by the nature of the output variable, which can be either quantitative or qualitative. The distinction between these variable types is crucial for selecting an appropriate method for a given problem; Table 1.1 gives examples of quantitative and qualitative variables.

Table 1.1: Examples of quantitative and qualitative variables.

Variable type                                Example                                     Handle as
Numeric (continuous)                         32.23 km/h, 12.50 km/h, 42.85 km/h          Quantitative
Numeric (discrete) with natural ordering     0 children, 1 child, 2 children             Quantitative
Numeric (discrete) without natural ordering  1 = Sweden, 2 = Denmark, 3 = Norway         Qualitative
Text (not numeric)                           Uppsala University, KTH, Lund University    Qualitative

Depending on whether the output of a problem is quantitative or qualitative, we refer to the problem as either regression or classification.

Regression means the output is quantitative, and classification means the output is qualitative.

This means that whether a problem is one of regression or classification depends only on its output. The input can be either quantitative or qualitative in both cases.

The distinction between quantitative and qualitative, and thereby between regression and classification, is to some extent arbitrary. For instance, one might consider the absence of children qualitatively different from having children, leading to the classification output "children: yes/no" instead of the quantitative "0, 1, or 2 children". In this way, a regression problem can be turned into a classification problem, depending on how the data is interpreted.

Overview of these lecture notes


The following sketch gives an idea of how the chapters are connected.

Chapter 2: The regression problem and linear regression

Chapter 3: The classification problem and three parametric classifiers

Chapter 4: Non-parametric methods for regression and classification: k-NN and trees

Chapter 5: How well does a method perform?

Chapter 6: Ensemble methods

Chapter 7: Neural networks and deep learning

(Arrows in the original sketch indicate which chapters are needed and which are recommended as preparation for others.)

Further reading

There are numerous comprehensive textbooks on machine learning that present the subject in different ways. Hastie, Tibshirani, and Friedman (2009) offer a mathematically solid introduction to statistical machine learning, with a lighter companion text (James et al. 2013) that effectively conveys the key concepts. These texts do not dwell on Bayesian methods, which are well covered by complementary works such as Barber (2012), Bishop (2006), and Murphy (2012). MacKay (2003) provides valuable insights by linking machine learning to information theory, and Efron and Hastie (2016) take a historical perspective on the evolution of computer-age data analysis. For a modern mathematical introduction to machine learning, Deisenroth, Faisal, and Ong (2019) is a useful resource, alongside the overview papers by Ghahramani (2015) and Jordan and Mitchell (2015).

The field of machine learning is currently thriving, with key conferences such as the International Conference on Machine Learning (ICML) and the Conference on Neural Information Processing Systems (NeurIPS) showcasing cutting-edge research annually; all papers from these conferences are freely available on their websites, icml.cc and neurips.cc. Other significant venues include the International Conference on Artificial Intelligence and Statistics (AISTATS) and the International Conference on Learning Representations (ICLR). Leading journals include the Journal of Machine Learning Research (JMLR) and IEEE Transactions on Pattern Analysis and Machine Intelligence. There is also quite a lot of relevant work published in statistical journals, in particular within the area of computational statistics.

2 The regression problem and linear regression

The regression problem is one of the two main problems addressed in these notes, alongside classification. A key method for regression is linear regression, which despite its simplicity is highly effective and, in addition, serves as a building block for more complex methods, including the deep learning of Chapter 7.

The regression problem

Regression is about learning the relationship between input variables, which can be either qualitative or quantitative, and a quantitative output variable. Mathematically, we seek a model y = f(x) + ε, where ε is a noise/error term accounting for everything not captured by the model. Statistically, ε is treated as a random variable, independent of the inputs, with mean zero.

In this chapter, we illustrate regression using the car stopping distances dataset of Example 2.1. The objective is a regression model that predicts the stopping distance of a car given its current speed.

Ezekiel and Fox (1959) provide a dataset of 62 observations of the distance required for various cars to come to a complete stop from different initial speeds. The dataset contains two variables:

- Speed: The speed of the car when the brake signal is given.

- Distance: The distance traveled after the signal is given until the car has reached a full stop.

We decide to interpret Speed as the input variable x, and Distance as the output variable y.

Our objective is to use linear regression to predict the stopping distance for the initial speeds 33 mph and 45 mph, for which no data was recorded. The dataset is admittedly old and its conclusions may not transfer to modern cars, so readers are encouraged to redo the analysis on an example of their own liking.

1 We will start with quantitative input variables, and discuss qualitative input variables later, in Section 2.5.

The linear regression model

Describe relationships — classical statistics

In scientific fields such as medicine and sociology, a common question is whether there is a correlation between certain variables, for example the effect of seafood consumption on longevity. Such questions can be studied with linear regression models by examining the parameters β. If β₁ = 0, there is no correlation between x₁ and y, barring influence from other inputs. By estimating β₁ and constructing a confidence interval, one can judge whether x₁ and y are likely correlated, namely if zero is not contained in the interval. This approach, hypothesis testing, is a central topic in classical statistics. Our primary focus, however, is on using linear regression models for prediction.

Predicting future outputs — machine learning

In machine learning, the focus is on predicting unseen outputs \( \hat{y} \) for new inputs \( \mathbf{x}_\star = [x_{\star 1}, x_{\star 2}, \dots, x_{\star p}]^\mathsf{T} \). To make a prediction for a test input \( \mathbf{x}_\star \), we use the model together with the assumption that the error term ε has mean zero, giving the prediction

\( \hat{y}(\mathbf{x}_\star) = \beta_0 + \beta_1 x_{\star 1} + \beta_2 x_{\star 2} + \cdots + \beta_p x_{\star p}. \)

In our notation, we use the symbol \( \hat{y} \) to represent a prediction or our best estimate Conversely, if we could directly observe the true output from \( x \), it would be indicated by \( y \) (without the hat).

Learning the model from training data

Maximum likelihood

Our approach to estimating the unknown parameters β from the training data \( \mathcal{T} \) is the maximum likelihood method: we seek the value of β that makes the observed data y as likely as possible, i.e., we solve

\( \max_{\boldsymbol{\beta}} \; p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}), \)   (2.8)

where \( p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \) is the probability density of the data y for given parameter values β. The solution — the learned parameters — is denoted \( \hat{\boldsymbol{\beta}} = [\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p]^\mathsf{T} \); more compactly, \( \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \).

To give 'likely' a precise mathematical meaning, we must make assumptions about the noise term ε. A common assumption is that ε follows a Gaussian distribution with mean zero and variance \( \sigma_\varepsilon^2 \), i.e., \( \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2) \).

This implies that the conditional probability density function of the output y for a given value of the input x is given by

\( p(y \mid \mathbf{x}, \boldsymbol{\beta}) = \mathcal{N}\!\left(y \mid \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \; \sigma_\varepsilon^2\right). \)   (2.11)

Furthermore, the n observed training data points are assumed to be independent realizations from this statistical model. This implies that the likelihood of the training data factorizes as

\( p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}). \)   (2.12)

Putting (2.11) and (2.12) together we get

\( p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) = \frac{1}{(2\pi\sigma_\varepsilon^2)^{n/2}} \exp\!\left( -\frac{1}{2\sigma_\varepsilon^2} \sum_{i=1}^{n} \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} - y_i \right)^2 \right). \)   (2.13)


To maximize the likelihood with respect to β, as in (2.8), note that (2.13) depends on β only through the sum in the exponent. Since the exponential function is monotonically increasing, maximizing (2.13) is equivalent to minimizing that sum.

That is, we minimize the sum of the squares of the differences between each output value \( y_i \) and the corresponding model prediction \( \hat{y}_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \),

\( \min_{\boldsymbol{\beta}} \; \sum_{i=1}^{n} \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} - y_i \right)^2. \)   (2.14)

Minimizing this sum of squared differences is known as least squares.

We will see below how the values \( \hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p \) can be computed. It is also worth noting that, although a Gaussian distribution is the common assumption for ε, other distributions can be considered. For example, assuming that ε follows a Laplace distribution instead leads to a different cost function:

the sum of the absolute values of all differences, \( \sum_{i=1}^{n} |\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} - y_i| \) (2.15), rather than their squares. A key advantage of the Gaussian assumption is that a closed-form solution for \( \hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p \) is available, whereas alternative assumptions on ε usually require computationally more expensive methods.

Remark 2.2 With the terminology we will introduce in the next chapter, we could refer to (2.13) as the likelihood function, which we will denote by \( \ell(\boldsymbol{\beta}) \).

Remark 2.3 It is not uncommon in the literature to skip the maximum likelihood motivation, and just state (2.14) as a (somewhat arbitrary) cost function for optimization.

Least squares and the normal equations

Assuming the noise/error ε follows a Gaussian distribution, the maximum likelihood parameters \( \hat{\boldsymbol{\beta}} \) are given by the optimization problem (2.14), illustrated in Figure 2.2. The least squares problem can be written compactly in matrix and vector notation as

\( \min_{\beta_0, \beta_1, \dots, \beta_p} \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|_2^2, \)   (2.16)

where \( \|\cdot\|_2 \) denotes the Euclidean vector norm. From a linear algebra perspective, this amounts to finding the vector closest to y in the subspace of \( \mathbb{R}^n \) spanned by the columns of X. The solution is the orthogonal projection of y onto this subspace, and the corresponding \( \hat{\boldsymbol{\beta}} \) can be shown (Section 2.A) to satisfy the normal equations.

Equation (2.17), \( \mathbf{X}^\mathsf{T}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\mathsf{T}\mathbf{y} \), is often referred to as the normal equations, and gives the solution to the least squares problem (2.14, 2.16). If \( \mathbf{X}^\mathsf{T}\mathbf{X} \) is invertible, which often is the case, \( \hat{\boldsymbol{\beta}} \) has the closed form

\( \hat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}. \)   (2.18)

The existence of a closed-form solution for least squares is significant and contributes to its popularity and widespread application In contrast, alternative assumptions about ε, other than Gaussianity, can lead to complications, such as the absence of a closed-form solution, as illustrated in equation (2.15).
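As a concrete illustration, here is a minimal numpy sketch of (2.17)–(2.18); the synthetic data and variable names are our own, not from the notes:

```python
import numpy as np

# Synthetic training data: n = 5 points roughly following y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.0, 4.9, 7.2, 8.8])

# Build X with a leading column of ones corresponding to the offset beta_0
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X beta = X^T y  (eq. 2.17)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [1.0, 2.0]

# np.linalg.lstsq solves the same least squares problem, and also handles
# a singular X^T X (the case with infinitely many solutions) gracefully
beta_hat_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```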

Time to reflect 2.1: What does it mean in practice that \( \mathbf{X}^\mathsf{T}\mathbf{X} \) is not invertible?

Figure 2.2: The least squares criterion chooses the model (the blue line) such that the sum of the squared errors — the total area of the orange squares, one per data point — is as small as possible, which is the reason behind the term 'least squares'.

When the columns of X are linearly independent and p = n − 1 (so that X is square), the columns of X span all of \( \mathbb{R}^n \), and there is a unique solution with y = Xβ exactly: the model fits the training data perfectly, and (2.17) simplifies to \( \hat{\boldsymbol{\beta}} = \mathbf{X}^{-1}\mathbf{y} \). A perfect fit, however, is usually not desirable: such a model may be overfitted and generalize poorly to new, unseen data.

By inserting the matrices (2.7) from Example 2.2 into the normal equations (2.17), we obtain \( \hat{\beta}_0 = -20.1 \) and \( \hat{\beta}_1 = 3.1 \). If we plot the resulting model, it looks like this:

With this model, the predicted stopping distance for \( x_\star = 33 \) mph is \( \hat{y}_\star = 84 \) feet, and for \( x_\star = 45 \) mph it is \( \hat{y}_\star = 121 \) feet.

Nonlinear transformations of the inputs – creating more features

Linear regression derives its name from the fact that the output is modeled as a linear combination of the input variables². While speed is the natural input in our example, transformations of it, such as the kinetic energy, can equally well be treated as inputs. In fact, arbitrary nonlinear transformations of the original input variables can be included in the linear regression model, also in the one-dimensional case.

2 And also the constant 1, corresponding to the offset β₀. For this reason, affine would perhaps be a better term than linear.


(a) The maximum likelihood solution for a linear regression model with a second-order polynomial appears as a curve rather than a straight line. This is only an effect of the visualization, however: in three dimensions, with each feature (x and x²) on its own axis, the model is still an affine set.

(b) The maximum likelihood solution for a linear regression model with a 4th order polynomial. Such a model has five unknown coefficients, so we can expect it to fit five data points exactly.

Figure 2.3: A linear regression model with 2nd and 4th order polynomials in the input x, as in (2.20).

For a one-dimensional input x, the vanilla linear regression model is

\( y = \beta_0 + \beta_1 x + \varepsilon. \)   (2.19)

However, we can also extend the model with, for instance, x², x³, …, x^p as inputs, and thus obtain a linear regression model which is a polynomial in x,

\( y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \varepsilon. \)   (2.20)

This is still a linear regression model, since the unknown parameters appear linearly: y is a linear combination of the terms x, x², …, x^p. The parameters \( \hat{\boldsymbol{\beta}} \) are learned exactly as before; only the matrix X differs between the models. We refer to such transformed inputs as features; in more complex settings the distinction between original inputs and transformed features blurs, and the two terms are used interchangeably. A code sketch follows below.
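To make (2.20) concrete, the following sketch (our own illustration, with made-up data) builds the polynomial feature matrix and reuses the same least squares machinery; only X changes between models:

```python
import numpy as np

def poly_features(x, degree):
    """Feature matrix [1, x, x^2, ..., x^degree] for a 1-D input vector x."""
    return np.column_stack([x**d for d in range(degree + 1)])

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.shape)  # toy data

# A 4th order polynomial model (2.20): five unknown parameters
X = poly_features(x, degree=4)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict at a test input x_star
x_star = np.array([0.5])
y_star = poly_features(x_star, degree=4) @ beta_hat
```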

Figure 2.3 shows two linear regression models with transformed (polynomial) inputs, which raises the question of how a 'linear' regression model can produce a curved line. The answer lies in the dimension of the plot: in Figure 2.3(a) the two-dimensional plot shows only x and y, but a three-dimensional plot over x, x², and y would show an affine set. For Figure 2.3(b), the corresponding plot would have to be five-dimensional.

Although the model in Figure 2.3(b) fits all data points perfectly, higher-order polynomials often behave strangely outside the range of the data, which makes them less useful in machine learning. They are therefore rarely used; a more popular nonlinear transformation is the radial basis function (RBF) kernel

\( K_c(x) = \exp\!\left(-\frac{(x - c)^2}{2\ell^2}\right), \)   (2.21)

i.e., a Gauss bell centered around c. It can be used, instead of polynomials, in the linear regression model as

\( y = \beta_0 + \beta_1 K_{c_1}(x) + \beta_2 K_{c_2}(x) + \cdots + \beta_p K_{c_p}(x) + \varepsilon. \)   (2.22)

In this model, 'bumps' are placed at the locations c₁, c₂, …, c_p; the user chooses the locations and the length scale ℓ, and only the parameters β₀, β₁, …, β_p are learned from the data by linear regression, as illustrated in Figure 2.4. RBF kernels are often preferred over polynomials because of their local nature: a small change in one parameter mainly affects the model around that kernel, whereas a change in a polynomial coefficient affects the model everywhere. A code sketch is given below.
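A corresponding sketch for the RBF model (2.21)–(2.22); the kernel centers, length scale, and data below are our own illustrative choices:

```python
import numpy as np

def rbf_features(x, centers, ell):
    """Feature matrix [1, K_c1(x), ..., K_cp(x)] of Gauss bells (2.21)."""
    K = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * ell**2))
    return np.column_stack([np.ones_like(x), K])

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)  # toy data

centers = np.linspace(0, 10, 8)  # kernel locations c_1, ..., c_p: user's choice
ell = 1.0                        # length scale: user's choice
X = rbf_features(x, centers, ell)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # only the betas are learned
```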

We continue with Example 2.1, but this time we also add the squared speed as a feature, i.e., the features are now x and x². This gives the new matrices (cf. (2.7))

(2.23), and when we insert them into the normal equations (2.17), the new parameter estimates are \( \hat{\beta}_0 = 1.58 \), \( \hat{\beta}_1 = 0.42 \), and \( \hat{\beta}_2 = 0.07 \). (Note that \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) change compared to Example 2.3.) The new model looks like

The predicted stopping distances using the new model are 87 feet at 33 mph and 153 feet at 45 mph, which differ from the predictions made in Example 2.3. We cannot say from the data alone which model is the "true" one, but visually the new model appears to fit the observed data slightly better, thanks to its additional feature. To compare different feature sets systematically, rather than by visual inspection, cross-validation can be used, as discussed in Chapter 5.

Figure 2.4: A linear regression model using RBF kernels as features. The dashed gray lines are the individual kernels, located at c₁, c₂, c₃, and c₄. When the model is learned, the parameters β₀, β₁, …, β_p are chosen such that the sum of all kernels (solid blue line) fits the data as well as possible, typically in the least squares sense.

Polynomials and RBF kernels are two specific instances of nonlinear input transformations; to distinguish the transformed data from the original inputs, the term features is commonly used for the former. A systematic way of choosing among candidate features is to compare them using cross-validation, as discussed in Chapter 5.

Qualitative input variables

The regression problem concerns predicting a quantitative output y, while the inputs x can be either quantitative or qualitative. This flexibility in input types allows regression to be applied well beyond purely numerical data.

We can represent a qualitative input variable with two distinct classes, say type A and type B, by creating a dummy variable x,

\( x = \begin{cases} 0 & \text{if type A} \\ 1 & \text{if type B} \end{cases} \)   (2.24)

and use this variable in the linear regression model. This effectively gives us a linear regression model which looks like

\( y = \beta_0 + \beta_1 x + \varepsilon = \begin{cases} \beta_0 + \varepsilon & \text{if type A} \\ \beta_0 + \beta_1 + \varepsilon & \text{if type B} \end{cases} \)   (2.25)

The labeling of type A and B is arbitrary, and alternatives such as x = 1 and x = −1 would work equally well. The same approach extends to qualitative input variables with more than two categories, say types A, B, C, and D. With four distinct values, we create three dummy variables,

\( x_1 = \begin{cases} 1 & \text{if type B} \\ 0 & \text{if not type B} \end{cases} \quad x_2 = \begin{cases} 1 & \text{if type C} \\ 0 & \text{if not type C} \end{cases} \quad x_3 = \begin{cases} 1 & \text{if type D} \\ 0 & \text{if not type D} \end{cases} \)   (2.26)

which, altogether, gives the linear regression model

\( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon = \begin{cases} \beta_0 + \varepsilon & \text{if type A} \\ \beta_0 + \beta_1 + \varepsilon & \text{if type B} \\ \beta_0 + \beta_2 + \varepsilon & \text{if type C} \\ \beta_0 + \beta_3 + \varepsilon & \text{if type D} \end{cases} \)   (2.27)

Qualitative inputs can be handled similarly in other problems and methods as well, such as logistic regression, k-NN, and deep learning. A code sketch of the dummy-variable construction follows below.
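A small sketch of the dummy-variable construction (2.26)–(2.27); the category values and data layout are our own example:

```python
import numpy as np

types = np.array(["A", "C", "B", "A", "D"])  # a qualitative input column

# Three dummy variables for four categories; type A is the reference level
x1 = (types == "B").astype(float)
x2 = (types == "C").astype(float)
x3 = (types == "D").astype(float)

X = np.column_stack([np.ones(len(types)), x1, x2, x3])
# Each row is [1, x1, x2, x3]: a type A row is [1, 0, 0, 0], so its prediction
# is beta_0; a type B row gives beta_0 + beta_1; etc., exactly as in (2.27)
```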

Regularization

Ridge regression

In ridge regression (also known as Tikhonov regularization, L2 regularization, or weight decay), the least squares criterion (2.16) is replaced with the modified minimization problem

\( \min_{\beta_0, \beta_1, \dots, \beta_p} \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|_2^2 + \gamma \|\boldsymbol{\beta}\|_2^2. \)   (2.28)

The regularization parameter γ ≥ 0 has to be chosen by the user. With γ = 0 we recover the original least squares problem, whereas letting γ → ∞ forces all parameters β_j toward zero. A good trade-off usually lies somewhere in between and depends on the problem at hand; it can be found by manual tuning or systematically, e.g., by cross-validation.

It is actually possible to derive a version of the normal equations (2.17) for (2.28), namely

\( (\mathbf{X}^\mathsf{T}\mathbf{X} + \gamma \mathbf{I}_{p+1}) \hat{\boldsymbol{\beta}} = \mathbf{X}^\mathsf{T}\mathbf{y}, \)   (2.29)

where \( \mathbf{I}_{p+1} \) is the identity matrix of size (p+1) × (p+1). If γ > 0, the matrix \( \mathbf{X}^\mathsf{T}\mathbf{X} + \gamma \mathbf{I}_{p+1} \) is always invertible, and we have the closed form solution

\( \hat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X} + \gamma \mathbf{I}_{p+1})^{-1} \mathbf{X}^\mathsf{T}\mathbf{y}. \)   (2.30)
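In code, ridge regression changes a single line compared to the plain least squares solver; a minimal sketch of (2.30):

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Closed-form ridge regression solution (2.30)."""
    p1 = X.shape[1]                   # p + 1 columns, including the offset
    A = X.T @ X + gamma * np.eye(p1)  # invertible whenever gamma > 0
    return np.linalg.solve(A, X.T @ y)
```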

LASSO

LASSO, which stands for Least Absolute Shrinkage and Selection Operator, uses L1 regularization: the least squares criterion is replaced with

\( \min_{\beta_0, \beta_1, \dots, \beta_p} \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|_2^2 + \gamma \|\boldsymbol{\beta}\|_1, \)   (2.31)

where \( \|\cdot\|_1 \) denotes the Manhattan norm. Unlike ridge regression, LASSO has no closed-form solution; the problem is convex, however, and can be solved efficiently by numerical optimization.

As in ridge regression, the regularization parameter γ has to be chosen by the user: γ = 0 gives the least squares problem, and γ → ∞ gives \( \boldsymbol{\beta} = \mathbf{0} \). Between these extremes, however, the two methods behave differently: ridge regression shrinks all coefficients toward small values, whereas LASSO favors sparse solutions in which only a few parameters are non-zero and the rest are exactly zero. LASSO thus tends to

'switch some of the inputs off' by setting the corresponding parameters to zero, and it can therefore be used as an input (or feature) selection method. A numerical-solver sketch follows below.
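Since (2.31) has no closed-form solution, a numerical solver is needed. One simple choice is proximal gradient descent (ISTA) with soft-thresholding; this particular solver is our own illustrative pick, not one prescribed by the notes:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, gamma, n_iter=5000):
    """Minimize ||X beta - y||_2^2 + gamma * ||beta||_1 (eq. 2.31) by ISTA."""
    L = 2 * np.linalg.norm(X.T @ X, 2)  # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - grad / L, gamma / L)
    return beta  # typically sparse: some entries are exactly zero
```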

Example 2.5: Regularization in a linear regression RBF model

Consider learning a linear regression model with eight RBF kernels as features from nine data points. Since the number of features p equals the number of data points minus one (n − 1), we expect a perfect fit. As the figure below shows, however, the resulting model is overfitted: it adapts too much to the data and behaves erratically between the data points.

Ridge regression and LASSO can both be used as remedies. While the final models they produce may look similar, their parameters differ markedly; in particular, LASSO gives a sparse solution that uses only 5 out of the 8 radial basis functions. Which of the two is preferable depends on the problem at hand.

(a) Model learned with least squares. The model (2.16) fits the data exactly, but is overfitted: it behaves unrealistically both between the data points and outside the observed range. The learned parameter values \( \hat{\beta} \) are on the order of 30 and −30.

(b) The same model learned with ridge regression (2.28), for a certain value of γ. The model no longer fits the training data perfectly, but makes a more sensible trade-off between fitting the data and avoiding overfitting; the parameter values \( \hat{\beta} \) are now roughly spread between −0.5 and 0.5.

(c) The same model again, this time learned with LASSO (2.31), for a certain value of γ. Again a more sensible trade-off than in (a); in addition, three of the nine parameters are exactly zero, and the remaining ones lie between −1 and 1.

General cost function regularization

Ridge regression and LASSO are the two most widely used regularization techniques for linear regression, and both work by modifying the cost function. They are special cases of a more general regularization scheme, in which one minimizes an objective of the form

\( V(\boldsymbol{\beta}, \mathbf{X}, \mathbf{y}) + \gamma\, R(\boldsymbol{\beta}). \)   (2.32)

Equation (2.32) contains three key components: a term V that measures how well the model fits the data, a term R that penalizes model complexity through large parameter values, and a trade-off parameter γ that balances the two. A generic-solver sketch follows below.
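To illustrate the general template (2.32), any differentiable fit term V plus penalty R can be handed to a generic numerical minimizer; a sketch using scipy, where the specific choices of V and R are our own examples:

```python
import numpy as np
from scipy.optimize import minimize

def regularized_fit(X, y, gamma, penalty):
    """Numerically minimize V(beta, X, y) + gamma * penalty(beta), cf. (2.32)."""
    def objective(beta):
        V = np.sum((X @ beta - y)**2)     # data-fit term (here: least squares)
        return V + gamma * penalty(beta)  # plus the weighted complexity penalty
    beta0 = np.zeros(X.shape[1])
    return minimize(objective, beta0).x

# Ridge (2.28) is recovered with penalty = lambda b: np.sum(b**2).
# For the non-smooth LASSO penalty ||b||_1, a solver designed for
# non-differentiable objectives (such as the ISTA sketch above) is preferable.
```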

Further reading

Linear regression has been used for over 200 years: the least squares method was introduced independently by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809. Its importance is reflected in countless statistics and machine learning textbooks, including Bishop (2006), Gelman et al. (2013), and Hastie, Tibshirani, and Friedman (2009).

The least squares technique has a long history, but its regularized versions are relatively new Ridge regression, introduced by Hoerl and Kennard in 1970, is also known as Tikhonov regularization in numerical analysis The LASSO method was first presented by Tibshirani in 1996 A comprehensive overview of sparse models and the LASSO is provided in the recent monograph by Hastie, Tibshirani, and Wainwright (2015).

2.A Derivation of the normal equations

The normal equations (2.17),

\( \mathbf{X}^\mathsf{T}\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\mathsf{T}\mathbf{y}, \)

can be derived from (2.16),

\( \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|_2^2, \)

in different ways. We will present one derivation based on (matrix) calculus and one based on geometry and linear algebra.

No matter how (2.17) is derived, if \( \mathbf{X}^\mathsf{T}\mathbf{X} \) is invertible, it (uniquely) gives

\( \hat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T}\mathbf{y}. \)

If \( \mathbf{X}^\mathsf{T}\mathbf{X} \) is not invertible, then (2.17) has infinitely many solutions \( \hat{\boldsymbol{\beta}} \), which are all equally good solutions to the problem (2.16).

We write the cost as

\( V(\boldsymbol{\beta}) = \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|_2^2 = (\mathbf{X}\boldsymbol{\beta} - \mathbf{y})^\mathsf{T}(\mathbf{X}\boldsymbol{\beta} - \mathbf{y}) = \mathbf{y}^\mathsf{T}\mathbf{y} - 2\mathbf{y}^\mathsf{T}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}, \)   (2.33)

and differentiate V(β) with respect to the vector β,

\( \frac{\partial}{\partial \boldsymbol{\beta}} V(\boldsymbol{\beta}) = -2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}. \)   (2.34)

Since V(β) is a positive quadratic form, its minimum must be attained at \( \frac{\partial}{\partial \boldsymbol{\beta}} V(\boldsymbol{\beta}) = \mathbf{0} \), which characterizes the solution \( \hat{\boldsymbol{\beta}} \) as \( \mathbf{X}^\mathsf{T}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\mathsf{T}\mathbf{y} \).


To minimize \( \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|_2^2 \), we must choose β such that \( \mathbf{X}\boldsymbol{\beta} \) is the orthogonal projection of y onto the subspace spanned by the columns \( c_j \) of X. This orthogonal projection is characterized by the normal equations.

We can decompose y into two components, \( \mathbf{y} = \mathbf{y}_\parallel + \mathbf{y}_\perp \), where \( \mathbf{y}_\perp \) is orthogonal to the subspace spanned by all columns \( c_j \), and \( \mathbf{y}_\parallel \) lies within that subspace. Since \( \mathbf{y}_\perp \) is orthogonal to both \( \mathbf{y}_\parallel \) and \( \mathbf{X}\boldsymbol{\beta} \), the Pythagorean theorem gives \( \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|_2^2 = \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}_\parallel\|_2^2 + \|\mathbf{y}_\perp\|_2^2 \geq \|\mathbf{y}_\perp\|_2^2 \).

The minimum of the criterion \( \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|_2^2 \) is therefore attained by choosing β such that \( \mathbf{X}\boldsymbol{\beta} = \mathbf{y}_\parallel \). Consequently, the optimal \( \hat{\boldsymbol{\beta}} \) must make the residual \( \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{y} \) orthogonal to the subspace spanned by all columns \( c_j \)

(remember that two vectors u, v are, by definition, orthogonal if their scalar product, \( u^\mathsf{T} v \), is 0). Since the columns \( c_j \) together form the matrix X, we can write this compactly as

\( (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\mathsf{T} \mathbf{X} = \mathbf{0}, \)   (2.39)

where the right-hand side is the (p+1)-dimensional zero vector. This can equivalently be written as \( \mathbf{X}^\mathsf{T}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\mathsf{T}\mathbf{y} \), i.e., the normal equations (2.17).

3 The classification problem and three parametric classifiers

In this chapter, we explore the classification problem, which has qualitative outputs, in contrast to the quantitative outputs of regression problems. A method for classification is called a classifier. We begin with logistic regression as our first classifier, and also introduce linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). More advanced classification techniques, including classification trees, boosting, and deep learning, are discussed in later chapters.

The classification problem

Classification is about predicting a qualitative output from inputs of any type, quantitative or qualitative. The output is restricted to a finite set of values; we let K denote the number of possible classes (or labels), for example K = 2 for the binary output {false, true} or K = 4 for {Sweden, Norway, Finland, Denmark}. Throughout, we assume K is known and, for notational convenience, use the integers 1, 2, …, K as class labels. This integer labeling does not imply any inherent ordering among the classes.

In binary classification, when there are only two classes (K = 2), we typically use the labels 0 and 1 instead, or speak of the positive (k = 1) and negative (k = 0) class. This choice of labeling is purely for mathematical convenience¹.

Classification amounts to predicting the output from the input, and we take a statistical approach by modeling the class probabilities p(y | x), where y denotes the output class label (1, 2, …, or K) and x the input variables. (We use the notation p for both probability masses, for qualitative variables, and probability densities, for quantitative ones.) The quantity p(y | x) describes the probability that the class label is y given that we observe the input x. This probabilistic view treats the class label y as a random variable, reflecting the randomness present in the real-world process from which the data originates.

1 In Chapter 6 we will use k = 1 and k = −1 instead.

Example 3.1: Modeling voting behavior — randomness in the class label y

When analyzing voting preferences among different population groups, not all individuals within a given demographic support the same political party, which we can describe mathematically by treating the vote as a random variable following a probability distribution. For instance, suppose that among 45-year-old women, 13% support the cerise party, 39% the turquoise party, and 48% the purple party. We can express this as p(y = cerise party | 45 year old women) = 0.13, p(y = turquoise party | 45 year old women) = 0.39, and p(y = purple party | 45 year old women) = 0.48.

In this way, we use probabilities p(y | x) to describe the non-trivial fact that

(a) not all 45 year old women vote for the same party, but

(b) the vote is not completely random either: among 45-year-old women, the purple party is the most likely choice and the cerise party the least likely.

The number of output classes in this example is K = 3.

Logistic regression

Learning the logistic regression model from training data

The logistic function is what takes us from linear regression (a regression method) to logistic regression (a classification method). A consequence of this transformation is that we cannot use the convenient normal equations to estimate β in logistic regression. As in linear regression, we want to learn β from the training data \( \mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n \) using the maximum likelihood approach, i.e., by solving \( \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}) \).

Let us now work out a detailed expression for the likelihood function²:

For numerical reasons, it is often better to work with the logarithm of \( \ell(\boldsymbol{\beta}) \). Since the logarithm is a monotonic function, maximizing \( \log \ell(\boldsymbol{\beta}) \) yields the same maximizing argument as maximizing \( \ell(\boldsymbol{\beta}) \):

\( \log \ell(\boldsymbol{\beta}) = \sum_{i: y_i = 1} \boldsymbol{\beta}^\mathsf{T}\mathbf{x}_i - \sum_{i=1}^{n} \log\left(1 + e^{\boldsymbol{\beta}^\mathsf{T}\mathbf{x}_i}\right). \)

The simplification in the second equality relies on the chosen labeling, that y_i = 0 or y_i = 1, which is indeed the reason why this labeling is convenient.

A necessary condition for the maximum of \( \log \ell(\boldsymbol{\beta}) \) is that its gradient is zero,

\( \nabla_{\boldsymbol{\beta}} \log \ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \mathbf{x}_i \left( \mathbb{1}\{y_i = 1\} - \frac{e^{\boldsymbol{\beta}^\mathsf{T}\mathbf{x}_i}}{1 + e^{\boldsymbol{\beta}^\mathsf{T}\mathbf{x}_i}} \right) = \mathbf{0}. \)

2 We now add β to the expression p(y | x), to explicitly show its dependence also on β.

This is a vector-valued equation, i.e., a system of p + 1 equations in the p + 1 unknown elements of β. Unlike the linear regression model with Gaussian noise of Section 2.3.1, this maximum likelihood problem yields a nonlinear system of equations with no closed-form solution in general, so a numerical solver is needed (see Appendix B). The standard choice is the Newton–Raphson algorithm, which for this problem is equivalent to the iteratively reweighted least squares algorithm (Hastie, Tibshirani, and Friedman 2009, Chapter 4.4).

Algorithm 1: Logistic regression for binary classification

Data: Training data \( \{\mathbf{x}_i, y_i\}_{i=1}^n \) (with output classes y = 0, 1) and test input \( \mathbf{x}_\star \)
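The numbered steps of Algorithm 1 did not survive extraction here. As a stand-in, the following sketch learns β by plain gradient ascent on the log-likelihood — a deliberate simplification of the Newton–Raphson update the notes describe, with the same fixed points:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_fit(X, y, lr=0.5, n_iter=10000):
    """Maximize log l(beta) by gradient ascent; labels y must be 0/1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p1 = sigmoid(X @ beta)  # model's p(y = 1 | x_i) for every i
        grad = X.T @ (y - p1)   # gradient of the log-likelihood
        beta += lr * grad / len(y)
    return beta

def logistic_predict(X_star, beta):
    """Predict the most probable class for each row of X_star."""
    return (sigmoid(X_star @ beta) >= 0.5).astype(int)
```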

Decision boundaries for logistic regression

Logistic regression models the class probabilities p(y = 0 | x) and p(y = 1 | x). To make a prediction for a test input \( \mathbf{x}_\star \) after learning β from the training data, one final step remains: compute \( p(y = 0 \mid \mathbf{x}_\star) \) and \( p(y = 1 \mid \mathbf{x}_\star) \), and predict \( \hat{y}_\star \) as the class with the highest probability,

\( \hat{y}_\star = \arg\max_{k = 0, 1} \; p(y = k \mid \mathbf{x}_\star). \)

This is illustrated in Figure 3.2 for a one-dimensional input x.

A classifier maps every possible test input x to a prediction \( \hat{y} \), typically creating contiguous regions that share the same prediction. The curve separating such regions — where the class prediction changes — is called the decision boundary. Examples of decision boundaries for logistic regression are shown in Figure 3.2 for a one-dimensional input and in Figure 3.3 for two-dimensional inputs.

We can find the decision boundary by solving the equation

\( p(y = 1 \mid \mathbf{x}) = p(y = 0 \mid \mathbf{x}), \)   (3.11)

which with logistic regression gives

\( \frac{e^{\boldsymbol{\beta}^\mathsf{T}\mathbf{x}}}{1 + e^{\boldsymbol{\beta}^\mathsf{T}\mathbf{x}}} = \frac{1}{1 + e^{\boldsymbol{\beta}^\mathsf{T}\mathbf{x}}} \iff \boldsymbol{\beta}^\mathsf{T}\mathbf{x} = 0. \)

The equation \( \boldsymbol{\beta}^\mathsf{T}\mathbf{x} = 0 \) parameterizes a (linear) hyperplane. Hence, the decision boundaries in logistic regression always have the shape of a (linear) hyperplane.

We distinguish between different types of classifiers by the shape of their decision boundary: since logistic regression only has linear decision boundaries, it is consequently called a linear classifier.

Figure 3.2: In binary classification, where the output is either 0 or 1, logistic regression models the probabilities of each class as functions of a scalar input x. After learning the parameter β from training data, the model gives the probability of y = 1 (blue) and y = 0 (red) for any test input. To make a class prediction, the class with the highest predicted probability is selected; the point where the prediction changes from one class to the other is the decision boundary, shown as a dashed vertical line.

(a) Logistic regression for binary classification (K = 2 classes) always gives a linear decision boundary. The red dots and green circles are training data from two different classes, and the border between the red and green fields is the decision boundary learned by the logistic regression classifier from the training data.

(b) Logistic regression for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of two classes is still linear.

Figure 3.3: Examples of decision boundaries for logistic regression.

Logistic regression for more than two classes

Logistic regression can be generalized to more than two classes (K > 2) in several ways. We focus on one particular generalization that will also be useful for deep learning in Chapter 7. It involves two steps: one-hot encoding of the outputs, and replacing the logistic function with the softmax function.

One-hot encoding represents the categorical output as a K-dimensional binary vector instead of a single integer: the vector has a 1 in the position corresponding to the class and 0s elsewhere. For K = 3, the encoding is

Vanilla encoding    One-hot encoding
y_i = 1             y_i = [1 0 0]^T
y_i = 2             y_i = [0 1 0]^T
y_i = 3             y_i = [0 0 1]^T

Since we now have a vector-valued output y, we also need a vector-valued alternative to the logistic function. To this end, we introduce the vector-valued softmax function

\( \operatorname{softmax}(\mathbf{z}) = \frac{1}{\sum_{j=1}^{K} e^{z_j}} \left[ e^{z_1}, \, e^{z_2}, \, \dots, \, e^{z_K} \right]^\mathsf{T}. \)   (3.14)

The softmax function, represented by a K-dimensional input vector \( z = [z_1, z_2, \ldots, z_K] \), possesses key properties: its output vector sums to 1, and each element lies within the range of [0,1] In a manner akin to the integration of linear regression with the logistic function for binary classification, we now merge linear regression with the softmax function to effectively model class probabilities.

In multi-class logistic regression we use K parameter vectors β₁, …, β_K, one per class, so the number of parameters to learn grows with K; we denote them collectively by θ. As in the binary case, we can use the maximum likelihood approach to learn the parameters. With one-hot encoding, the log-likelihood takes the form

\( \log \ell(\boldsymbol{\theta}) = \log p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \sum_{i: y_i = 1} \log p(1 \mid \mathbf{x}_i; \boldsymbol{\theta}) + \sum_{i: y_i = 2} \log p(2 \mid \mathbf{x}_i; \boldsymbol{\theta}) + \cdots + \sum_{i: y_i = K} \log p(K \mid \mathbf{x}_i; \boldsymbol{\theta}), \)

which can be written compactly using the elements \( y_{ik} \) of the one-hot encoding vectors as

\( \log \ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log p(k \mid \mathbf{x}_i; \boldsymbol{\theta}). \)   (3.16)

As in the binary case, this serves as an objective function for numerical optimization. This construction appears whenever one-hot encoding is used and is commonly known (up to a sign) as the cross-entropy. A code sketch follows below.
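A sketch of the softmax (3.14) and the resulting cross-entropy objective (3.16) for one-hot labels; subtracting the row-wise max is a standard numerical-stability trick that does not change the value of the softmax:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax of scores z_i = [beta_1^T x_i, ..., beta_K^T x_i]."""
    Z = Z - Z.max(axis=1, keepdims=True)  # for numerical stability only
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def cross_entropy(Y_onehot, P):
    """Negative of (3.16): -sum_i sum_k y_ik * log p(k | x_i)."""
    return -np.sum(Y_onehot * np.log(P))

# Toy example with n = 2 data points and K = 3 classes
Z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 0.3]])
Y = np.array([[1, 0, 0],
              [0, 0, 1]])  # one-hot encoded labels
loss = cross_entropy(Y, softmax(Z))  # to be minimized numerically over theta
```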

Linear and quadratic discriminant analysis (LDA & QDA)

Using Gaussian approximations in Bayes’ theorem

From probability theory, Bayes' theorem might be familiar, which says that

\( p(y \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})}. \)

In classification, we are primarily interested in the left-hand side, p(y | x). In a real-world machine learning problem we have no direct access to either side of this equation: we only have training data, not equations. Whereas logistic regression models p(y | x) directly, LDA and QDA take a different route and assume that the likelihood p(x | y) is a Gaussian distribution, regardless of how the data are actually distributed. Since p(x | y) is a distribution over the input x, which usually has several dimensions, it is a multivariate Gaussian, characterized by a mean vector and a covariance matrix.

The classifier should be learned from training data, which here means estimating the parameters of the Gaussian distribution: the mean vector \( \hat{\boldsymbol{\mu}} \) and the covariance matrix \( \hat{\boldsymbol{\Sigma}} \). In LDA the mean vector is assumed to differ between the classes, while the covariance matrix is assumed common to all classes; in QDA both the mean vector and the covariance matrix differ between classes. The hats on \( \hat{\boldsymbol{\mu}} \) and \( \hat{\boldsymbol{\Sigma}} \) indicate that they are learned from data. Bayes' theorem also involves p(y), the probability that a random data point belongs to class y, which is likewise unknown. We estimate it by the class proportions in the training data, \( p(k) \approx \hat{\pi}_k \); for instance, if 22% of the training data is labeled class 1, we take \( \hat{\pi}_1 = 0.22 \).

Thus, in LDA, p(y | x) is modeled as

\( p(y = k \mid \mathbf{x}_\star) = \frac{\hat{\pi}_k \, \mathcal{N}\!\left(\mathbf{x}_\star \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}\right)}{\sum_{j=1}^{K} \hat{\pi}_j \, \mathcal{N}\!\left(\mathbf{x}_\star \mid \hat{\boldsymbol{\mu}}_j, \hat{\boldsymbol{\Sigma}}\right)} \)   (3.18)

for k = 1, 2, …, K. This is what is shown in Figure 3.4.

3 Not to be confused with Latent Dirichlet Allocation, which is a completely different machine learning method.

4 TODO: This actually assumes that x is quantitative. How are qualitative inputs handled in LDA/QDA?

(a) In LDA, the input x is assumed to have a Gaussian distribution p(x | y) for each output class y, with a different mean for each class but the same covariance for all classes. The plot shows level curves of p(x | y = 0) and p(x | y = 1) together with the training data points of each class — this is how the training data is assumed to look when we derive the LDA model.

(b) In QDA, x is likewise assumed to have a Gaussian distribution for each output class, but both the mean and the covariance may differ between the classes. The plot shows how the training data is assumed to look when we derive the QDA model.

Figure 3.4: LDA and QDA both build on the assumption that p(x | y) is Gaussian, i.e., they treat the input x as a random variable with a certain distribution. In LDA, the covariance of the input distribution is the same for all classes, which therefore differ only in their locations, whereas QDA allows different covariances for different classes.

In practical applications of LDA and QDA, the underlying assumptions about the distribution of the input x are rarely met exactly. The assumptions should instead be seen as the motivation behind these methods.

In full analogy, for QDA we have (the only difference is the class-specific covariance matrix \( \hat{\boldsymbol{\Sigma}}_k \))

\( p(y = k \mid \mathbf{x}_\star) = \frac{\hat{\pi}_k \, \mathcal{N}\!\left(\mathbf{x}_\star \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k\right)}{\sum_{j=1}^{K} \hat{\pi}_j \, \mathcal{N}\!\left(\mathbf{x}_\star \mid \hat{\boldsymbol{\mu}}_j, \hat{\boldsymbol{\Sigma}}_j\right)}. \)   (3.19)

We have not imposed any restrictions on K; however, both Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) can be effectively utilized for both binary and multi-class classification tasks In the following sections, we will delve deeper into various aspects of these methods.

Using LDA and QDA in practice

LDA and QDA are derived from Bayes’ theorem, focusing on the left side of the equation while assuming that p(x|y) follows a Gaussian distribution Although this assumption often does not hold true in practical scenarios, both LDA and QDA remain effective classifiers even when the Gaussian distribution assumption is not verified.

To learn an LDA or QDA classifier from training data \( \{\mathbf{x}_i, y_i\}_{i=1}^n \), we do not need to know the true distribution p(x | y); we simply estimate the parameters of the assumed Gaussian model from the training data, and then use the learned model to make predictions for new data points.

Learning an LDA or QDA classifier thus amounts to estimating the parameters \( \hat{\pi}_k \), \( \hat{\boldsymbol{\mu}}_k \), and \( \hat{\boldsymbol{\Sigma}} \) (for LDA) or \( \hat{\boldsymbol{\Sigma}}_k \) (for QDA) for each class k = 1, …, K. The simplest to estimate is \( \hat{\pi}_k \), the relative frequency of class k in the training data,

\( \hat{\pi}_k = \frac{n_k}{n}, \)   (3.20a)

where \( n_k \) is the number of training data samples in class k. Consequently, all \( n_k \) sum to n, and thereby \( \sum_k \hat{\pi}_k = 1 \). Further, the mean vector \( \boldsymbol{\mu}_k \) of each class is learned as

\( \hat{\boldsymbol{\mu}}_k = \frac{1}{n_k} \sum_{i: y_i = k} \mathbf{x}_i, \)   (3.20b)

the empirical mean among all training samples of class k. For LDA, the common covariance matrix Σ for all classes is usually learned as

\( \hat{\boldsymbol{\Sigma}} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^\mathsf{T}, \)   (3.20c)

which can be shown to be an unbiased estimate of the covariance matrix⁵. For QDA, one covariance matrix \( \boldsymbol{\Sigma}_k \) has to be learned for each class k = 1, …, K, usually as

\( \hat{\boldsymbol{\Sigma}}_k = \frac{1}{n_k - 1} \sum_{i: y_i = k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^\mathsf{T}, \)   (3.20d)

which similarly can be shown to be an unbiased estimate.

Remark 3.3 To derive the learning of LDA and QDA, we did not make use of the maximum likelihood idea, in contrast to linear and logistic regression Furthermore, learning LDA and QDA amounts to inserting the training data into the closed-form expressions(3.20), similar to linear regression (the normal equations), but different from logistic regression (which requires numerical optimization).

After determining the parameters \( \hat{\pi}_k \), \( \hat{\boldsymbol{\mu}}_k \), and \( \hat{\boldsymbol{\Sigma}} \) (or \( \hat{\boldsymbol{\Sigma}}_k \)) for each class k = 1, …, K, we have a model for p(y | x) and can make predictions for a test input \( \mathbf{x}_\star \). As in logistic regression, we turn \( p(y \mid \mathbf{x}_\star) \) into an actual prediction by selecting the most probable class,

\( \hat{y}_\star = \arg\max_k \; p(y = k \mid \mathbf{x}_\star). \)

We summarize this in Algorithms 2 and 3, and illustrate it in Figures 3.5 and 3.6.

Algorithm 2: Linear Discriminant Analysis, LDA

Data: Training data \( \{\mathbf{x}_i, y_i\}_{i=1}^n \) (with output classes k = 1, …, K) and test input \( \mathbf{x}_\star \)

8 Find the largest \( p(y = k \mid \mathbf{x}_\star) \) and set \( \hat{y}_\star \) to that k

5 This means that if we estimate \( \hat{\boldsymbol{\Sigma}} \) like this for new training data over and over again, the average would be the true covariance matrix of p(x).

Algorithm 3: Quadratic Discriminant Analysis, QDA

Data: Training data \( \{\mathbf{x}_i, y_i\}_{i=1}^n \) (with output classes k = 1, …, K) and test input \( \mathbf{x}_\star \)

7 Find the largest \( p(y = k \mid \mathbf{x}_\star) \) and set \( \hat{y}_\star \) to that k
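Since the numbered steps of Algorithms 2 and 3 were largely lost in extraction, here is a sketch of how the closed-form learning (3.20) and the prediction rule (3.18) translate into code for LDA; classes are assumed to be labeled 0, …, K−1, and QDA would differ only in estimating one covariance per class via (3.20d):

```python
import numpy as np
from scipy.stats import multivariate_normal

def lda_fit(X, y, K):
    """Estimate pi_k (3.20a), mu_k (3.20b), and the shared Sigma (3.20c)."""
    n, p = X.shape
    pi = np.array([np.mean(y == k) for k in range(K)])
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])
    Sigma = sum((X[y == k] - mu[k]).T @ (X[y == k] - mu[k])
                for k in range(K)) / (n - K)
    return pi, mu, Sigma

def lda_predict(x_star, pi, mu, Sigma):
    """Evaluate (3.18) and return the most probable class for x_star."""
    dens = np.array([multivariate_normal.pdf(x_star, mean=mu[k], cov=Sigma)
                     for k in range(len(pi))])
    posterior = pi * dens / np.sum(pi * dens)
    return int(np.argmax(posterior))
```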

Figure 3.5 illustrates LDA for three classes (K = 3) with a one-dimensional input (p = 1). The upper left panel shows the Gaussian model of p(x | k), with parameters \( \hat{\mu}_k \) and \( \hat{\sigma}^2 \) learned from training data (not shown); since p = 1, there is only a scalar variance \( \hat{\sigma}^2 \) rather than a covariance matrix. The upper right panel shows \( \hat{\pi}_k \), the approximation of p(k), which together with Bayes' theorem gives p(k | x) in the bottom panel. The final class prediction is the class with the highest probability, i.e., the topmost solid colored line: for the input \( x_\star = 0.7 \), the prediction is \( \hat{y}_\star = 2 \) (green). The decision boundaries, the vertical dotted lines in the bottom plot, lie at the intersections of the solid colored lines.


Figure 3.6 illustrates QDA for three classes. The key difference from LDA (Figure 3.5) is that the learned variance \( \hat{\Sigma}_k \) of p(x | k) is different for each class k, which gives more complicated decision boundaries. This is evident around −0.5, where the blue class (\( \hat{y} = 3 \)) appears between the red (\( \hat{y} = 1 \)) and green (\( \hat{y} = 2 \)) classes.

Decision boundaries for LDA and QDA

Once the model has been learned from training data, we can compute predictions for a test input $x_\star$ by evaluating (3.18) and (3.19) for each class $k$ and selecting the class for which $p(y \mid x_\star)$ is largest. These expressions are simple enough to let us study the decision boundary, i.e., the set of points in the input space where the prediction changes from one class to another.

For LDA, the maximizing argument ($\arg\max_k$) is not changed by taking logarithms or dropping terms that are independent of $k$. We can therefore rewrite the classifier as

$$\hat y_{\text{LDA}} = \arg\max_k\, p(y = k \mid x) = \arg\max_k\, \log p(y = k \mid x)$$
$$= \arg\max_k\, \left[\log \hat\pi_k + \log \mathcal{N}\!\left(x \mid \hat\mu_k, \hat\Sigma\right)\right]$$
$$= \arg\max_k\, \left[\log \hat\pi_k - \tfrac{1}{2}(x - \hat\mu_k)^\mathsf{T} \hat\Sigma^{-1} (x - \hat\mu_k)\right]$$
$$= \arg\max_k\, \underbrace{\left[\log \hat\pi_k - \tfrac{1}{2}\hat\mu_k^\mathsf{T} \hat\Sigma^{-1} \hat\mu_k + x^\mathsf{T} \hat\Sigma^{-1} \hat\mu_k\right]}_{\triangleq\, \delta_k^{\text{LDA}}(x)}.$$

The function $\delta_k^{\text{LDA}}(x)$ is often referred to as the discriminant function. The decision boundary between two class predictions, say $k = 0$ and $k = 1$, is the set of points $x$ for which $\delta_0^{\text{LDA}}(x) = \delta_1^{\text{LDA}}(x)$, i.e., where

$$\log \hat\pi_0 - \tfrac{1}{2}\hat\mu_0^\mathsf{T} \hat\Sigma^{-1} \hat\mu_0 + x^\mathsf{T} \hat\Sigma^{-1} \hat\mu_0 = \log \hat\pi_1 - \tfrac{1}{2}\hat\mu_1^\mathsf{T} \hat\Sigma^{-1} \hat\mu_1 + x^\mathsf{T} \hat\Sigma^{-1} \hat\mu_1$$
$$\Leftrightarrow\quad x^\mathsf{T} \hat\Sigma^{-1} (\hat\mu_0 - \hat\mu_1) = \log \hat\pi_1 - \log \hat\pi_0 - \tfrac{1}{2}\hat\mu_1^\mathsf{T} \hat\Sigma^{-1} \hat\mu_1 + \tfrac{1}{2}\hat\mu_0^\mathsf{T} \hat\Sigma^{-1} \hat\mu_0.$$

This is of the format $\{x : x^\mathsf{T} A = c\}$ for a constant vector $A$ and a scalar $c$, which describes a hyperplane in the $x$-space. The decision boundary of LDA is thus always linear, which is the reason for the name linear discriminant analysis.

For QDA we can make a similar derivation,

$$\hat y_{\text{QDA}} = \arg\max_k\, \underbrace{\left[\log \hat\pi_k - \tfrac{1}{2} \log \det \hat\Sigma_k - \tfrac{1}{2}(x - \hat\mu_k)^\mathsf{T} \hat\Sigma_k^{-1} (x - \hat\mu_k)\right]}_{\triangleq\, \delta_k^{\text{QDA}}(x)}, \qquad (3.25)$$

and set $\delta_0^{\text{QDA}}(x) = \delta_1^{\text{QDA}}(x)$ to find the decision boundary as the set of points $x$ for which

$$\log \hat\pi_0 - \tfrac{1}{2} \log \det \hat\Sigma_0 - \tfrac{1}{2}(x - \hat\mu_0)^\mathsf{T} \hat\Sigma_0^{-1} (x - \hat\mu_0) = \log \hat\pi_1 - \tfrac{1}{2} \log \det \hat\Sigma_1 - \tfrac{1}{2}(x - \hat\mu_1)^\mathsf{T} \hat\Sigma_1^{-1} (x - \hat\mu_1).$$

This is now on the format $\{x : x^\mathsf{T} A + x^\mathsf{T} B x = c\}$, a quadratic form (the quadratic terms no longer cancel, since $\hat\Sigma_0 \ne \hat\Sigma_1$). The decision boundary for QDA is thus always quadratic (and thereby also nonlinear!), which is the reason for the name quadratic discriminant analysis.


(a) LDA with two classes (K = 2) always gives a linear decision boundary. The red dots are training data from one class and the green circles training data from the other; the border between the red and green regions is the decision boundary learned by the LDA classifier from this training data.

(b) LDA for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of classes is still linear.

(c) QDA has quadratic (i.e., nonlinear) decision boundaries, as in this example where a QDA classifier is learned from the shown training data.

(d) With K = 3 classes, the decision boundaries for QDA can be more complex than with LDA, as in this case (cf. (b)).

Figure 3.7: Examples of decision boundaries for LDA and QDA. Compare with Figure 3.3, which shows the decision boundary for logistic regression learned from the same training data: logistic regression and LDA both have linear decision boundaries, but they are not identical.

(a) With K = 2 classes, Bayes' classifier tells us to take the class which has probability > 0.5 as the prediction $\hat y$. Here, the prediction would therefore be $\hat y = 1$.

(b) With K = 4 classes, Bayes' classifier tells us to take the class with the highest probability as the prediction $\hat y$, here $\hat y = 4$. Note that, in contrast to the case with K = 2 classes, it is now possible that no class has a probability > 0.5.

Bayes' classifier — a theoretical justification for turning $p(y \mid x)$ into $\hat y$

Bayes’ classifier

A good classifier is one for which the predicted output $\hat y$ equals the true output $y$ for as many test data points as possible, i.e., a classifier with few misclassifications. If we knew the conditional probabilities $p(y \mid x)$ exactly, we could construct the optimal classifier as $\hat y = \arg\max_k p(y = k \mid x)$. In practice, classifiers such as logistic regression, LDA and QDA only provide estimates of $p(y \mid x)$, not the exact probabilities.

The classifier that predicts the label with the highest probability given the input $x$,

$$\hat y = \arg\max_k\, p(y = k \mid x), \qquad (3.27)$$

is called Bayes' classifier; it is illustrated in Figure 3.8. We first show why it is optimal, and thereafter discuss how it relates to the other classifiers.

Optimality of Bayes’ classifier

We want to choose the prediction $\hat y$ such that the probability of it being equal to the (random) output $y$ is maximized. This probability can be written as an expected value over the distribution $p(y \mid x)$, using the indicator function $\mathbb{I}\{\cdot\}$, which is one if its argument is true and zero otherwise:

$$\mathbb{E}_{y \sim p(y \mid x)}\left[\mathbb{I}\{\hat y = y\}\right] = \sum_{k=1}^{K} \mathbb{I}\{\hat y = k\}\, p(y = k \mid x) = p(y = \hat y \mid x), \qquad (3.28)$$

where we have used the definition of the expected value and removed all terms that are zero. To maximize (3.28), we should choose $\hat y$ such that $p(y = \hat y \mid x)$ attains its maximal value, i.e., take the class with the highest probability, as claimed in (3.27).


Bayes’ classifier in practice: useless, but a source of inspiration

Bayes' classifier is formulated in terms of the conditional distribution $p(y \mid x)$, and thereby assumes that this distribution is known. If $p(y \mid x)$ were known, Bayes' classifier would be optimal and no other methods would be needed. In most machine learning problems, however, $p(y \mid x)$ is not known; this is, in a sense, the very essence of machine learning — our only knowledge about how $y$ depends on $x$ comes from the training data.

Nevertheless, Bayes' classifier is a useful concept, since many classifiers — including several not yet introduced — can be understood as different approximations of it: in one way or another, these methods estimate $p(y \mid x)$ from training data.^6 We give a brief preview of how some classifiers connect to this idea:

• In binary logistic regression, $p(y \mid x)$ is modeled as the logistic function of a linear combination of the inputs (Section 3.2).

• In LDA and QDA, $p(y \mid x)$ is obtained via Bayes' theorem, where $p(x \mid y)$ is modeled as a Gaussian distribution (with parameters learned from training data) and $p(y)$ as the empirical class frequencies in the training data.

• In k-nearest neighbors (k-NN), $p(y \mid x)$ is modeled as the empirical distribution among the $k$ nearest samples in the training data.

• In tree-based methods, $p(y \mid x)$ is modeled as the empirical distribution among the training data samples in the same leaf node.

• In deep learning, $p(y \mid x)$ is modeled using a deep neural network and a softmax function.

Hence, different classifiers correspond to different ways of modeling the conditional probability $p(y \mid x)$. The standard approach to turning such a model into predictions is to mimic Bayes' classifier and take the class $y$ with the highest probability $p(y \mid x)$; with only two classes, this amounts to predicting the class whose probability exceeds 0.5.

Is it always good to predict according to Bayes’ classifier?

As discussed above, Bayes' classifier is not available in practice, since we only have data and no exact expression for $p(y \mid x)$. It can, however, serve as a guiding principle when turning a modeled approximation of $p(y \mid x)$ into predictions, and we have used it implicitly in this fashion throughout the chapter: the prediction $\hat y$ has been the class to which our model assigns the highest probability. Even though this is a sensible default, it is not always the best choice, and there are some caveats to keep in mind.

First, Bayes' classifier is optimal when the goal is to minimize the number of misclassifications, but that is not always the goal. If we are, say, predicting whether a patient is well or ill, the consequences of one type of error (predicting 'well' when the patient is in fact ill) may be far more severe than those of the other (predicting 'ill' when the patient is in fact well), or vice versa. The classification problem is then asymmetric, and mimicking Bayes' classifier is no longer optimal with respect to what we actually care about.

Second, Bayes' classifier is optimal only if $p(y \mid x)$ is known exactly. If we only have an approximation of $p(y \mid x)$, as is the case in practice, the optimality guarantee is lost, and taking the most probable class under the model is not necessarily the best decision rule anymore.

^6 Sometimes this is not very explicit in the method, but if you look carefully, you will find it.

More on classification and classifiers

Linear and nonlinear classifiers

For regression, the term linear regression refers to a model that is linear in its parameters. For classification, the terminology is different: a linear classifier is one whose decision boundaries are linear, and a nonlinear classifier one whose decision boundaries are nonlinear. Logistic regression and LDA are thus linear classifiers, whereas QDA is a nonlinear classifier. Note that even though logistic regression and LDA are both linear, their decision boundaries are not identical. Most classifiers introduced in the following chapters are nonlinear; the decision stump (a tree with a single split) is a notable exception.

In analogy with linear regression, a linear classifier can be extended with nonlinear transformations of the inputs — creating more features — so that it can describe more complicated decision boundaries. Such transformations, however, have to be crafted and chosen manually. A more automatic, and very common, way of constructing a powerful classifier from a simple one is boosting, which we discuss in Chapter 6.

Regularization

Just as in linear regression, overfitting can occur in classification, in particular if the number of training samples is not much larger than the number of inputs. One remedy is regularization. For logistic regression, a popular choice is to add a ridge-regression-like penalty on the coefficients. For LDA and QDA, one can instead regularize the estimate of the covariance matrix.

Evaluating binary classifiers

The case of binary classification, K = 2, deserves special attention, since it covers the important situation where the classifier is used to detect the presence of something: a disease, an object on a radar, etc. In such applications, it is customary to let $y = 1$ denote the 'positive' case (presence) and $y = 0$ the 'negative' case (absence). These detection-type problems typically have two characteristics:

(i) A missed detection (predicting $\hat y = 0$ when in fact $y = 1$) might have much more severe consequences than a false detection (predicting $\hat y = 1$ when in fact $y = 0$), or vice versa.

(ii) The dataset is often imbalanced, with far more negative ($y = 0$) than positive ($y = 1$) data points. A classifier can then achieve a high accuracy simply by always predicting $\hat y = 0$: a medical support system that always declares the patient 'healthy' would be right most of the time, but completely useless.

For such classification problems, there is a set of analysis tools and terminology which we introduce now.


Ratio    Name(s)
FP/N     False positive rate, fall-out, probability of false alarm
TN/N     True negative rate, specificity, selectivity
TP/P     True positive rate, sensitivity, power, recall, probability of detection
FN/P     False negative rate, miss rate
TP/P*    Positive predictive value, precision

Table 3.1: Common terminology related to the quantities (TN, FN, FP, TP) in the confusion matrix. Here N = TN + FP is the number of negative ($y = 0$) test data points, P = FN + TP the number of positive ($y = 1$) test data points, and P* = FP + TP the number of data points predicted as positive ($\hat y = 1$).

A confusion matrix is a simple way to visualize the performance of a binary classifier on a test dataset. It splits the test data into four groups, depending on the actual output $y$ and the predicted output $\hat y$:

                       y = 0 (negative)      y = 1 (positive)
    $\hat y$ = 0       True Negative (TN)    False Negative (FN)
    $\hat y$ = 1       False Positive (FP)   True Positive (TP)

In the confusion matrix, TN, FN, FP and TP are replaced by the corresponding counts, as we will see in the example below. A list of common terminology related to the confusion matrix is given in Table 3.1.

The confusion matrix gives a compact and informative summary of a classifier's performance. Depending on the application, it can be important to distinguish between false positives (type I errors) and false negatives (type II errors). Ideally both should be zero, but this is rarely the case in practice.

Inspired by Bayes' classifier, our default approach has been to turn the modeled probability $p(y = 1 \mid x)$ into a prediction using the threshold rule

$$\hat y = \begin{cases} 1 & \text{if } p(y = 1 \mid x) \ge t, \\ 0 & \text{if } p(y = 1 \mid x) < t, \end{cases} \qquad (3.30)$$

with the default threshold $t = 0.5$. If we want to decrease the number of false positives, we can increase the threshold $t$ — at the price of a possible increase in false negatives — and vice versa.
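As a concrete illustration of the threshold rule (3.30), here is a small Python/NumPy sketch (confusion_matrix is our own helper name, not from the course material):

    import numpy as np

    def confusion_matrix(y_true, p1, t=0.5):
        """Confusion matrix for the rule (3.30): predict 1 if p(y=1|x) >= t."""
        y_pred = (p1 >= t).astype(int)
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        tp = np.sum((y_pred == 1) & (y_true == 1))
        return tn, fn, fp, tp

Increasing $t$ moves test data points from the bottom row of the confusion matrix to the top row, trading false positives for false negatives.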

Tuning the threshold $t$ in (3.30) can thus be an important part of designing a binary classifier. A useful tool for this, and for comparing different classifiers (say, logistic regression and QDA) on a given problem, is the ROC curve. The name is an abbreviation of 'receiver operating characteristics' and has its origin in communications theory.

To plot an ROC curve, the true positive rate TP/P is drawn against the false positive rate FP/N for all values of $t \in [0, 1]$. The curve typically looks like the one shown in Figure 3.9. A perfect classifier — one that predicts the correct class with full certainty — touches the upper left corner, whereas a classifier that assigns random guesses traces the diagonal straight line.

A compact one-number summary of the ROC curve is the area under it, the AUC (area under curve). A perfect classifier has AUC 1, whereas a classifier making random guesses has AUC 0.5.
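Both the ROC curve and the AUC can be computed with a simple threshold sweep. A self-contained sketch (the helper names are our own, and we assume both classes occur in the test data):

    import numpy as np

    def roc_curve(y_true, p1, num_thresholds=101):
        """Sweep the threshold t in (3.30) and collect (FP/N, TP/P) points."""
        N, P = np.sum(y_true == 0), np.sum(y_true == 1)
        points = []
        for t in np.linspace(1.0, 0.0, num_thresholds):   # from (0,0) towards (1,1)
            y_pred = (p1 >= t).astype(int)
            fp = np.sum((y_pred == 1) & (y_true == 0))
            tp = np.sum((y_pred == 1) & (y_true == 1))
            points.append((fp / N, tp / P))
        return np.array(points)

    def auc(points):
        """Area under the ROC curve, via the trapezoidal rule."""
        return np.trapz(points[:, 1], points[:, 0])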

Example 3.2: Confusion matrix in thyroid disease detection

The thyroid is an endocrine gland that, among other things, regulates the metabolic rate and protein synthesis; thyroid disorders can therefore have serious consequences. We consider the problem of detecting thyroid disease using a dataset from the UCI Machine Learning Repository with 7200 data points, each consisting of 21 medical indicators (inputs). The original diagnoses are normal, hyperthyroid and hypothyroid, which we simplify into a binary problem: normal versus not normal. The dataset is split into a training set of 3772 samples and a test set of 3428 samples. We learn a logistic regression classifier from the training data and evaluate it on the test data with the default threshold $t = 0.5$, obtaining the confusion matrix

                             y = normal    y = not normal
    $\hat y$ = normal           3177             237
    $\hat y$ = not normal          1              13

Most test data points are correctly predicted as normal, but a large part of the not normal data is also falsely predicted as normal. This might indeed be undesired in the application.

To change the picture, we change the threshold to $t = 0.15$, and obtain new predictions with the following confusion matrix instead:

                             y = normal    y = not normal
    $\hat y$ = normal           3067             165
    $\hat y$ = not normal        111              85

With the new threshold, more of the not normal patients are correctly detected (85, compared to only 13 before), but at the price of more false detections: 111 patients who are in fact normal are now predicted as not normal, compared to only 1 before. Whether this trade-off is an improvement or not depends on the application — for instance, on how severe the consequences of each type of error are.

This also illustrates why looking only at total accuracy (one minus the misclassification rate) can be misleading: a trivial predictor that always predicts normal has an accuracy of about 93% on this test data, while the $t = 0.15$ classifier above has an accuracy of 92% but is clearly much more useful in practice.

Figure 3.9: ROC curves: a typical example, together with the curves for a perfect classifier and for random guessing.

4 Non-parametric methods for regression and classification: k-NN and trees

The methods we have encountered so far — linear regression, logistic regression, LDA and QDA — all have a fixed set of parameters that are learned from training data. Once the parameters are learned, the training data can be discarded: it is not needed for making predictions. More training data typically means the parameters can be estimated more accurately (with less variance), but the structure of the model is fixed: logistic regression, for example, can never describe anything but linear decision boundaries, no matter how much training data is available.

Another class of methods instead lets the model adapt to the training data, rather than committing to a fixed structure with a fixed number of parameters. Two such methods are k-nearest neighbors (k-NN) and trees. Both can be used for classification as well as regression, but we will focus on their use in classification.

k-NN

Decision boundaries for k-NN

In Example 4.1 we only computed a prediction for one test data point $x_\star$. If we move that test point to $x_\star^{\text{alt}} = [0\ \ 2]^\mathsf{T}$, the set of $k = 3$ nearest training data points changes: $i = 2$ is replaced by $i = 6$, alongside $i = 1$. The prediction becomes $p(\text{Red} \mid x_\star^{\text{alt}}) = 2/3$, and hence $\hat y = \text{Red}$. The point $[0.5\ \ 2]^\mathsf{T}$, midway between $x_\star$ and $x_\star^{\text{alt}}$, is equidistant to $i = 1$ and $i = 2$ and therefore lies exactly on the decision boundary. Reasoning in this way about all points in the input space, we can sketch the complete decision boundaries shown in Figure 4.1. As the figure illustrates, k-NN is a nonlinear classification method: its decision boundaries can be arbitrarily complicated.

Figure 4.1: Decision boundaries for the problem in Example 4.1 for the two choices of the parameter k.

Choosing k

The user has to decide which $k$ to use in k-NN, and this decision has a big impact on the final classifier.

Figure 4.2 shows a problem with two input variables ($p = 2$), three classes ($K = 3$) and a somewhat larger number of training data points. The two subfigures show the decision boundaries of a k-NN classifier for $k = 1$ and $k = 11$, respectively.

With $k = 1$, all training data points are correctly classified, and the decision boundaries adapt closely to the training data — including its noise and random effects. With $k = 11$, the averaging over 11 neighbors means that some training data points end up on the 'wrong' side of the boundary, and the decision boundaries are less tailored to the training data. Despite the perfect fit of $k = 1$, the choice $k = 11$ is often preferred, since it is less prone to overfitting and can therefore be expected to perform better on unseen test data. A systematic way of choosing $k$ is cross-validation, which we discuss in Chapter 5.

(a) Decision boundary for k-NN with k = 1 for a 3-class problem. A complex, but possibly also overfitted, decision boundary.

(b) Decision boundary for k-NN with k = 11 A more rigid and less flexible decision boundary.

Figure 4.2: Decision boundaries for k-NN.

Normalization

One practical aspect of k-NN that deserves attention is the normalization of the input data. Since k-NN uses Euclidean distances to determine which training data points are closest to a test point, the distances must be meaningful measures of closeness. Consider a training dataset with two input variables, where the first takes values in [0, 100] and the second in [0, 1] — a situation that easily arises when the inputs represent different physical quantities in different units. The Euclidean distance between a test point $x_\star$ and a training data point $x_i$,

$$\sqrt{(x_{i1} - x_{\star 1})^2 + (x_{i2} - x_{\star 2})^2},$$

would then almost exclusively depend on the first term $(x_{i1} - x_{\star 1})^2$, and the values of the second component $x_{i2}$ would have almost no impact.

A common remedy is to normalize the data. In the example, we could divide the first input by 100, i.e., create $x_{i1}^{\text{new}} = x_{i1}/100$, so that both inputs take values in [0, 1]. More generally, this min-max normalization is

$$x_{ij}^{\text{new}} = \frac{x_{ij} - \min_\ell x_{\ell j}}{\max_\ell x_{\ell j} - \min_\ell x_{\ell j}}, \qquad \text{for all } j = 1, \dots, p,\ i = 1, \dots, n.$$

Another common normalization uses the mean and standard deviation of the training data,

$$x_{ij}^{\text{new}} = \frac{x_{ij} - \bar x_j}{\sigma_j}, \qquad \text{for all } j = 1, \dots, p,\ i = 1, \dots, n,$$

where $\bar x_j$ and $\sigma_j$ are the mean and standard deviation of input variable $j$ in the training data.
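A minimal sketch of the second normalization and of k-NN prediction in Python/NumPy (the helper names are our own). Note that the test data is normalized with the statistics of the training data:

    import numpy as np

    def standardize(X_train, X_test):
        """Normalize with training-set mean and standard deviation."""
        mean = X_train.mean(axis=0)
        std = X_train.std(axis=0)
        return (X_train - mean) / std, (X_test - mean) / std

    def knn_predict(X_train, y_train, x_star, k):
        """k-NN: most common class among the k nearest training points."""
        dist = np.sqrt(((X_train - x_star) ** 2).sum(axis=1))  # Euclidean distances
        nearest = np.argsort(dist)[:k]
        classes, counts = np.unique(y_train[nearest], return_counts=True)
        return classes[counts.argmax()]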

Trees

Basics

A classification tree models $p(y \mid x)$ via a sequence of simple rules on the input variables $x_1, \dots, x_p$, which can be visualized as a binary tree. Equivalently, the tree partitions the input space into disjoint regions, with a constant predicted class probability $p(y \mid x)$ within each region. An example illustrates the idea.

Example 4.2: Predicting colors with a classification tree

We consider a classification problem with two input variables, $x_1$ and $x_2$, and a qualitative output with the two classes Red and Blue. To classify a test point $x_\star = [x_{\star 1}\ \ x_{\star 2}]^\mathsf{T}$ with the classification tree below, we start at the top and traverse downwards until a leaf is reached; each leaf carries a constant predicted class probability $p(\text{Red} \mid x_\star)$. In this tree, the rules are $x_2 < 3.0$ and $x_1 < 5.0$.

The tree consists of internal nodes and leaf nodes. Each internal node applies a rule of the form $x_j < s_k$, where the left branch corresponds to the rule being true and the right branch to $x_j \ge s_k$. This particular tree has two internal nodes and three leaf nodes.

Equivalently, the tree partitions the input space into regions, one region per leaf node; the borders between the regions correspond to the splits in the tree. Each region can be colored according to the class with the highest predicted probability.

Pseudo code for classifying a test input with the tree above:

    if x_2 < 3.0 then
        return p(Red|x) = 0
    else
        if x_1 < 5.0 then
            return p(Red|x) = 1/3
        else
            return p(Red|x) = 1
        end
    end

For the test point $x_\star = [2.5\ \ 3.5]^\mathsf{T}$, the first rule sends us to the right branch, since $x_{\star 2} = 3.5 \ge 3.0$, and the second rule to the left branch, since $x_{\star 1} = 2.5 < 5.0$. The tree therefore predicts $p(\text{Red} \mid x_\star) = 1/3$ and $p(\text{Blue} \mid x_\star) = 2/3$. As shown in the figure, the tree partitions the input space into axis-aligned rectangular regions.

Using the terminology from the example, the endpoints $R_1$, $R_2$ and $R_3$ are leaf nodes, the internal splits $x_2 < 3.0$ and $x_1 < 5.0$ are internal nodes, and the lines connecting the nodes are branches. The region partition is hard to visualize for more than two input variables, but the tree works in exactly the same way: each internal node still has exactly two branches.

The example above shows how a classification tree is used to make predictions; we now turn to how the tree is learned from training data.

Training a classification tree

In a classification tree, the class probabilities $p(y = k \mid x)$ are modeled as a constant $c_{mk}$ within each region $R_m$, for each class $k = 1, \dots, K$:

$$p(y = k \mid x) = \sum_{m=1}^{M} c_{mk}\, \mathbb{I}\{x \in R_m\}, \qquad (4.4)$$

where $M$ is the total number of regions (leaf nodes) in the tree, and the indicator function $\mathbb{I}\{x \in R_m\}$ is 1 if $x \in R_m$ and 0 otherwise. For the probabilities in each region to sum to one, we also need the constraint $\sum_{k=1}^{K} c_{mk} = 1$ for all $m$.

To learn the tree from training data $\{x_i, y_i\}_{i=1}^{n}$ — that is, to determine the regions $R_m$ and the constants $c_{mk}$ — we seek the tree under which the observed data is as likely as possible. This is the maximum likelihood approach, just as for logistic regression. Equivalently, we minimize the negative logarithm of the likelihood: we seek the tree $T$ that minimizes

$$-\sum_{i=1}^{n} \log p(y_i \mid x_i). \qquad (4.5)$$

By inserting the model stated in (4.4) into (4.5), we get

$$-\sum_{i=1}^{n} \log p(y_i \mid x_i) = -\sum_{m=1}^{M} n_m \sum_{k=1}^{K} \hat\pi_{mk} \log c_{mk},$$

where $\hat\pi_{mk}$ is the proportion of training data points in region $R_m$ that belong to class $k$, and $n_m$ is the total number of training data points in region $m$. One can show^1 that

$$-\sum_{k=1}^{K} \hat\pi_{mk} \log c_{mk} \;\ge\; -\sum_{k=1}^{K} \hat\pi_{mk} \log \hat\pi_{mk},$$

with equality when $c_{mk} = \hat\pi_{mk}$. Hence, for any choice of regions, the negative log-likelihood is minimized by setting $c_{mk} = \hat\pi_{mk}$. It then remains to choose the regions $R_m$, which should be done so as to minimize

$$\sum_{m=1}^{M} n_m Q_m(T), \qquad \text{where} \quad Q_m(T) = -\sum_{k=1}^{K} \hat\pi_{mk} \log \hat\pi_{mk}. \qquad (4.8)$$

^1 We use the so-called log sum inequality and the two constraints $\sum_{k=1}^{K} c_{mk} = 1$ and $\sum_{k=1}^{K} \hat\pi_{mk} = 1$ for all $m = 1, \dots, M$.

The quantity $Q_m(T)$ is known as the entropy of region $m$.^2

Finding the tree that minimizes (4.8) is a combinatorial problem and computationally infeasible. Instead, we use a greedy algorithm known as recursive binary splitting: rather than optimizing the whole tree at once, we optimize one split at a time. We start at the top of the tree and split the input space successively, each split creating two new branches, until a tree like the one in the example above has been grown. The approach is greedy because each split is chosen to be optimal on its own, without considering the complete tree that may follow.

Consider the setting where we are about to make the first split, dividing the input space into a pair of half-spaces,

$$R_1(j, s) = \{x \mid x_j < s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x_j \ge s\}.$$

The split is characterized by the index $j$ of the input variable at which the split is performed, and the cutpoint $s$. The corresponding proportions $\hat\pi_{mk}$ also depend on $j$ and $s$:

$$\hat\pi_{1k}(j, s) = \frac{1}{n_1} \sum_{i:\, x_i \in R_1(j, s)} \mathbb{I}\{y_i = k\}, \qquad \hat\pi_{2k}(j, s) = \frac{1}{n_2} \sum_{i:\, x_i \in R_2(j, s)} \mathbb{I}\{y_i = k\}.$$

We seek the splitting variable $j$ and cutpoint $s$ that solve

$$\min_{j, s}\; \left[ n_1 Q_1(T) + n_2 Q_2(T) \right].$$

Since each input variable only takes a finite number of values in the training data, we can evaluate all distinct splits and pick the pair $(j, s)$ that minimizes the criterion. The procedure is then repeated for each of the two newly created branches, and so on, until some stopping criterion is met — for example, that no region contains more than five training data points.

The tree in Example 4.2 was in fact grown in this way, as we illustrate in the example below.

Example 4.3: Learning a classification tree (continuation of Example 4.2)

We consider the same setup as in Example 4.2 with the following dataset:

    x1     x2     y
    9.0    2.0    Blue
    1.0    4.0    Blue
    4.0    6.0    Blue
    4.0    1.0    Blue
    1.0    2.0    Blue
    1.0    8.0    Red
    6.0    4.0    Red
    7.0    9.0    Red
    9.0    8.0    Red

We want to learn a classification tree by using the entropy criterion in (4.8), growing the tree until no region contains more than five data points.

In principle there are infinitely many possible splits, but all splits that give the same partition of the data points are equivalent. In practice, we therefore only have to consider a finite number of distinct splits — nine in this case.

^2 If some $\hat\pi_{mk} = 0$, we interpret $0 \log 0$ as 0, consistent with the limit $\lim_{r \to 0^+} r \log r = 0$.

The distinct splits are illustrated by the dashed lines in the figure above.

We go through all nine splits, starting with the split at $x_1 = 2.5$, which divides the input space into the two regions $R_1$ ($x_1 < 2.5$) and $R_2$ ($x_1 \ge 2.5$). Region $R_1$ contains two Blue data points and one Red, i.e., $n_1 = 3$ data points with class proportions $\hat\pi_{1\text{B}} = 2/3$ and $\hat\pi_{1\text{R}} = 1/3$. The entropy of this region is

$$Q_1(T) = -\tfrac{2}{3} \log \tfrac{2}{3} - \tfrac{1}{3} \log \tfrac{1}{3} = 0.64.$$

In region $R_2$ we have $n_2 = 7$ data points, with proportions $\hat\pi_{2\text{B}} = 3/7$ and $\hat\pi_{2\text{R}} = 4/7$. The entropy of this region is

$$Q_2(T) = -\tfrac{3}{7} \log \tfrac{3}{7} - \tfrac{4}{7} \log \tfrac{4}{7} = 0.68, \qquad (4.11)$$

and the total weighted entropy for this split becomes

$$n_1 Q_1(T) + n_2 Q_2(T) = 3 \cdot 0.64 + 7 \cdot 0.68 = 6.69. \qquad (4.12)$$
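As a sanity check, the weighted entropy (4.12) of a candidate split can be reproduced in a few lines of Python; natural logarithms are assumed (matching (4.11)–(4.12)), and the helper names are our own:

    import numpy as np

    def entropy(counts):
        """Entropy Q_m(T) of a region with the given class counts (0 log 0 := 0)."""
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]                    # drop zero proportions: 0 log 0 = 0
        return -np.sum(p * np.log(p))

    def weighted_split_entropy(counts1, counts2):
        """n1*Q1 + n2*Q2 for a split into two regions."""
        n1, n2 = sum(counts1), sum(counts2)
        return n1 * entropy(counts1) + n2 * entropy(counts2)

    # The split at x1 = 2.5: R1 has (2 Blue, 1 Red), R2 has (3 Blue, 4 Red).
    print(weighted_split_entropy([2, 1], [3, 4]))   # approximately 6.69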

We compute the cost for all other splits in the same manner, and summarize the results in a table.

From the table we can read off which splits attain the lowest total entropy; one of them is selected as the first split of the tree.

[...]

Since $e^{\alpha} - e^{-\alpha} > 0$ for $\alpha > 0$, minimizing this expression with respect to $\hat y^b$ is equivalent to minimizing the weighted sum of misclassifications. That is,

$$\hat y^b = \arg\min_{\hat y}\, \sum_{i=1}^{n} w_i^b\, \mathbb{I}\{y_i \ne \hat y(x_i)\}. \qquad (6.13)$$

In other words, the $b$th ensemble member should be trained by minimizing the weighted misclassification loss, where each data point $(x_i, y_i)$ is assigned a weight $w_i^b$. The weights depend on how the first $b - 1$ members of the ensemble perform: data points that were misclassified get higher weights, so that the new member focuses on correcting the mistakes made so far.

How (6.13) is solved in practice depends on the choice of base classifier, which determines the constraints on the function $\hat y^b$ — it could, for example, be required to be a shallow classification tree. Apart from the weights, (6.13) is an ordinary classification objective, which can be solved (at least approximately) by standard learning algorithms. Handling the weights is straightforward for most base classifiers: it simply amounts to weighting the individual terms of the loss function used during training.

Once the $b$th ensemble member $\hat y^b(x)$ has been learned by solving (6.13), it remains to compute its coefficient $\alpha_b$ by minimizing (6.12). Differentiating (6.12) with respect to $\alpha$ and setting the derivative to zero, and defining

$$E_{\text{train}}^b = \frac{\sum_{i=1}^{n} w_i^b\, \mathbb{I}\{y_i \ne \hat y^b(x_i)\}}{\sum_{j=1}^{n} w_j^b} \qquad (6.15)$$

to be the weighted misclassification error of the $b$th classifier, we can express the optimal value of its coefficient as

$$\alpha_b = \frac{1}{2} \log \frac{1 - E_{\text{train}}^b}{E_{\text{train}}^b}. \qquad (6.16)$$

This completes the derivation of the AdaBoost algorithm, summarized in Algorithm 6. The weights are computed recursively, as in (6.8) (line 2(b) of the algorithm). We have also added an explicit weight normalization on line 2(b)iv for practical convenience; this does not affect the derivation of the method.

Remark 6.2: One detail worth commenting on is that the derivation of AdaBoost assumes that all coefficients $\{\alpha_b\}_{b=1}^{B}$ are positive. To see that this is indeed the case when the coefficients are computed according to (6.16), note that the function $\log((1 - x)/x)$ is positive for any $0 < x < 0.5$. Hence $\alpha_b > 0$ as long as $E_{\text{train}}^b < 0.5$, i.e., as long as each base classifier is at least slightly better than random guessing. This is a very mild condition: if a classifier had weighted error $E_{\text{train}}^b > 0.5$, we could simply flip the sign of all predictions made by $\hat y^b(x)$ to reduce the error!

Algorithm 6: AdaBoost

1. Assign weights $w_i^1 = 1/n$ to all data points.
2. For $b = 1, \dots, B$:
   (a) Train a weak classifier $\hat y^b(x)$ on the weighted training data $\{(x_i, y_i, w_i^b)\}_{i=1}^{n}$.
   (b) Update the weights $\{w_i^{b+1}\}_{i=1}^{n}$ from $\{w_i^b\}_{i=1}^{n}$:
       i. Compute $E_{\text{train}}^b = \sum_{i=1}^{n} w_i^b\, \mathbb{I}\{y_i \ne \hat y^b(x_i)\}$.
       ii. Compute $\alpha_b = 0.5 \log((1 - E_{\text{train}}^b)/E_{\text{train}}^b)$.
       iii. Compute $w_i^{b+1} = w_i^b \exp(-\alpha_b\, y_i\, \hat y^b(x_i))$, $i = 1, \dots, n$.
       iv. Normalize: set $w_i^{b+1} \leftarrow w_i^{b+1} / \sum_{j=1}^{n} w_j^{b+1}$, for $i = 1, \dots, n$.
3. Output the boosted classifier $\hat y_{\text{boost}}(x) = \operatorname{sign}\left(\sum_{b=1}^{B} \alpha_b\, \hat y^b(x)\right)$.
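A compact Python sketch of Algorithm 6, using decision stumps from scikit-learn as weak classifiers (one possible choice — any classifier supporting sample weights would do); the small constant eps guarding against division by zero is our own addition:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, B):
        """Train an AdaBoost ensemble; y must take values in {-1, +1}."""
        n = len(y)
        w = np.full(n, 1 / n)                      # step 1: uniform weights
        stumps, alphas, eps = [], [], 1e-10
        for _ in range(B):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)       # step 2(a): weighted training
            pred = stump.predict(X)
            E = np.sum(w * (pred != y))            # step 2(b)i: weighted error
            alpha = 0.5 * np.log((1 - E + eps) / (E + eps))   # step 2(b)ii
            w = w * np.exp(-alpha * y * pred)      # step 2(b)iii
            w = w / w.sum()                        # step 2(b)iv: normalize
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        """Step 3: sign of the weighted vote."""
        votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(votes)

With normalized weights, the sum on line 2(b)i directly equals the weighted misclassification error $E_{\text{train}}^b$ in (6.15).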

In the derivation above we assumed that each base classifier returns a discrete class prediction, $\hat y^b(x) \in \{-1, 1\}$. Many classification models, however, are constructed around an estimate of the conditional class probabilities $p(y \mid x)$ (see Section 3.4.1). This opens up for an alternative approach where each base model instead outputs a real number, and these values are used as the basis for the vote. Such a generalization of Algorithm 6, called Real AdaBoost, is derived by Friedman, Hastie, and Tibshirani (2000).

Boosting vs. bagging: base models and ensemble size

When using AdaBoost, or any other boosting algorithm, two important design choices are which base classifier to use and how many iterations $B$ to run. In principle any classification method can act as base classifier, but by far the most common choice is a shallow classification tree — often a decision stump, i.e., a tree of depth one. The motivation is that boosting can learn a good model even from very weak base classifiers, and shallow trees are fast to train; using deep trees as base classifiers can in fact hurt performance. The tree depth should be chosen to give the desired amount of interaction between input variables: a tree with $M$ terminal nodes can model functions depending on at most $M - 1$ of the input variables.

This is in sharp contrast to bagging, where deep trees are typically preferred. The reason is that bagging only reduces variance, by averaging, and does nothing about the bias; it therefore requires base models with low bias, such as deep trees, even if they come with high variance. Boosting, on the other hand, reduces both variance and bias, and can therefore make do with much simpler base models.

Another difference is that bagging is parallel while boosting is sequential. In boosting, each new base model is trained to correct the mistakes made by the ensemble so far; in bagging, the base models are identically distributed and all attack the same problem, and the ensemble is a plain average of them. A consequence is that bagging does not overfit as the number of ensemble members grows, whereas boosting becomes more and more flexible as $B$ increases and can eventually overfit.

Figure 6.3: Classification loss functions as functions of the margin: exponential loss, hinge loss, binomial deviance, Huber-like loss, and misclassification loss.

If so, some base models are effectively fitted to noise and performance can degrade. In practice, however, this type of overfitting tends to develop slowly, and the performance is often not very sensitive to the exact choice of $B$. Still, $B$ should be chosen in some systematic way, for example by early stopping — a technique also widely used when training neural networks.

Robust loss functions and gradient boosting

The margin $y \cdot C(x)$ is a useful measure of classification error: it is negative for misclassified points and positive for correctly classified ones. It is therefore natural to use a loss function that is decreasing in the margin, penalizing negative margins more heavily than positive ones. The exponential loss used in the derivation of AdaBoost has this property, as illustrated in Figure 6.2. However, it penalizes negative margins very aggressively — exponentially — which can be problematic in practice: the resulting classifier becomes sensitive to noise and outliers, such as mislabeled or atypical data points.

To address this drawback of the exponential loss, we may consider more robust loss functions for classification; a few alternatives are shown in Figure 6.3. A detailed discussion is beyond our scope (see Hastie, Tibshirani, and Friedman 2009, Chapter 10.6), but note that all the alternative loss functions in the figure have less aggressive penalties — gentler slopes — for large negative margins, which makes them more robust to noisy data.^5

The exponential loss was used in the derivation of AdaBoost primarily for computational convenience: it admits a closed-form solution to the optimization problem at each iteration. With a more robust loss function, this analytical tractability is lost.

Instead, the optimization over the base classifier $\hat y^b(x)$ can be attacked with numerical optimization techniques. This is somewhat intricate, since the 'optimization variable' is a function, but it turns out that an approximate solution can be obtained for a wide range of loss functions using an approach inspired by gradient descent. The resulting family of methods is known as gradient boosting (Friedman 2001; Mason et al. 1999).

^5 Hinge loss, binomial deviance, and the Huber-like loss all increase linearly for large negative margins. Exponential loss, of course, increases exponentially.

Algorithm 7 gives pseudo-code for one version of gradient boosting. The key step is that each base model is fitted to the negative gradient of the loss function.

This retains the interpretation of boosting as a procedure where each base model tries to correct the mistakes made by the ensemble so far: the negative gradient of the loss function points in the direction in which the ensemble's predictions should be adjusted to decrease the loss.

Algorithm 7: Gradient boosting

1. Initialize (as a constant) $C^0(x) \equiv \arg\min_c \sum_{i=1}^{n} L(y_i, c)$.
2. For $b = 1, \dots, B$:
   (a) Compute the negative gradient of the loss function,
       $$g_i^b = -\left.\frac{\partial L(y_i, c)}{\partial c}\right|_{c = C^{b-1}(x_i)}, \qquad i = 1, \dots, n.$$
   (b) Train a base regression model $\hat f^b(x)$ to fit the gradient values,
       $$\hat f^b = \arg\min_f \sum_{i=1}^{n} \left(g_i^b - f(x_i)\right)^2.$$
   (c) Update the ensemble, $C^b(x) = C^{b-1}(x) + \gamma \hat f^b(x)$.
3. Output the classifier $\hat y(x) = \operatorname{sign}(C^B(x))$.

We described gradient boosting in the classification setting, but it can be used for regression as well with minor modifications. Interestingly, the base models $\hat f^b(x)$ are obtained by solving a regression problem even when the final ensemble is a classifier: the negative gradient values $\{g_i^b\}_{i=1}^{n}$ are quantitative, even though the outputs $\{y_i\}_{i=1}^{n}$ are qualitative. In Algorithm 7, the base model is fitted to the negative gradient values by minimizing a square loss criterion.
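A minimal sketch of Algorithm 7 in Python, with the binomial deviance as loss function and shallow regression trees from scikit-learn as base models. For simplicity, the initialization is set to zero and the step size gamma is fixed rather than found by line search — both simplifications of ours:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_fit(X, y, B, gamma=0.1):
        """Gradient boosting with L(y,c) = log(1 + exp(-y*c)); y in {-1, +1}."""
        C = np.zeros(len(y))                  # step 1 simplified: C^0(x) = 0
        trees = []
        for _ in range(B):
            g = y / (1 + np.exp(y * C))       # step 2(a): negative gradient of the deviance
            tree = DecisionTreeRegressor(max_depth=2)
            tree.fit(X, g)                    # step 2(b): regression fit to the gradients
            trees.append(tree)
            C = C + gamma * tree.predict(X)   # step 2(c): update with fixed step size
        return trees

    def gradient_boost_predict(trees, X, gamma=0.1):
        """Step 3: sign of the accumulated ensemble."""
        C = sum(gamma * t.predict(X) for t in trees)
        return np.sign(C)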

The value $\gamma$ in the algorithm is a tuning parameter similar to the step size in ordinary gradient descent. It is typically chosen by line search, often in combination with regularization via shrinkage (Friedman 2001). When trees are used as base models — the common choice — the step size optimization can be merged with the selection of the terminal node values, giving a more efficient implementation (Friedman 2001).

As mentioned above, gradient boosting requires a certain amount of smoothness in the loss function.

A minimal requirement is that the loss function is differentiable almost everywhere, so that the gradient can be computed. Some versions of gradient boosting require stronger conditions, such as second-order differentiability. The binomial deviance is, in this respect, a 'safe choice': it is infinitely differentiable and convex, and it also has appealing statistical properties. Consequently, the binomial deviance is probably one of the most commonly used loss functions for gradient boosting in practice.

The classification loss functions illustrated in Figure 6.3 are:

Exponential loss: $L(y, c) = \exp(-yc)$.
Hinge loss: $L(y, c) = \max(0,\, 1 - yc)$.
Binomial deviance: $L(y, c) = \log(1 + \exp(-yc))$.
Huber-like loss: $L(y, c) = -4yc$ if $yc < -1$, and $\max(0,\, 1 - yc)^2$ otherwise.
Misclassification loss: $L(y, c) = \mathbb{I}\{yc < 0\}$.

7 Neural networks and deep learning

Neural networks can be used both for regression and classification, and can be seen as extensions of linear regression and logistic regression, respectively. Neural networks with hidden layers have been implemented and analyzed for a long time, with notable success stories in the 1980s and early 1990s.

Since the 2000s, deep learning — the use of deep neural networks with many hidden layers — has transformed machine learning. Enabled by new software and hardware, parallel training algorithms, and ever larger amounts of training data, deep learning has had remarkable success in applications such as image classification, speech recognition and language translation, with new developments and analyses appearing at a high pace.

In Section 7.1 we extend linear regression first to a two-layer neural network (a network with one hidden layer), and then to a deep neural network. In Section 7.2 we change from regression to the classification setting. Section 7.3 presents a special neural network tailored to images, and Section 7.4 discusses how neural networks are trained.

Neural networks for regression

Generalized linear regression

The linear regression model,

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p,$$

is illustrated graphically in Figure 7.1a: each input variable $x_j$ is a node, each parameter $\beta_j$ a link, and the output $z$ is the sum of all terms $\beta_j x_j$. The offset term $\beta_0$ is encoded by a constant input equal to 1.

To describe nonlinear relationships between the input vector $x = [x_1\ x_2\ \dots\ x_p]^\mathsf{T}$ and the output $z$, we introduce a nonlinear scalar function called the activation function, $\sigma: \mathbb{R} \to \mathbb{R}$. The linear regression model is modified into a generalized linear regression model, where the linear combination of the inputs is transformed by the activation function,

$$z = \sigma\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right).$$

This extension to the generalized linear regression model is visualized in Figure 7.1b.

Figure 7.1: Graphical illustration of a linear regression model (Figure 7.1a) and a generalized linear regression model (Figure 7.1b). In Figure 7.1a, the output $z$ is the sum of the offset $\beta_0$ and all terms $\beta_j x_j$, $j = 1, \dots, p$, as in (7.3).

In Figure 7.1b, the circle denotes addition and also transformation through the activation function σ, as in (7.4).

Figure 7.2: Two common activation functions used in neural networks The logistic (or sigmoid) function (Figure 7.2a), and the rectified linear unit (Figure 7.2b).

Two common activation functions are the logistic (or sigmoid) function and the rectified linear unit (ReLU), shown in Figures 7.2a and 7.2b:

$$\text{logistic:}\quad \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \text{ReLU:}\quad \sigma(x) = \max(0, x).$$

The logistic function is familiar from logistic regression; it is roughly affine close to $x = 0$ and saturates at 0 and 1 as $x$ decreases or increases. The ReLU is even simpler: it is the identity function for positive inputs and zero for negative inputs.
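In code, the two activation functions are one-liners (a NumPy sketch):

    import numpy as np

    def logistic(x):
        """Logistic (sigmoid) activation, saturating at 0 and 1."""
        return 1 / (1 + np.exp(-x))

    def relu(x):
        """Rectified linear unit: identity for positive inputs, zero otherwise."""
        return np.maximum(0, x)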

The logistic function was long the standard choice of activation function in neural networks, but in recent years the ReLU has become the default in most neural network models, thanks to its simplicity and good practical performance.

The generalized linear regression model (7.4) is still too simple to describe complicated relationships between inputs and outputs. To increase its flexibility, we first use several generalized linear regression models in parallel to build a layer — giving a two-layer neural network — and then stack such layers sequentially to obtain a deep neural network, the model behind so-called deep learning.

Two-layer neural network

In the model (7.4), the output is constructed from one single scalar regression model. To make it more flexible, we instead let the output be built from $M$ generalized linear regression models, each with its own set of parameters. The parameters of the $i$th model are $\beta_{0i}, \beta_{1i}, \dots, \beta_{pi}$, and its output is denoted $h_i$:

$$h_i = \sigma\left(\beta_{0i} + \beta_{1i} x_1 + \beta_{2i} x_2 + \dots + \beta_{pi} x_p\right), \qquad i = 1, \dots, M. \qquad (7.5)$$

These intermediate outputs $h_i$ are so-called hidden units, since they are not the output of the whole model.

The $M$ different hidden units $\{h_i\}_{i=1}^{M}$ instead act as input variables to an additional linear regression model,

$$z = \beta_0 + \beta_1 h_1 + \beta_2 h_2 + \dots + \beta_M h_M. \qquad (7.6)$$

To distinguish the parameters in (7.5) and (7.6), we add the superscripts (1) and (2), respectively. The equations describing this two-layer neural network (or equivalently, neural network with one layer of hidden units) are thus

$$h_1 = \sigma\left(\beta_{01}^{(1)} + \beta_{11}^{(1)} x_1 + \beta_{21}^{(1)} x_2 + \dots + \beta_{p1}^{(1)} x_p\right),$$
$$\vdots$$
$$h_M = \sigma\left(\beta_{0M}^{(1)} + \beta_{1M}^{(1)} x_1 + \beta_{2M}^{(1)} x_2 + \dots + \beta_{pM}^{(1)} x_p\right),$$
$$z = \beta_0^{(2)} + \beta_1^{(2)} h_1 + \beta_2^{(2)} h_2 + \dots + \beta_M^{(2)} h_M. \qquad (7.7)$$

Figure 7.3: A two-layer neural network, or equivalently, a neural network with one intermediate layer of hidden units.

This model can be drawn as a graph with two layers of links, as in Figure 7.3, where each link carries its own parameter. Note the offset (constant-one) terms feeding both the input layer and the hidden layer.

Matrix notation

The two-layer neural network model can be written more compactly using matrix notation, gathering the parameters of each layer in a weight matrix $W$ and an offset vector $b$:

$$W^{(1)} = \begin{bmatrix} \beta_{11}^{(1)} & \cdots & \beta_{1M}^{(1)} \\ \vdots & & \vdots \\ \beta_{p1}^{(1)} & \cdots & \beta_{pM}^{(1)} \end{bmatrix}, \quad b^{(1)} = \begin{bmatrix} \beta_{01}^{(1)} & \cdots & \beta_{0M}^{(1)} \end{bmatrix}, \quad W^{(2)} = \begin{bmatrix} \beta_1^{(2)} \\ \vdots \\ \beta_M^{(2)} \end{bmatrix}, \quad b^{(2)} = \begin{bmatrix} \beta_0^{(2)} \end{bmatrix}. \qquad (7.8)$$

The full model can then be written as

$$h = \sigma\left(W^{(1)\mathsf{T}} x + b^{(1)\mathsf{T}}\right), \qquad (7.9a)$$
$$z = W^{(2)\mathsf{T}} h + b^{(2)\mathsf{T}}, \qquad (7.9b)$$

where we have also stacked the components of $x$ and $h$ as $x = [x_1, \dots, x_p]^\mathsf{T}$ and $h = [h_1, \dots, h_M]^\mathsf{T}$, and the activation function $\sigma$ acts element-wise. The two weight matrices and the two offset vectors constitute the parameters of the model. In the neural network literature, the offset vector is usually called the bias vector; since it is merely a model parameter, and not a bias in the statistical sense, we use the term offset to avoid confusion. We collect all model parameters in the vector

$$\theta = \begin{bmatrix} \operatorname{vec}(W^{(1)})^\mathsf{T} & \operatorname{vec}(W^{(2)})^\mathsf{T} & b^{(1)} & b^{(2)} \end{bmatrix}^\mathsf{T}.$$

What we have described is a nonlinear regression model, $y = f(x; \theta) + \epsilon$, where $f(x; \theta) = z$ as given by (7.9). Note that the predicted output $z$ in (7.9b) depends on all parameters in $\theta$, even though this is not explicit in the notation.
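A direct NumPy transcription of the forward pass (7.9) could look as follows (two_layer_forward is our own helper name; we use the logistic activation):

    import numpy as np

    def logistic(x):
        return 1 / (1 + np.exp(-x))

    def two_layer_forward(x, W1, b1, W2, b2):
        """Evaluate z = f(x; theta) for a single input x, as in (7.9).
        Shapes: W1 (p, M), b1 (M,), W2 (M, 1), b2 (1,)."""
        h = logistic(W1.T @ x + b1)   # (7.9a): hidden units; sigma acts element-wise
        return W2.T @ h + b2          # (7.9b): output z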

Deep neural network

A two-layer neural network is a useful model in its own right, and much research and analysis has been devoted to it. The real power of neural networks, however, is unleashed when several layers of generalized linear regression models are stacked on top of each other, forming a deep neural network. Deep networks can describe very complicated relationships — such as the one between an image and its class — and are one of the most successful machine learning methods today.

We index the layers by $l$, so that each layer has its own weight matrix $W^{(l)}$ and offset vector $b^{(l)}$: layer $l = 1$ has $W^{(1)}$ and $b^{(1)}$, layer $l = 2$ has $W^{(2)}$ and $b^{(2)}$, and so on. The network now has several layers of hidden units $h^{(1)}, h^{(2)}, \dots$, where layer $l$ contains $M_l$ hidden units $h^{(l)} = [h_1^{(l)}\ \dots\ h_{M_l}^{(l)}]^\mathsf{T}$; the numbers $M_1, M_2, \dots$ may differ between layers.

Each layer maps a hidden layer $h^{(l-1)}$ to the next hidden layer $h^{(l)}$ as

$$h^{(l)} = \sigma\left(W^{(l)\mathsf{T}} h^{(l-1)} + b^{(l)\mathsf{T}}\right). \qquad (7.11)$$

The layers are thus stacked: the output of the first hidden layer is the input to the second, the output of the second is the input to the third, and so on. By stacking many such layers we obtain a deep neural network.

A deep neural network of $L$ layers can mathematically be described as (cf. (7.9))

$$h^{(1)} = \sigma\left(W^{(1)\mathsf{T}} x + b^{(1)\mathsf{T}}\right),$$
$$h^{(2)} = \sigma\left(W^{(2)\mathsf{T}} h^{(1)} + b^{(2)\mathsf{T}}\right),$$
$$\vdots$$
$$z = W^{(L)\mathsf{T}} h^{(L-1)} + b^{(L)\mathsf{T}}. \qquad (7.12)$$

A graphical representation of this model is given in Figure 7.4.

The dimensions of the weight matrices and offset vectors are as follows: in the first layer, $W^{(1)}$ is of size $p \times M_1$ and $b^{(1)}$ of size $1 \times M_1$; in the intermediate layers $l = 2, \dots, L - 1$, $W^{(l)}$ is of size $M_{l-1} \times M_l$ and $b^{(l)}$ of size $1 \times M_l$; and in the last layer (for a multi-dimensional output), $W^{(L)}$ is of size $M_{L-1} \times K$ and $b^{(L)}$ of size $1 \times K$. The number of inputs $p$ and outputs $K$ is given by the problem, but the number of layers $L$ and the dimensions $M_1, M_2, \dots$ are user design choices that determine the flexibility of the model.

Learning the network from data

Like the parametric models discussed earlier (linear and logistic regression), a neural network must be trained: we have to learn all the model parameters

$$\theta = \begin{bmatrix} \operatorname{vec}(W^{(1)})^\mathsf{T} & \cdots & \operatorname{vec}(W^{(L)})^\mathsf{T} & b^{(1)} & \cdots & b^{(L)} \end{bmatrix}^\mathsf{T}$$

from training data.


Figure 7.4: A deep neural network with L layers. Each layer is parameterized by $W^{(l)}$ and $b^{(l)}$.

A deep neural network that is both wide and deep can have millions of parameters, which gives it great flexibility — but also calls for mechanisms to avoid overfitting. Standard regularization, as in ridge regression, is commonly used, along with several deep-learning-specific techniques. The many parameters also mean that substantial computational resources are needed to train the model.

In the regression setting, the training data $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{n}$ consists of $n$ samples of the input $x$ and the output $y$. As before, we use maximum likelihood and assume additive Gaussian noise, $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$, which yields the square error loss (cf. Section 2.3.1):

$$\hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i; \theta)\right)^2. \qquad (7.14)$$

This problem can be solved with numerical optimization, more precisely stochastic gradient descent, which is described in more detail in Section 7.4.

Given the model parameters $\theta$ and the inputs $\{x_i\}_{i=1}^{n}$, the predicted outputs $\{z_i\}_{i=1}^{n}$ are computed as $z_i = f(x_i; \theta)$. For the two-layer neural network in Section 7.1.2, this means

$$h_i^\mathsf{T} = \sigma\left(x_i^\mathsf{T} W^{(1)} + b^{(1)}\right), \qquad z_i^\mathsf{T} = h_i^\mathsf{T} W^{(2)} + b^{(2)}. \qquad (7.15)$$

Note that (7.15) is a transposed version of the model as stated in (7.9). This form makes it easy to extend the computation to several data points at once: as in (2.4), we stack all data points in matrices, with one row per data point,

$$\mathbf{X} = \begin{bmatrix} x_1^\mathsf{T} \\ \vdots \\ x_n^\mathsf{T} \end{bmatrix}, \qquad \mathbf{H} = \begin{bmatrix} h_1^\mathsf{T} \\ \vdots \\ h_n^\mathsf{T} \end{bmatrix}, \qquad \mathbf{Z} = \begin{bmatrix} z_1^\mathsf{T} \\ \vdots \\ z_n^\mathsf{T} \end{bmatrix}. \qquad (7.16)$$

We can then write (7.15) as

$$\mathbf{H} = \sigma\left(\mathbf{X} W^{(1)} + b^{(1)}\right), \qquad \mathbf{Z} = \mathbf{H} W^{(2)} + b^{(2)}, \qquad (7.17)$$

where the predicted outputs and the hidden units for all $n$ data points are collected in the matrices $\mathbf{Z}$ and $\mathbf{H}$. This matrix formulation is also how the model is conveniently implemented in code; in the laboratory work of the course, TensorFlow is used for the implementation.


Figure 7.5: A deep neural network with L layers for classification The only difference to regression (Figure 7.4) is the softmax transformation after layer L.

Note that in (7.17) the offset vectors $b^{(1)}$ and $b^{(2)}$ are added and broadcasted to each row. See the instructions for the laboratory work for more details regarding the implementation of a neural network.
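In NumPy, the broadcasting in (7.17) happens automatically when the offset vectors are stored as 1-D arrays; a sketch under that assumption:

    import numpy as np

    def forward_batch(X, W1, b1, W2, b2):
        """Forward pass (7.17) for all n data points at once.
        Shapes: X (n, p), W1 (p, M), b1 (M,), W2 (M, K), b2 (K,)."""
        H = 1 / (1 + np.exp(-(X @ W1 + b1)))  # logistic activation; b1 broadcast to each row
        Z = H @ W2 + b2                        # b2 broadcast to each row
        return Z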

Neural networks for classification

Learning classification networks from data

The training data consists of $n$ samples of inputs and outputs, $\{(x_i, y_i)\}_{i=1}^{n}$. For classification with $K$ classes, we one-hot encode the output: each $y_i$ is a $K$-dimensional vector

$$y_i = \begin{bmatrix} y_{i1} & \cdots & y_{iK} \end{bmatrix}^\mathsf{T}.$$

If a data point $i$ belongs to class $k$, then $y_{ik} = 1$ and $y_{ij} = 0$ for all $j \ne k$. See Section 3.2.3 for more about one-hot encoding.

For a neural network with a softmax output layer, the model parameters are typically trained by minimizing the negative log-likelihood, in this setting known as the cross-entropy loss,

$$L(x_i, y_i, \theta) = -\sum_{k=1}^{K} y_{ik} \log p(k \mid x_i; \theta).$$

Cross-entropy is small when the predicted probability $p(k \mid x_i; \theta)$ is close to 1 for the true class $k$ (the class with $y_{ik} = 1$). For example, suppose data point $i$ belongs to class $k = 2$ out of $K = 3$ classes, i.e., $y_i = [0\ 1\ 0]^\mathsf{T}$. With parameters $\theta_A$ giving $p(1 \mid x_i; \theta_A) = 0.1$, $p(2 \mid x_i; \theta_A) = 0.8$ and $p(3 \mid x_i; \theta_A) = 0.1$, the cross-entropy is low: $L(x_i, y_i, \theta_A) = 0.22$. With parameters $\theta_B$ giving $p(1 \mid x_i; \theta_B) = 0.8$, $p(2 \mid x_i; \theta_B) = 0.1$ and $p(3 \mid x_i; \theta_B) = 0.1$, the cross-entropy is much higher: $L(x_i, y_i, \theta_B) = 2.30$. Hence $\theta_A$ is preferred over $\theta_B$.

Computing the loss via the logarithm of the softmax output can cause numerical problems when $p(k \mid x_i; \theta)$ is close to zero, since the logarithm then tends to negative infinity. This can be avoided by exploiting the fact that the logarithm in the cross-entropy loss cancels the exponential in the softmax function:

$$L(x_i, y_i, \theta) = -\sum_{k=1}^{K} y_{ik} \left( z_{ik} - \log \left\{ \sum_{j=1}^{K} e^{z_{ij}} \right\} \right)$$
$$= -\sum_{k=1}^{K} y_{ik} \left( z_{ik} - \max_j z_{ij} - \log \left\{ \sum_{j=1}^{K} e^{z_{ij} - \max_j z_{ij}} \right\} \right), \qquad (7.22)$$

where $z_{ik}$ are the logits. In the last expression, all exponents are at most zero, which avoids numerical overflow.
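A NumPy sketch of (7.22), computing the cross-entropy directly from the logits using the max-subtraction trick (cross_entropy_from_logits is our own helper name):

    import numpy as np

    def cross_entropy_from_logits(z, y_onehot):
        """Numerically stable cross-entropy (7.22); z and y_onehot have shape (K,)."""
        z_shift = z - np.max(z)                       # all exponents now <= 0
        log_softmax = z_shift - np.log(np.sum(np.exp(z_shift)))
        return -np.sum(y_onehot * log_softmax)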

Convolutional neural networks

Data representation of an image

A digital grayscale image is a matrix of pixels, where each pixel value ranges from 0 (black) to 1 (white) through all shades of gray. A 6 × 6-pixel image, for instance, is a 6 × 6 matrix; in an image classification problem, each pixel is an input variable $x_{j,k}$, where the indices $j$ and $k$ give the pixel's position in the matrix.

We could vectorize the pixels into one long vector and use a dense network as before, but this throws away structural information: it ignores the fact that neighboring pixels are related. A convolutional neural network (CNN) instead preserves the spatial structure by organizing both the input variables and the hidden layers as matrices. The core building block of a CNN is the convolutional layer.

The convolutional layer

Following the input layer, we use a hidden layer with as many hidden units as input variables: for a 6 × 6-pixel image, 36 hidden units, arranged in a 6 × 6 matrix just like the inputs. In the dense layers of the previous sections, each input variable would be connected to every hidden unit, each link with its own unique parameter $\beta_{jk}$.

Such a dense layer would require a very large number of parameters, and with limited training data the network could fail to capture the relevant patterns and generalize poorly to unseen data. The convolutional layer instead exploits the structure present in images to obtain a much leaner parameterization. It differs from the dense layer in two ways: sparse interactions and parameter sharing.

Sparse interactions: most of the parameters of the corresponding dense layer are, in effect, forced to zero. Each hidden unit depends only on a small region of the image — here a 3 × 3 region — rather than on all pixels. The position of the region follows the position of the hidden unit in the matrix topology: moving one step to the right among the hidden units moves the region one pixel to the right in the image. For hidden units at the border of the image, part of the region falls outside the image; the missing pixels are then treated as zeros, a trick called zero-padding, which also keeps the image size intact in the next layer.

Parameter sharing: in a dense layer each connection has its own parameter, whereas in a convolutional layer the same small set of parameters, called a kernel, is reused at every location in the image. Sharing the kernel across positions allows the network to detect a specific feature, such as an edge or a corner, regardless of where it appears in the image. It also drastically reduces the number of parameters: the convolutional layer above requires only 10 parameters (a 3×3 kernel plus an offset), whereas the corresponding dense layer would need 1 332 parameters (36×36 weights plus 36 offsets). Consequently, for a given parameter budget, convolutional layers can capture more properties of an image, which makes them efficient and effective for image processing tasks.
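To make sparse interactions, parameter sharing, and zero-padding concrete, here is a minimal numpy sketch (our own illustration, not code from the lecture notes) of a convolutional layer with a 3×3 kernel; note that the same 10 parameters are reused at every pixel position:

import numpy as np

def conv_layer(X, W, b, stride=1):
    """Convolutional layer with zero-padding.

    X : (n, n) input image
    W : (3, 3) kernel, shared across all positions (parameter sharing)
    b : scalar offset
    stride : apply the kernel every `stride` pixels (see the next section)
    """
    n = X.shape[0]
    pad = W.shape[0] // 2
    Xp = np.pad(X, pad)                          # zero-padding at the borders
    rows = range(0, n, stride)
    H = np.empty((len(rows), len(rows)))
    for i_out, i in enumerate(rows):
        for j_out, j in enumerate(rows):
            patch = Xp[i:i + 3, j:j + 3]         # the 3x3 region this unit sees
            H[i_out, j_out] = np.sum(W * patch) + b   # sparse interaction
    return np.maximum(H, 0)                      # ReLU activation

X = np.random.rand(6, 6)                # a 6x6 grayscale image
W = np.random.randn(3, 3) * 0.1         # 9 kernel parameters
b = 0.0                                 # plus 1 offset: 10 parameters in total
print(conv_layer(X, W, b).shape)        # (6, 6)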

Condensing information with strides

In the convolutional layer above, the number of hidden units matches the number of pixels in the input image. As more layers are added, we typically want to condense the information by reducing the number of hidden units per layer. One way of achieving this is to apply the kernel only to every few pixels rather than to every pixel in the image.

In a convolutional layer with stride [2,2] and a 3×3 kernel, the kernel is applied to every second pixel both horizontally and vertically. The layer then has half as many rows and columns of hidden units as the input; applied to a 6×6 image this produces 3×3 hidden units, as illustrated in Figure 7.10.

The stride thus determines by how many pixels the kernel shifts at each step: a stride of [1,1] means the kernel moves one pixel at a time vertically and horizontally, while a stride of [2,2] means it moves two pixels in each direction. Both configurations require the same number of parameters. Information can also be condensed after a convolutional layer by pooling; for details on pooling, see Goodfellow, Bengio, and Courville (2016).
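Reusing the conv_layer sketch from above on an assumed 6×6 input, changing the stride changes only how often the kernel is applied, not the number of parameters:

print(conv_layer(X, W, b, stride=1).shape)   # (6, 6): kernel applied at every pixel
print(conv_layer(X, W, b, stride=2).shape)   # (3, 3): applied every second pixel
# The same 10 parameters (W, b) are used in both cases; only the output size differs.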

Multiple channels

Although the networks in Figures 7.8 and 7.10 have only 10 parameters each, a single kernel is usually not sufficient to capture all the interesting properties of the images in a dataset. To address this, multiple kernels are added, each with its own set of kernel parameters. Each kernel produces its own set of hidden units, called a channel, via the same convolution operation. The hidden layers of a CNN are therefore organized into tensors with dimensions (rows × columns × channels).

In Figure 7.11, the first layer of hidden units has four channels, and that hidden layer consequently has dimension 6×6×4.

As we add more convolutional layers, each kernel integrates information from all channels of the preceding layer, as illustrated in the second convolutional layer of Figure 7.11.

Each kernel in such a layer is a tensor with dimensions (kernel rows × kernel columns × input channels); the kernels of the second convolutional layer in Figure 7.11, for instance, have size 3×3×4. Collecting all kernel parameters into a weight tensor W gives dimensions (kernel rows × kernel columns × input channels × output channels); the weight tensor W(2) of this layer thus has size 3×3×4×6. With multiple kernels, each can specialize in detecting a different feature, such as edges, lines, or circles, giving a richer representation of the training data.
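The shapes are easy to check in code. The following sketch (dimensions taken from Figure 7.11, variable names our own) computes the output of the second convolutional layer at one spatial position and counts the parameters:

import numpy as np

H1 = np.random.rand(6, 6, 4)       # first hidden layer: rows x cols x channels
W2 = np.random.randn(3, 3, 4, 6)   # kernels: rows x cols x in-channels x out-channels
b2 = np.zeros(6)                   # one offset per output channel

patch = H1[0:3, 0:3, :]            # the 3x3x4 region seen by one hidden unit
z = np.einsum('ijc,ijco->o', patch, W2) + b2   # sum over rows, cols, in-channels
print(z.shape)                     # (6,): one value per output channel
print(W2.size + b2.size)           # 222 parameters in this layer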

Full CNN architecture

A full CNN architecture consists of several convolutional layers, where the number of rows and columns of the hidden layers typically decreases while the number of channels increases, allowing the network to capture increasingly high-level features. After the convolutional layers, one or more dense layers usually follow, and for image classification a softmax layer is placed at the end so that the outputs lie in the interval [0,1] and sum to one. The loss function used when training a CNN is the same as for the regression and classification networks earlier in this chapter, depending on the problem at hand. A complete CNN architecture is illustrated in Figure 7.11.
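As an illustration only (the lecture notes do not prescribe a framework), an architecture of this kind could be written in Keras roughly as follows; the layer sizes are assumptions chosen to mimic Figure 7.11:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(6, 6, 1)),                       # 6x6 grayscale image
    tf.keras.layers.Conv2D(4, kernel_size=3, padding='same',
                           activation='relu'),             # 6x6x4
    tf.keras.layers.Conv2D(6, kernel_size=3, strides=2, padding='same',
                           activation='relu'),             # 3x3x6: stride condenses
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),          # dense layer
    tf.keras.layers.Dense(3, activation='softmax'),        # K = 3 classes
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')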

Training a neural network

Initialization

The optimization problems encountered earlier, such as LASSO and logistic regression, are convex, so convergence to the global optimum is guaranteed regardless of the initial parameter values. In contrast, the cost functions arising when training neural networks are non-convex, which makes the training sensitive to the initial parameters. Typically, the parameters are initialized to small random values, so that different hidden units capture different aspects of the data. When ReLU activation functions are used, the offset terms are often initialized to small positive values, placing the units in the non-negative range of the ReLU.
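A minimal numpy sketch of such an initialization (the scale 0.01 and offset 0.1 are assumptions for illustration):

import numpy as np

rng = np.random.default_rng()

def init_layer(n_in, n_out, scale=0.01, relu=True):
    """Small random weights; small positive offsets when ReLU is used."""
    W = scale * rng.standard_normal((n_in, n_out))
    b = np.full(n_out, 0.1) if relu else np.zeros(n_out)
    return W, b

W1, b1 = init_layer(36, 16)   # e.g. a 6x6 input flattened into 36 variables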

Stochastic gradient descent

Deep learning problems often involve more than a million training data points, and the networks typically have more than a million parameters, which presents significant computational challenges. A crucial component is the computation of the gradient required in the optimization routine (7.24).

When the dataset is large, computing this gradient over all data points is computationally expensive. However, datasets often contain redundant information: the gradient computed from the first half of the dataset is nearly identical to that computed from the second half. It is therefore wasteful to compute the gradient on the entire dataset. Instead, we can compute the gradient on the first half, update the parameters, and then compute the gradient at the updated parameters using the second half.

These two steps would only require roughly half the computational time in comparison to if we had used the whole data set for each gradient computation.

In practice, we go further and compute the gradient on a mini-batch of training data, a small subset typically containing between 10 and 1000 data points, which strikes a balance between computational efficiency and gradient accuracy.

When using mini-batches, it is important that each mini-batch is balanced and representative of the whole dataset. For instance, if a large dataset is sorted by class, a mini-batch containing only samples from the first class would give a poor approximation of the gradient over the full dataset.

To achieve this, the data points in each mini-batch are chosen at random: we shuffle the training data and then divide it sequentially into mini-batches. One complete pass through the training data is called an epoch. After each epoch, the training data is reshuffled and another pass is made. This procedure is known as stochastic gradient descent, or mini-batch gradient descent, and is summarized in Algorithm 8.

Since the neural network model consists of multiple stacked layers, the gradient of the loss function with respect to all parameters can be computed efficiently using the chain rule of differentiation; this procedure is known as back-propagation. For details, see Goodfellow, Bengio, and Courville (2016, Section 6.5).

Algorithm 8: Mini-batch gradient descent

1. Initialize all the parameters \( \theta_0 \) in the network and set \( t \leftarrow 1 \).
2. For \( i = 1 \) to \( E \):
   (a) Randomly shuffle the training data \( \{(x_i, y_i)\}_{i=1}^{n} \).
   (b) For \( j = 1 \) to \( n/n_b \):
       i. Approximate the gradient of the loss function using the mini-batch \( \{(x_i, y_i)\}_{i=(j-1)n_b+1}^{j n_b} \),
          \( \hat{g}_t = \frac{1}{n_b} \sum_{i=(j-1)n_b+1}^{j n_b} \nabla_\theta L(x_i, y_i, \theta) \big|_{\theta = \theta_t} \).
       ii. Take a gradient step: \( \theta_{t+1} = \theta_t - \gamma \hat{g}_t \).
       iii. Update the iteration index: \( t \leftarrow t + 1 \).
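A direct Python transcription of Algorithm 8, as a sketch (the per-data-point loss gradient grad_L is a placeholder to be supplied by the model):

import numpy as np

def minibatch_gd(theta, X, Y, grad_L, gamma=0.1, n_b=32, E=10, rng=None):
    """Algorithm 8: mini-batch gradient descent.

    grad_L(x, y, theta) returns the gradient of the loss for one data point.
    """
    rng = rng or np.random.default_rng()
    n = len(X)
    for _ in range(E):                        # one pass = one epoch
        idx = rng.permutation(n)              # (a) shuffle the training data
        for j in range(n // n_b):             # (b) loop over the mini-batches
            batch = idx[j * n_b:(j + 1) * n_b]
            g_hat = np.mean([grad_L(X[i], Y[i], theta) for i in batch], axis=0)
            theta = theta - gamma * g_hat     # gradient step
    return theta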

Learning rate

The learning rate \( \gamma \) is a crucial tuning parameter in stochastic gradient descent, determining the size of each gradient step. If the learning rate is too low, the estimate \( \theta_t \) changes very little from one iteration to the next and learning progresses slowly; this is illustrated in Figure 7.12a for a small optimization problem with a single parameter \( \theta \).

A learning rate that is too high instead causes the estimate to overshoot the optimum and possibly never converge, as illustrated in Figure 7.12b. A well-balanced learning rate leads to convergence within a reasonable number of iterations. A practical trial-and-error strategy for finding a good learning rate is:

• if the error keeps getting worse or oscillates widely, reduce the learning rate

• if the error is decreasing fairly consistently but slowly, increase the learning rate.

In standard gradient descent, convergence can be achieved with a constant learning rate, since the gradient itself approaches zero at the optimum. This is not the case for stochastic gradient descent: the mini-batch gradient is only an approximation and does not necessarily approach zero as the objective approaches its minimum, so large updates may persist near the optimum and prevent convergence. To handle this, practitioners typically decay the learning rate, starting at a high value and gradually lowering it to a predetermined level, for instance using
\[ \gamma_t = \gamma_{\min} + (\gamma_{\max} - \gamma_{\min}) e^{-t/\tau}. \]

The learning rate then starts at \( \gamma_{\max} \) and decays towards \( \gamma_{\min} \) as \( t \) increases. Choosing \( \gamma_{\min} \), \( \gamma_{\max} \), and \( \tau \) is more of an art than a science. A rule of thumb is to set \( \gamma_{\min} \) to roughly 1% of \( \gamma_{\max} \), and to choose \( \tau \) (depending on the size of the dataset and the complexity of the problem) such that several epochs have passed before \( \gamma_t \) approaches \( \gamma_{\min} \). The value of \( \gamma_{\max} \) can be chosen following the same trial-and-error principles as for standard gradient descent.
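The decay schedule is straightforward to implement; a small sketch with assumed example values:

import numpy as np

def learning_rate(t, gamma_max=0.1, gamma_min=0.001, tau=2000):
    """Exponentially decaying learning rate from gamma_max towards gamma_min."""
    return gamma_min + (gamma_max - gamma_min) * np.exp(-t / tau)

print(learning_rate(0))        # 0.1 at the first iteration
print(learning_rate(10_000))   # close to gamma_min (~0.0017)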

Under certain regularity conditions, and provided that the Robbins-Monro condition holds (the sum of the learning rates diverges, \( \sum_t \gamma_t = \infty \), while the sum of their squares is finite, \( \sum_t \gamma_t^2 < \infty \)), stochastic gradient descent converges almost surely to a local minimum. The Robbins-Monro condition requires the learning rate to approach zero as the iterations progress. In practice, however, the learning rate is often instead kept at a positive minimum value \( \gamma_{\min} > 0 \), which tends to perform better in many scenarios, even though the theoretical convergence guarantees no longer apply.

Figure 7.12: Minimization of a cost function \( J(\theta) \) with a scalar parameter \( \theta \) using gradient descent: (a) a too low learning rate, (b) a too high learning rate, and (c) a well-chosen learning rate.

Dropout

Neural network models, like other flexible models, can overfit if they are too flexible relative to the complexity of the data. Bagging is an effective technique for reducing variance and overfitting: an ensemble of models is trained on different bootstrapped versions of the original training dataset, each model makes an individual prediction, and the final output is obtained by averaging these predictions.

Bagging can in principle be used with neural networks as well, but it presents certain challenges: a large neural network typically takes a long time to train and has a large number of parameters to store.

Training an ensemble of large neural networks is therefore often prohibitively expensive in both runtime and memory. The dropout technique (Srivastava et al. 2014) offers a solution: it approximately combines many neural networks without training them separately, by letting the different models share parameters, which significantly lowers the computational cost and memory demand.

Dropout creates this ensemble by randomly removing hidden units from the network: each sub-network is obtained by sampling units to drop with a predefined probability, independently between sub-networks. When a unit is dropped, all its incoming and outgoing connections are removed with it. Dropout can be applied not only to hidden units but also to input variables.

All sub-networks stemming from the original network share parameters; for example, the parameter \( \beta_{55}^{(1)} \) in Figure 7.13b appears in several sub-networks. This parameter sharing is what makes it possible to train the whole ensemble efficiently.

To train with dropout, we use the mini-batch gradient descent algorithm described above. At each gradient step, a mini-batch of data is used to approximate the gradient as usual, but instead of computing the gradient for the full network, we first sample a random sub-network.

Concretely, in a network with two hidden layers, units are randomly dropped to form a sub-network, the gradient is computed for that sub-network while ignoring the dropped units, and a gradient step updates only the parameters present in that sub-network. The procedure is then repeated with a new mini-batch of data, another randomly sampled set of dropped units, and an update of the corresponding parameters, and so on until a terminal condition is met.

This procedure to generate an ensemble of models differs from bagging in a few ways:

• In bagging, all models are independent in the sense that they have their own parameters. In dropout, the different models (the sub-networks) share parameters.

• In bagging, each model is trained until convergence, whereas in dropout each sub-network is trained for only a single gradient step. However, since the parameters are shared, all sub-networks are updated indirectly whenever an overlapping sub-network takes a gradient step.

• Like bagging, dropout trains each model on a randomly selected subset of the training data; however, instead of a bootstrapped version of the whole dataset, each sub-network is trained on a randomly chosen mini-batch.

Despite these differences, dropout has empirically been shown to enjoy benefits similar to those of bagging, such as reduced overfitting and lower model variance.

After training, we want to make a prediction for an unseen input \( x \). In bagging we would evaluate all models in the ensemble and average their results, but in dropout this is infeasible: the number of possible sub-networks grows combinatorially, far too many to evaluate.

Figure 7.14 shows the trained network at test time: all units and connections are kept, but the weights going out of each unit are multiplied by the probability \( r \) with which that unit was retained during training (each unit was dropped with probability \( 1 - r \)), to compensate for the dropout used during training.

Instead of evaluating all possible sub-networks, we thus evaluate the full network with all parameters, and compensate for dropout by multiplying each estimated parameter by the probability of the corresponding unit being retained. This ensures that the expected input to each unit is the same at test time as during training, when only a fraction of the incoming links were active. For example, if a unit was retained with probability \( r \) during training, all parameters going out of that unit are multiplied by \( r \) at test time before making a prediction. This simple weight scaling approximates the average over all ensemble members and has proven to work well in practice, even though there is no solid theoretical argument for its accuracy.
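A minimal numpy sketch of dropout for one layer (our own illustration; r = 0.8 is an assumed retention probability):

import numpy as np

rng = np.random.default_rng()
r = 0.8                                   # probability of retaining a unit

def layer_train(h, W, b):
    """Forward pass through one layer during training, with dropout."""
    mask = rng.random(h.shape) < r        # sample which units to retain
    h = h * mask                          # dropped units output zero
    return np.maximum(h @ W + b, 0)       # ReLU activation

def layer_test(h, W, b):
    """At test time: keep all units, scale the outgoing weights by r."""
    return np.maximum(h @ (r * W) + b, 0)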

Dropout is thus a regularization method that reduces variance and prevents overfitting in neural networks. Other regularization techniques include parameter penalties (as in ridge regression and LASSO), early stopping, and various sparse representations (CNNs can be seen as one, since many parameters are forced to be zero). Since its introduction, dropout has become popular because it is simple, computationally cheap at both training and test time, and performs well. Indeed, a good practice for designing a neural network is often to extend it until it overfits, extend it a bit further, and then add dropout to counteract the overfitting.

Perspective and further reading

Although the first conceptual ideas of neural networks date back to the 1940s (McCulloch and Pitts 1943), they had their first main success stories in the late 1980s and early 1990s with the introduction of the back-propagation algorithm, which enabled, for example, the classification of handwritten digits from low-resolution images (LeCun, Boser, et al. 1990). By the late 1990s, however, interest in neural networks had declined: they were considered incapable of solving complex problems in computer vision and speech recognition, where hand-crafted solutions leveraging domain-specific knowledge were preferred instead.

This picture has changed dramatically since the late 2000s, driven by advances in software and hardware and by algorithm parallelization. Deep learning now tackles problems, not least in image processing, that were unimaginable only a few decades ago.

Deep models have become a leading approach in artificial intelligence, achieving near-human performance on various tasks (LeCun, Bengio, and Hinton 2015). Notable advances include algorithms that learn to play computer games using only raw pixel data (Mnih et al. 2015) and models that automatically generate image captions by understanding the image content (Xu et al. 2015).

A fairly recent and accessible introduction and overview of deep learning is provided by LeCun, Bengio, and Hinton (2015), and a recent textbook by Goodfellow, Bengio, and Courville (2016).

Random variables

Marginalization

Let \( z \) be a multivariate random variable partitioned into two parts, \( z = [z_1^T, z_2^T]^T \). If the joint probability density function \( p(z) = p(z_1, z_2) \) is known and we are interested only in the distribution of \( z_1 \), the marginal density \( p(z_1) \) is obtained by marginalization,
\[ p(z_1) = \int p(z_1, z_2) \, dz_2. \]

The other marginal \( p(z_2) \) is obtained analogously by integrating over \( z_1 \) instead. Figure A.1 illustrates a joint two-dimensional density \( p(z_1, z_2) \) together with its marginal densities \( p(z_1) \) and \( p(z_2) \).


Conditioning

Consider again the multivariate random variable \( z \), partitioned into two parts \( z = [z_1^T, z_2^T]^T \). The conditional distribution of \( z_1 \), conditioned on having observed the value \( z_2 \), is defined as
\[ p(z_1 \mid z_2) = \frac{p(z_1, z_2)}{p(z_2)}. \qquad \text{(A.7)} \]

The conditional distribution of \( z_2 \) given an observed value of \( z_1 \) is defined analogously. Figure A.1 illustrates the joint two-dimensional density \( p(z_1, z_2) \) together with the conditional density \( p(z_1 \mid z_2) \).

From (A.7) it follows that the joint probability density function \( p(z_1, z_2) \) can be factorized into the product of a marginal and a conditional,
\[ p(z_1, z_2) = p(z_2 \mid z_1) p(z_1) = p(z_1 \mid z_2) p(z_2). \qquad \text{(A.8)} \]

Using this factorization for the numerator of the right-hand side in (A.7), we end up with the relationship
\[ p(z_1 \mid z_2) = \frac{p(z_2 \mid z_1) p(z_1)}{p(z_2)}. \qquad \text{(A.9)} \]

This equation is often referred to as Bayes' rule.
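For a discrete random variable, the integrals become sums, and marginalization, conditioning, and Bayes' rule can all be illustrated on a small joint probability table (the numbers below are made up for illustration):

import numpy as np

# Joint pmf p(z1, z2): rows index z1 (2 outcomes), columns z2 (3 outcomes)
p_joint = np.array([[0.10, 0.20, 0.10],
                    [0.05, 0.25, 0.30]])

p_z1 = p_joint.sum(axis=1)             # marginalization over z2
p_z2 = p_joint.sum(axis=0)             # marginalization over z1

p_z1_given_z2 = p_joint / p_z2         # conditioning: p(z1 | z2), cf. (A.7)

# Bayes' rule (A.9) gives the same result:
p_z2_given_z1 = p_joint / p_z1[:, None]
bayes = p_z2_given_z1 * p_z1[:, None] / p_z2
print(np.allclose(p_z1_given_z2, bayes))   # True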


A.2 Approximating an integral with a sum

An integral of a smooth function \( h(z) \) against a probability density \( p(z) \) can be approximated by a sum over \( M \) samples in the following fashion:
\[ \int h(z) \, p(z) \, dz \approx \frac{1}{M} \sum_{i=1}^{M} h(z_i), \]
where each sample \( z_i \) is drawn independently from \( p(z) \).

This technique is called Monte Carlo integration. As the number of samples \( M \) approaches infinity, the approximation converges to the exact value of the integral with probability one.
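As a small sketch, consider Monte Carlo integration of \( h(z) = z^2 \) against a standard Gaussian density, for which the exact value \( \mathbb{E}[z^2] = 1 \) is known:

import numpy as np

rng = np.random.default_rng()

M = 100_000
z = rng.standard_normal(M)       # M independent samples from p(z) = N(0, 1)
estimate = np.mean(z ** 2)       # (1/M) * sum of h(z_i) with h(z) = z^2
print(estimate)                  # close to the exact value E[z^2] = 1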

Unconstrained optimization amounts to finding the value of a variable \( \theta \in \mathbb{R}^n \) that minimizes (or maximizes) a cost function \( L(\theta) \), formulated as \( \min_\theta L(\theta) \). Problems of this kind arise throughout science and engineering. A familiar example is linear regression, where the parameters maximizing the likelihood function are found by solving a least squares problem whose explicit solution is given by the normal equations. Most optimization problems, however, lack explicit solutions, and approximate numerical methods must be used instead, as we have seen for deep learning and logistic regression. This appendix introduces the practical aspects of unconstrained numerical optimization.

The key idea behind most optimization algorithms is to build a simple and tractable model of the complicated cost function \( L(\theta) \) around the current value of \( \theta \). The model is typically local, valid only in a neighborhood of the current point, and is used to select a new \( \theta \) with a smaller value of \( L(\theta) \); repeating this yields the iterative procedures common to numerical optimization. While many different methods exist, they share fundamental components, which we outline here. For more detail, we refer to the many textbooks on the subject.

A general iterative solution

For an unconstrained minimization problem, the global minimizer is a point \( \hat{\theta} \) satisfying \( L(\hat{\theta}) \le L(\theta) \) for all \( \theta \in \mathbb{R}^n \). Finding the global minimizer is in general hard, and we often have to settle for local minimizers: a point \( \hat{\theta} \) is a local minimizer if there exists a neighborhood \( \mathcal{M} \) of \( \hat{\theta} \) such that \( L(\hat{\theta}) \le L(\theta) \) for all \( \theta \in \mathcal{M} \).

To find a local minimizer, we start from an initial point \( \theta_0 \). If \( \theta_0 \) is not a local minimizer of \( L(\theta) \), there exists an increment \( d_0 \) such that \( L(\theta_0 + d_0) < L(\theta_0) \). Similarly, if \( \theta_1 = \theta_0 + d_0 \) is not a local minimizer, the procedure can be continued.

1 Note that it is sufficient to cover minimization problems, since any maximization problem can be turned into a minimization problem simply by changing the sign of the cost function.

2 The loss functions discussed throughout the course are examples of cost functions.

There then exists an increment \( d_1 \) such that \( L(\theta_1 + d_1) < L(\theta_1) \), and the process is repeated until no increment can further reduce the value of the cost function, at which point a local minimizer has been found. Most algorithms for this problem are iterative procedures of exactly this form. The increment \( d \) is commonly separated into two parts according to
\[ d = \gamma p, \]

where the scalar, positive parameter \( \gamma \) is called the step length and the vector \( p \in \mathbb{R}^n \) is called the search direction: the algorithm moves a distance \( \gamma \) along the direction \( p \). This raises the following questions:

1. How can we compute a useful search direction \( p \)?

2. How big steps should we take, i.e., what is a good value of the step length \( \gamma \)?

3. How do we determine when we have reached a local minimizer, so that we can stop searching for new directions?

In this section we address these questions and finally assemble a general algorithm commonly employed for unconstrained minimization.

A straightforward way of characterizing all search directions \( p \) that result in a decrease of the cost function, i.e., directions such that
\[ L(\theta + p) < L(\theta), \]
is to build a local model of \( L(\theta) \) around the point \( \theta \). Taylor's theorem provides such a local polynomial approximation of a function around a point of interest; a linear approximation of the cost function around \( \theta \) is
\[ L(\theta + p) \approx L(\theta) + p^T \nabla L(\theta). \]
Inserting this approximation into the inequality above, we require \( L(\theta) + p^T \nabla L(\theta) < L(\theta) \), which simplifies to the condition
\[ p^T \nabla L(\theta) < 0. \]

To add flexibility, we define the search direction as \( p = -V \nabla L(\theta) \), where \( V \) is a positive definite scaling matrix. Inserting this into the condition above gives
\[ p^T \nabla L(\theta) = -\nabla L(\theta)^T V \nabla L(\theta) = -\| \nabla L(\theta) \|_V^2 < 0, \]
which shows that any direction of the form \( p = -V \nabla L(\theta) \) decreases the value of the cost function. Such a direction is called a descent direction.

Algorithm 9 outlines the resulting line search strategy; the subscript \( t \) emphasizes its iterative nature. The algorithm starts from the current iterate \( \theta_t \) and moves along the search direction \( p_t \); how far to move along this line is determined by minimizing the cost function along it,
\[ \min_{\gamma} L(\theta_t + \gamma p_t). \]
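Exact minimization along the line is rarely done in practice; a common substitute is a backtracking line search. The following sketch (with assumed constants) combines a descent direction with backtracking:

import numpy as np

def backtracking_step(L, grad_L, theta, p, gamma=1.0, rho=0.5, c=1e-4):
    """Shrink the step length until a sufficient decrease is obtained."""
    g = grad_L(theta)
    while L(theta + gamma * p) > L(theta) + c * gamma * p @ g:
        gamma *= rho                     # halve the step and try again
    return theta + gamma * p

# Example: minimize L(theta) = ||theta||^2 with steepest descent directions
L = lambda th: th @ th
grad_L = lambda th: 2 * th
theta = np.array([3.0, -2.0])
for _ in range(20):
    theta = backtracking_step(L, grad_L, theta, -grad_L(theta))
print(theta)                             # close to the minimizer [0, 0]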

Commonly used search directions

Steepest descent direction

The descent condition requires the scalar product \( p^T \nabla L(\theta_t) = \|p\|_2 \, \|\nabla L(\theta_t)\|_2 \cos\varphi < 0 \), where \( \varphi \) is the angle between \( p \) and \( \nabla L(\theta_t) \).3 If we fix the length of \( p \), the scalar product is minimized by choosing \( \varphi = \pi \), i.e., by selecting
\[ p = -\nabla L(\theta_t). \qquad \text{(B.10)} \]

The gradient vector at a given point indicates the direction of the steepest ascent of a function, which is why the search direction mentioned in (B.10) is known as the steepest descent direction.

3 The scalar (or dot) product of two vectors \( a \) and \( b \) is defined as \( a^T b = \|a\| \|b\| \cos\varphi \), where \( \|a\| \) denotes the length (magnitude) of the vector \( a \) and \( \varphi \) denotes the angle between \( a \) and \( b \).

The steepest descent method can be inefficient, since it makes use of rather little information about the cost function. The Newton and quasi-Newton methods instead exploit additional information about the local geometry of the cost function by employing a richer local model.

Newton direction

We obtain a better model of the cost function by also including the quadratic term of the Taylor expansion. This gives a quadratic approximation \( m(\theta_t, p) \) of the cost function around the current iterate \( \theta_t \),

\[ L(\theta_t + p) \approx m(\theta_t, p) = L(\theta_t) + p^T g_t + \tfrac{1}{2} p^T H_t p, \qquad \text{(B.11)} \]
where \( g_t = \nabla L(\theta)|_{\theta = \theta_t} \) denotes the gradient and \( H_t = \nabla^2 L(\theta)|_{\theta = \theta_t} \) the Hessian of the cost function, both evaluated at the current iterate \( \theta_t \). The idea of the Newton direction is to select the search direction that minimizes the quadratic model (B.11), obtained by setting its derivative
\[ \frac{\partial m(\theta_t, p)}{\partial p} = g_t + H_t p \qquad \text{(B.12)} \]
to zero, resulting in
\[ p_t = -H_t^{-1} g_t. \qquad \text{(B.13)} \]
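A sketch of a Newton step for a toy two-dimensional problem (the cost function below is an assumed example):

import numpy as np

def newton_step(grad, hess, theta):
    """One Newton iteration: solve H_t p = -g_t and update theta, cf. (B.13)."""
    p = np.linalg.solve(hess(theta), -grad(theta))
    return theta + p

# Example: L(theta) = (theta_1 - 1)^2 + 10 * (theta_2 + 2)^2
grad = lambda th: np.array([2 * (th[0] - 1), 20 * (th[1] + 2)])
hess = lambda th: np.diag([2.0, 20.0])
theta = np.zeros(2)
theta = newton_step(grad, hess, theta)
print(theta)    # [1, -2]: a quadratic cost is minimized in a single step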

Computing the Hessian can be both difficult and computationally costly. This has motivated search directions that use an approximation of the Hessian instead; such methods are collectively referred to as quasi-Newton methods.

Quasi-Newton

The quasi-Newton methods use a local quadratic model of the cost function, just as in the derivation of the Newton direction, but instead of requiring the Hessian itself, they learn an approximation of it from the information available in the cost function values and gradients.

Let us first denote the line segment connecting two adjacent iterates \( \theta_t \) and \( \theta_{t+1} \) by
\[ r_t(\tau) = \theta_t + \tau(\theta_{t+1} - \theta_t), \qquad \tau \in [0, 1]. \qquad \text{(B.14)} \]

From the fundamental theorem of calculus we know that
\[ \int_0^1 \frac{\partial}{\partial \tau} \nabla L(r_t(\tau)) \, d\tau = \nabla L(r_t(1)) - \nabla L(r_t(0)) = \nabla L(\theta_{t+1}) - \nabla L(\theta_t) = g_{t+1} - g_t, \qquad \text{(B.15)} \]
and from the chain rule we have that
\[ \frac{\partial}{\partial \tau} \nabla L(r_t(\tau)) = \nabla^2 L(r_t(\tau)) (\theta_{t+1} - \theta_t). \qquad \text{(B.16)} \]
Hence, combining (B.15) and (B.16), we obtain
\[ y_t = \int_0^1 \nabla^2 L(r_t(\tau)) \, s_t \, d\tau, \qquad \text{(B.17)} \]
where we have defined \( y_t = g_{t+1} - g_t \) and \( s_t = \theta_{t+1} - \theta_t \). The interpretation of this equation is that the difference \( y_t \) between two consecutive gradients is obtained by integrating the Hessian times \( s_t \) along the line segment between the two iterates.


The quasi-Newton methods approximate this integral of the Hessian along the line segment by a constant matrix \( B_{t+1} \), leading to the equation
\[ y_t = B_{t+1} s_t, \]
known as the secant condition or the quasi-Newton equation. The secant condition alone does not uniquely determine \( B_{t+1} \): even with the symmetry requirement \( B_{t+1} = B_{t+1}^T \), there remain too many degrees of freedom. To resolve this, regularization is used to select \( B_{t+1} \) as the solution of

\[ \min_{B} \| B - B_t \|_W^2 \qquad \text{subject to} \quad B = B^T, \quad B s_t = y_t, \]
where \( W \) is a weighting matrix. Different choices of \( W \) result in different algorithms; the most prominent quasi-Newton methods are BFGS, DFP, and Broyden's method. These algorithms use the updated Hessian approximation \( B_{t+1} \) in place of the true Hessian when computing the search direction, which greatly improves computational efficiency.
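In practice one rarely implements BFGS by hand. As a usage sketch, SciPy's implementation can be called as follows on the assumed toy cost function from the Newton example above:

import numpy as np
from scipy.optimize import minimize

L = lambda th: (th[0] - 1) ** 2 + 10 * (th[1] + 2) ** 2
res = minimize(L, x0=np.zeros(2), method='BFGS')   # quasi-Newton (BFGS)
print(res.x)    # close to the minimizer [1, -2]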

Further reading

This appendix is inspired by the works of Nocedal and Wright (2006) and Wills (2017) on numerical solutions to optimization problems. An important first step when solving an optimization problem is to classify it as convex or non-convex; the discussion here has centered on non-convex problems. For convex problems, Boyd and Vandenberghe (2004) offer an excellent engineering-oriented treatment. Bottou, Curtis, and Nocedal (2017) give a comprehensive introduction to numerical optimization in machine learning, with emphasis on large-scale problems, which naturally leads to the stochastic optimization problems discussed in the deep learning chapter.

Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin (2012). Learning From Data: A Short Course. AMLbook.com.

Barber, David (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.

Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.

Bottou, L., F. E. Curtis, and J. Nocedal (2017). Optimization Methods for Large-Scale Machine Learning.

Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge, UK: Cambridge University Press.

Breiman, Leo (Oct. 2001). "Random Forests". In: Machine Learning 45.1, pp. 5–32.

Deisenroth, M. P., A. Faisal, and C. S. Ong (2019). Mathematics for Machine Learning. Cambridge University Press.

Dheeru, Dua and Efi Karra Taniskidou (2017). UCI Machine Learning Repository. URL: http://archive.ics.uci.edu/ml.

Efron, Bradley and Trevor Hastie (2016). Computer Age Statistical Inference. Cambridge University Press.

Ezekiel, Mordecai and Karl A. Fox (1959). Methods of Correlation and Regression Analysis. John Wiley & Sons.

Freund, Yoav and Robert E. Schapire (1996). "Experiments with a new boosting algorithm". In: Proceedings of the 13th International Conference on Machine Learning (ICML).

Friedman, Jerome (2001). "Greedy function approximation: a gradient boosting machine". In: The Annals of Statistics 29.5, pp. 1189–1232.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2000). "Additive logistic regression: a statistical view of boosting (with discussion)". In: The Annals of Statistics 28.2, pp. 337–407.

Gelman, Andrew et al. (2013). Bayesian Data Analysis. 3rd ed. CRC Press.

Ghahramani, Zoubin (May 2015). "Probabilistic machine learning and artificial intelligence". In: Nature 521, pp. 452–459.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. http://www.deeplearningbook.org. MIT Press.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.

Hastie, Trevor, Robert Tibshirani, and Martin J. Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.

Hoerl, Arthur E. and Robert W. Kennard (1970). "Ridge regression: biased estimation for nonorthogonal problems". In: Technometrics 12.1, pp. 55–67.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013). An Introduction to Statistical Learning: With Applications in R. Springer.

Jordan, M. I. and T. M. Mitchell (2015). "Machine learning: trends, perspectives, and prospects". In: Science 349.6245, pp. 255–260.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). "Deep learning". In: Nature 521, pp. 436–444.

LeCun, Yann, Bernhard Boser, et al. (1990). "Handwritten Digit Recognition with a Back-Propagation Network". In: Advances in Neural Information Processing Systems (NIPS), pp. 396–404.

MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Ngày đăng: 20/10/2022, 07:32