Supervised Machine Learning. Lecture notes for the Statistical Machine Learning course. Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, Thomas B. Schön. Version March 12, 2019. Department of Information Technology, Uppsala University.
1.1 What is machine learning all about?
Machine learning enables computers to learn from data without explicit programming, using mathematical models to uncover unknown variables. While a simple example is fitting a straight line to data, machine learning often employs more complex models. The primary goal is to enable conclusions to be drawn from new, unseen data: for instance, a model trained on 1,000 puppy images should be able to determine whether a new, previously unseen image depicts a puppy. This ability is called generalization.
The science of machine learning is about learning models that generalize well.
Supervised learning involves working with labeled data, represented as pairs {x_i, y_i}, where x_i denotes inputs and y_i denotes outputs. This setting is exemplified by medical diagnostics, such as using an electrocardiogram (ECG) to assess heart disease: the ECG reading serves as the input x, while the doctor's diagnosis is the output y. From a large dataset of labeled ECG readings and corresponding diagnoses, supervised machine learning can learn a model of the relationship between x and y. Once trained, this model can predict the diagnosis (denoted \( \hat{y} \)) for a new ECG reading, the aim being predictions that generalize well beyond the training data.
Supervised learning faces a significant challenge in its reliance on labeled data, which is costly and often difficult to obtain, since it requires human interpretation of inputs and outputs. The issue is compounded by the fact that advanced methods typically require large datasets to achieve good performance. Consequently, there is growing interest in unsupervised learning techniques, which use only unlabeled input data; a key example is clustering, which organizes data into groups based on similarity. Semi-supervised learning has also emerged as a valuable middle ground, leveraging both labeled and unlabeled data.
Access to a vast amount of unlabeled data paired with only a small quantity of labeled data is a common situation, and the small labeled dataset can be extremely valuable when combined with the larger unlabeled one.
In reinforcement learning, another key branch of machine learning, the focus extends beyond merely using measured data for predictions or situational understanding: the system learns to act through interaction and feedback, optimizing its decision-making.
1 Some common synonyms used for the input variable include feature, predictor, regressor, covariate, explanatory variable, controlled variable and independent variable.
2 Some common synonyms used for the output variable include response, regressand, label, explained variable, predicted variable, and dependent variable.

The objective in reinforcement learning is a system that learns to take actions in the real world, typically by maximizing a reward designed to encourage desired states of the environment. Reinforcement learning is closely related to control theory. Finally, we mention the emerging field of causal learning, which addresses the harder problem of learning cause-and-effect relationships, going beyond the associations and correlations typically explored in other areas of machine learning.
1.2 Regression and classification
Supervised machine learning methods can be categorized by the type of output variable they handle, distinguishing between quantitative and qualitative variables. Knowing when to treat a variable as quantitative or qualitative is essential for selecting an appropriate method for a given problem; Table 1.1 gives examples of both types.
Table 1.1: Examples of quantitative and qualitative variables.
Variable type                                 Example                                     Handle as
Numeric (continuous)                          32.23 km/h, 12.50 km/h, 42.85 km/h          Quantitative
Numeric (discrete) with natural ordering      0 children, 1 child, 2 children             Quantitative
Numeric (discrete) without natural ordering   1 = Sweden, 2 = Denmark, 3 = Norway         Qualitative
Text (not numeric)                            Uppsala University, KTH, Lund University    Qualitative
Depending on whether the output of a problem is quantitative or qualitative, we refer to the problem as either regression or classification.
Regression means the output is quantitative, and classification means the output is qualitative.
This means that whether a problem is one of regression or classification depends only on its output. The input can be either quantitative or qualitative in both cases.
The distinction between quantitative and qualitative, and thereby between regression and classification, is partly subjective. For instance, one might argue that having no children is qualitatively different from having children, and handle the output as the binary "children: yes/no" rather than the count "0, 1, or 2 children", thereby recasting a regression problem as a classification problem.
1.3 Overview of these lecture notes
The following sketch gives an idea of how the chapters are connected.
Chapter 2: The regression problem and linear regression
Chapter 3: The classification problem and three parametric classifiers
Chapter 4: Non-parametric methods for regression and classification: k-NN and trees
Chapter 5: How well does a method perform?
Chapter 6: Ensemble methods
Chapter 7: Neural networks and deep learning
(In the sketch, arrows mark which earlier chapters are needed and which are recommended background for each later chapter.)
1.4 Further reading
There are numerous comprehensive textbooks on machine learning that present the subject in various ways. Notably, Hastie, Tibshirani, and Friedman (2009) offer a mathematically rigorous yet accessible introduction to statistical machine learning, followed by a lighter version in 2013 that effectively conveys the core concepts. While these works do not delve deeply into Bayesian methods, several complementary texts, such as those by Barber (2012), Bishop (2006), and Murphy (2012), cover this area well. MacKay (2003) provides an early perspective linking machine learning to information theory, which remains relevant today. Efron and Hastie (2016) adopt a historical lens to explore how data analysis evolved with the advent of computers. For a contemporary mathematical introduction to machine learning, Deisenroth, Faisal, and Ong (2019) is recommended, along with the overview papers by Ghahramani (2015) and Jordan and Mitchell (2015).
The field of machine learning is currently thriving. Key annual conferences include the International Conference on Machine Learning (ICML) and the Conference on Neural Information Processing Systems (NeurIPS), whose proceedings are freely available through their respective websites, as well as the International Conference on Artificial Intelligence and Statistics (AISTATS) and the International Conference on Learning Representations (ICLR). The Journal of Machine Learning Research (JMLR) is a leading journal in the field, as is the IEEE Transactions on Pattern Analysis and Machine Intelligence. There is also quite a lot of relevant work published in statistical journals, in particular in the area of computational statistics.
2 The regression problem and linear regression
The regression problem is the first of the two main problems we study, alongside classification. Linear regression is a basic method for addressing it. Despite its simplicity, linear regression is highly effective and lays the groundwork for more advanced techniques, including deep learning, which is discussed in Chapter 7.
2.1 The regression problem
Regression is about learning the relationships between some input variables, which can be either qualitative or quantitative, and a quantitative output variable. Mathematically, the model is \( y = f(x) + \varepsilon \), where \( \varepsilon \) is a noise or error term that accounts for everything not captured by the model. Statistically, \( \varepsilon \) is treated as a random variable that is independent of the input variables \( x \) and has mean zero.
In this chapter we will use the car stopping distances dataset from Example 2.1 to illustrate regression. The objective is a regression model that predicts the distance a car needs to come to a full stop, given its current speed.
Ezekiel and Fox (1959) provide a dataset of 62 observations of the stopping distances of various cars from different initial speeds. The dataset contains two variables:
- Speed: The speed of the car when the brake signal is given.
- Distance: The distance traveled after the signal is given until the car has reached a full stop.
We decide to interpret Speed as the input variable x, and Distance as the output variable y.
Our objective is to use linear regression to predict the stopping distance for initial speeds of 33 mph and 45 mph, for which no data was recorded. (The dataset is rather old, so the reader is encouraged to apply the conclusions to their favorite modern car instead.)
1 We will start with quantitative input variables, and discuss qualitative input variables later in Section 2.5.
2.2 The linear regression model
Describe relationships — classical statistics
In fields like medicine and sociology, a common question is whether a correlation exists between certain variables, such as the effect of seafood consumption on longevity. This can be analyzed through the parameters of a linear regression model, in particular a coefficient β1: if β1 = 0, then (unless other factors influence x1) there is no correlation between x1 and y. By estimating β1 and constructing a confidence interval around it, researchers can judge whether x1 and y are correlated; if zero is not contained in the interval, a correlation is likely present. This approach, hypothesis testing, is a cornerstone of classical statistics. Our primary focus here, however, will be on using the linear regression model for prediction.
Predicting future outputs — machine learning
In machine learning, the focus is on predicting unseen outputs for new inputs \( x_\star = [x_{\star 1}, x_{\star 2}, \ldots, x_{\star p}]^T \). To generate a prediction for a test input \( x_\star \), we insert it into the model. Since the error term \( \varepsilon \) has mean zero, the prediction is \( \hat{y}(x_\star) = \beta_0 + \beta_1 x_{\star 1} + \beta_2 x_{\star 2} + \cdots + \beta_p x_{\star p} \).
In our notation, the hat symbol \( \hat{y} \) denotes a prediction, i.e., our best estimate. If we could instead directly observe the actual output corresponding to \( x_\star \), it would be denoted simply \( y_\star \) (without a hat).
2.3 Learning the model from training data
2.3.1 Maximum likelihood
Our approach to learning the unknown parameters β from the training data \( \mathcal{T} = \{(x_i, y_i)\}_{i=1}^n \) is the maximum likelihood method. The idea is to maximize the likelihood function \( p(y \mid X, \beta) \), the probability density of the observed data y for given parameter values β; "maximum likelihood" thus means finding the β under which the observed data is as probable as possible. The result is the learned parameter vector \( \hat{\beta} = [\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p]^T \), compactly written as

\( \hat{\beta} = \arg\max_{\beta} \; p(y \mid X, \beta). \)  (2.8)
To make "likely" mathematically precise, we must make specific assumptions about the noise term ε. A common assumption is that ε is Gaussian with mean zero and variance \( \sigma_\varepsilon^2 \), i.e., \( \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2) \).
This implies that the conditional probability density function of the output y for a given value of the input x is

\( p(y \mid x, \beta) = \mathcal{N}\!\left(y \mid \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,\; \sigma_\varepsilon^2\right). \)  (2.11)
Furthermore, the n observed training data points are assumed to be independent realizations from this statistical model. This implies that the likelihood of the training data factorizes as

\( p(y \mid X, \beta) = \prod_{i=1}^{n} p(y_i \mid x_i, \beta). \)  (2.12)
Putting (2.11) and (2.12) together, we get

\( p(y \mid X, \beta) = \frac{1}{(2\pi\sigma_\varepsilon^2)^{n/2}} \exp\!\left( -\frac{1}{2\sigma_\varepsilon^2} \sum_{i=1}^{n} \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} - y_i \right)^2 \right). \)  (2.13)
To maximize the likelihood (2.13) with respect to β, as posed in (2.8), note that (2.13) depends on β only through the sum in the exponent. Since the exponential function is monotonically increasing, maximizing (2.13) is equivalent to minimizing that sum.
That is, we minimize the sum of the squared differences between the observed outputs \( y_i \) and the model's predicted outputs \( \hat{y}_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \),

\( \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} - y_i \right)^2. \)  (2.14)

Minimizing this sum of squares is why the method is known as least squares.
We return below to how the values \( \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p \) are computed. Note that while a Gaussian distribution is the most common assumption for the error term ε, other assumptions can be useful. Assuming a Laplace distribution for ε, for instance, leads to a cost function with the sum of the absolute values of all differences instead of their squares. A key advantage of the Gaussian assumption is that it gives a closed-form solution for \( \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p \); alternative assumptions on ε typically require computationally more expensive methods.
Remark 2.2 With the terminology we will introduce in the next chapter, we could refer to (2.13) as the likelihood function, which we denote by \( \ell(\beta) \).
Remark 2.3 It is not uncommon in the literature to skip the maximum likelihood motivation, and just state (2.14) as a (somewhat arbitrary) cost function for optimization.
Least squares and the normal equations
Assuming the noise/error ε is Gaussian, the maximum likelihood parameters \( \hat{\beta} \) are given by the optimization problem (2.14), illustrated in Figure 2.2. The least squares problem can be written compactly in matrix and vector notation as

\( \min_{\beta_0, \beta_1, \ldots, \beta_p} \; \|X\beta - y\|_2^2, \)  (2.16)

where \( \|\cdot\|_2 \) denotes the Euclidean vector norm. From a linear algebra perspective, this amounts to finding the vector in the subspace of \( \mathbb{R}^n \) spanned by the columns of X that is closest to y. The solution is the orthogonal projection of y onto this subspace, and the corresponding \( \hat{\beta} \) can be shown (Section 2.A) to satisfy
\( X^T X \hat{\beta} = X^T y. \)  (2.17)

Equation (2.17) is often referred to as the normal equations, and gives the solution to the least squares problem (2.14), (2.16). If \( X^T X \) is invertible, which often is the case, \( \hat{\beta} \) has the closed form

\( \hat{\beta} = (X^T X)^{-1} X^T y. \)  (2.18)
The existence of a closed-form solution is important, and contributes to the popularity and widespread use of least squares. Other assumptions on the errors (beyond Gaussianity) lead to complications, including cases where no closed-form solution exists.
Time to reflect 2.1: What does it mean in practice that \( X^T X \) is not invertible?
Figure 2.2: The least squares criterion chooses the model (blue line) that minimizes the sum of the squares (orange) of each individual error (green); it is the total orange area that is minimized, hence the name "least squares".
If the columns of X are linearly independent and p = n − 1, then X spans all of \( \mathbb{R}^n \) and there is a unique solution with \( y = X\beta \) exactly; the model then fits the training data perfectly, and (2.18) simplifies to \( \hat{\beta} = X^{-1} y \). A perfect fit is, however, not always desirable, since it may mean the model captures noise rather than the underlying pattern, a phenomenon known as overfitting.
By inserting the matrices (2.7) from Example 2.2 into the normal equations (2.17), we obtain \( \hat{\beta}_0 = -20.1 \) and \( \hat{\beta}_1 = 3.1 \), i.e., a straight-line fit to the data.
With this model, the predicted stopping distance for \( x_\star = 33 \) mph is \( \hat{y}_\star = 84 \) feet, and for \( x_\star = 45 \) mph it is \( \hat{y}_\star = 121 \) feet.
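As a concrete illustration, the following Python sketch solves the least squares problem and reproduces the structure of this example. The speed and distance arrays are placeholders: the actual 62 observations from Ezekiel and Fox (1959) are not reproduced here, so the printed numbers will differ from those above.

```python
import numpy as np

# Placeholder values standing in for the 62 observations of Example 2.1.
speed = np.array([4.0, 7.0, 9.0, 12.0, 16.0, 20.0])       # mph (hypothetical)
distance = np.array([4.0, 10.0, 17.0, 26.0, 40.0, 60.0])  # feet (hypothetical)

n = len(speed)
X = np.column_stack([np.ones(n), speed])  # design matrix; the constant column gives beta_0

# Solve the least squares problem (2.16); np.linalg.lstsq is numerically
# preferable to explicitly forming (X^T X)^{-1} X^T y as in (2.18).
beta_hat, *_ = np.linalg.lstsq(X, distance, rcond=None)

# Predict the stopping distance for the test inputs 33 mph and 45 mph.
X_star = np.array([[1.0, 33.0], [1.0, 45.0]])
y_star = X_star @ beta_hat
print(beta_hat, y_star)
```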
2.4 Nonlinear transformations of the inputs – creating more features
Linear regression owes its name to the fact that the output is modeled as a linear combination of the inputs 2. But which variables count as inputs is up to us: if the speed of a car is an input, we may for instance also use its kinetic energy, a nonlinear transformation of the speed, as an input. We are thus free to make arbitrary nonlinear transformations of the original inputs and use these in the linear regression model, even for one-dimensional data.
2 And also the constant 1, corresponding to the offset β0. For this reason, affine would perhaps be a better term than linear.
(a) The maximum likelihood solution for a linear regression model with a 2nd order polynomial in x. The curve looks non-linear, but this is only an effect of the plot: plotted in three dimensions, with each feature (x and x²) on its own axis, the model is still an affine set.
(b) The maximum likelihood solution for a linear regression model with a 4th order polynomial in x. The model has five unknown coefficients, and can therefore fit five data points exactly.
Figure 2.3: A linear regression model with 2nd and 4th order polynomials in the input x, as in (2.20).

For a one-dimensional input x, the vanilla linear regression model is

\( y = \beta_0 + \beta_1 x + \varepsilon. \)  (2.19)
However, we can also extend the model with, for instance, \( x^2, x^3, \ldots, x^p \) as inputs, and thus obtain a linear regression model which is a polynomial in x,

\( y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \varepsilon. \)  (2.20)
This is still a linear regression model, since the unknown parameters appear linearly in terms of the inputs x, x², …, xᵖ. The parameters \( \hat{\beta} \) are learned exactly as before, but the matrix X differs between models (2.19) and (2.20). The transformed inputs are referred to as features, and in more complex settings the distinction between original inputs and transformed features may blur, so the terms feature and input are sometimes used interchangeably.
Figure 2.3 shows two linear regression models with transformed (polynomial) inputs, which raises the question of how a linear regression can produce a curved line. The answer lies in the dimensionality of the plot: in the two-dimensional plot of Figure 2.3(a), with the original input x against y, the model looks curved, but in a three-dimensional plot with x, x², and y on separate axes, the model of Figure 2.3(a) is still affine. Similarly, a five-dimensional plot would be needed to see the model of Figure 2.3(b) as affine.
While the model in Figure 2.3(b) fits all data points perfectly, it also shows that higher-order polynomials tend to behave strangely between and beyond the data points, which makes them impractical in machine learning; they are therefore seldom used. A more common choice of feature is the radial basis function (RBF) kernel
\( K_c(x) = \exp\!\left( -\frac{(x - c)^2}{2\ell^2} \right), \)  (2.21)

i.e., a Gauss bell centered around c with length scale ℓ. It can be used, instead of polynomials, in the linear regression model as

\( y = \beta_0 + \beta_1 K_{c_1}(x) + \beta_2 K_{c_2}(x) + \cdots + \beta_p K_{c_p}(x) + \varepsilon. \)  (2.22)
This model places "bumps" at the locations c1, c2, …, cp; the user has to decide these locations, as well as the length scale ℓ. Only the parameters β0, β1, …, βp are learned from the data in the linear regression, as illustrated in Figure 2.4. RBF kernels are usually preferred over polynomials because of their local nature: a small change in one parameter affects the model mainly in the vicinity of that kernel, whereas in a polynomial model a small change in one parameter affects the model everywhere.
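A minimal sketch of how RBF features of the form (2.21) might be built and fitted by least squares; the data, kernel centers, and length scale below are hypothetical.

```python
import numpy as np

def rbf_feature(x, c, ell):
    """Gaussian RBF kernel of (2.21): a bell centered at c with length scale ell."""
    return np.exp(-((x - c) ** 2) / (2.0 * ell ** 2))

# Hypothetical one-dimensional data and user-chosen kernel placement.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(30)   # stand-in data for illustration
centers = np.linspace(0.0, 10.0, 8)             # c_1, ..., c_p chosen by the user
ell = 1.0                                       # length scale chosen by the user

# Feature matrix of (2.22): a constant column plus one RBF column per center.
X = np.column_stack([np.ones_like(x)] + [rbf_feature(x, c, ell) for c in centers])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares fit of beta_0, ..., beta_p
```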
We continue with Example 2.1, but this time we also add the squared speed as a feature, i.e., the features are now x and x². This gives the new matrices (cf. (2.7))
(2.23). When we insert them into the normal equations (2.17), the new parameter estimates are \( \hat{\beta}_0 = 1.58 \), \( \hat{\beta}_1 = 0.42 \) and \( \hat{\beta}_2 = 0.07 \). (Note that \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) change, compared to Example 2.3.) The new model is \( y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \).
The model predicts a stopping distance of 87 feet at 33 mph and 153 feet at 45 mph, which can be contrasted with the predictions in Example 2.3. We cannot claim from the data alone that this is the "true model", but a visual comparison suggests that this model, with more features, follows the data better. To systematically evaluate different choices of features, rather than relying on visual comparisons, cross-validation can be used, as detailed in Chapter 5.
Figure 2.4: A linear regression model with RBF kernels as features, each positioned at one of the centers c1, c2, c3, and c4. When the model is learned, the parameters β0, β1, …, βp are chosen such that the sum of all kernels (solid blue line) fits the data well, here by least squares.
Polynomials and RBF kernels are two special cases of nonlinear input transformations; the term "features" is used to distinguish the transformed inputs from the original ones. A systematic way of choosing among candidate features is to compare them using cross-validation (Chapter 5).
2.5 Qualitative input variables
The regression problem is characterized by a quantitative output y, but the inputs x can be either quantitative or qualitative. So far we have only considered quantitative inputs, but qualitative inputs are perfectly possible as well.
Assume we have a qualitative input variable that takes one of two values, say type A and type B. We can then create a dummy variable x as

\( x = \begin{cases} 0 & \text{if type A} \\ 1 & \text{if type B} \end{cases} \)  (2.24)

and use this variable in the linear regression model. This effectively gives us a linear regression model which looks like

\( y = \beta_0 + \beta_1 x + \varepsilon = \begin{cases} \beta_0 + \varepsilon & \text{if type A} \\ \beta_0 + \beta_1 + \varepsilon & \text{if type B.} \end{cases} \)  (2.25)
The choice of values for the dummy variable is arbitrary; we could, for instance, use x = 1 and x = −1 instead. The approach also generalizes to qualitative input variables which take more than two values, say types A, B, C and D. With four values, we create three dummy variables,

\( x_1 = \begin{cases} 1 & \text{if type B} \\ 0 & \text{if not type B,} \end{cases} \quad x_2 = \begin{cases} 1 & \text{if type C} \\ 0 & \text{if not type C,} \end{cases} \quad x_3 = \begin{cases} 1 & \text{if type D} \\ 0 & \text{if not type D,} \end{cases} \)  (2.26)

which, altogether, gives the linear regression model

\( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon = \begin{cases} \beta_0 + \varepsilon & \text{if type A} \\ \beta_0 + \beta_1 + \varepsilon & \text{if type B} \\ \beta_0 + \beta_2 + \varepsilon & \text{if type C} \\ \beta_0 + \beta_3 + \varepsilon & \text{if type D.} \end{cases} \)  (2.27)
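A small sketch of the dummy-variable construction (2.26); the `types` list is a hypothetical example.

```python
import numpy as np

# Hypothetical qualitative input with four values, encoded as in (2.26);
# type A is the reference value, represented by all dummies being zero.
types = ["A", "B", "B", "C", "D", "A"]
x1 = np.array([1.0 if t == "B" else 0.0 for t in types])
x2 = np.array([1.0 if t == "C" else 0.0 for t in types])
x3 = np.array([1.0 if t == "D" else 0.0 for t in types])

# Design matrix for the model (2.27): constant column plus the three dummies.
X = np.column_stack([np.ones(len(types)), x1, x2, x3])
```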
Qualitative inputs can be handled similarly in other problems and methods as well, such as logistic regression, k-NN, deep learning, etc.
2.6 Regularization
Ridge regression
In ridge regression (also known as Tikhonov regularization, L2 regularization, or weight decay) the least squares criterion (2.16) is replaced with the modified minimization problem

\( \min_{\beta_0, \beta_1, \ldots, \beta_p} \; \|X\beta - y\|_2^2 + \gamma \|\beta\|_2^2. \)  (2.28)
The regularization parameter γ ≥ 0 has to be chosen by the user. With γ = 0 we recover the original least squares problem, whereas as γ → ∞ all parameters βj are forced towards zero. A good value of γ usually lies somewhere in between and depends on the problem; it can be found by manual tuning or, more systematically, by cross-validation.
It is actually possible to derive a version of the normal equations (2.17) for (2.28), namely

\( \left( X^T X + \gamma I_{p+1} \right) \hat{\beta} = X^T y, \)  (2.29)

where \( I_{p+1} \) is the identity matrix of size (p+1) × (p+1). If γ > 0, the matrix \( X^T X + \gamma I_{p+1} \) is always invertible, and we have the closed form solution

\( \hat{\beta} = \left( X^T X + \gamma I_{p+1} \right)^{-1} X^T y. \)  (2.30)
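Since (2.30) is in closed form, a sketch of ridge regression takes only a few lines; `ridge_fit` is a hypothetical helper written for illustration, not a library function.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Solve the regularized normal equations (2.29), giving (2.30).
    X has p+1 columns (including the constant column); gamma > 0 guarantees
    that X^T X + gamma*I is invertible."""
    p1 = X.shape[1]
    A = X.T @ X + gamma * np.eye(p1)
    return np.linalg.solve(A, X.T @ y)  # preferable to forming the inverse explicitly
```

In practice γ would be selected by cross-validation, e.g., by calling `ridge_fit` for a grid of γ values and comparing held-out errors.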
LASSO
LASSO, the Least Absolute Shrinkage and Selection Operator, instead uses L1 regularization: the least squares criterion is replaced by the minimization problem

\( \min_{\beta_0, \beta_1, \ldots, \beta_p} \; \|X\beta - y\|_2^2 + \gamma \|\beta\|_1, \)  (2.31)

where \( \|\cdot\|_1 \) denotes the 1-norm (Manhattan norm). Unlike ridge regression, (2.31) has no closed-form solution, but it is a convex problem that can be solved efficiently by numerical optimization.
As in ridge regression, the user has to choose the regularization parameter γ: γ = 0 gives the least squares problem, and γ → ∞ forces all parameters to zero. The two methods behave differently in between, however. Ridge regression shrinks all parameters towards small (but typically non-zero) values, whereas LASSO favors sparse solutions in which only a few parameters are non-zero and the rest are exactly zero. LASSO can thus "switch some of the inputs off" by setting the corresponding parameters to zero, and it can therefore be used as an input (or feature) selection method.
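LASSO (2.31) has no closed form, but a simple proximal gradient (ISTA) iteration can solve it. The sketch below is a minimal illustration of that idea, not how dedicated LASSO solvers (which typically use coordinate descent) are implemented; for simplicity it also penalizes β0, which one often would not.

```python
import numpy as np

def lasso_fit(X, y, gamma, n_iter=5000):
    """Minimize ||X beta - y||_2^2 + gamma * ||beta||_1, i.e., (2.31),
    by proximal gradient descent (ISTA). A minimal sketch without a stopping rule."""
    beta = np.zeros(X.shape[1])
    # Step size from the Lipschitz constant of the smooth part, 2 * ||X||_2^2.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)   # gradient of the squared-error term
        z = beta - step * grad              # gradient step
        # Soft-thresholding: the proximal operator of the L1 penalty.
        # This is what sets small coefficients exactly to zero (sparsity).
        beta = np.sign(z) * np.maximum(np.abs(z) - step * gamma, 0.0)
    return beta
```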
Example 2.5: Regularization in a linear regression RBF model
Consider the problem of learning a linear regression model with p = 8 RBF kernels as features from n = 9 data points. Since p = n − 1, we expect the model to fit the data perfectly. Indeed it does, as panel (a) below shows, but the model overfits: it adapts too much to the data and behaves erratically between the data points.
Ridge regression and LASSO are both remedies for this. Even though the final models obtained with each method may look similar, their parameters differ: LASSO effectively uses only 5 of the 8 radial basis functions, i.e., a sparse solution. Which approach is preferable depends on the problem at hand.
(a) Model learned with least squares. The model fits the data perfectly, but it overfits: it behaves implausibly both between the data points and beyond the data range, and the parameter values \( \hat{\beta} \) are extreme (roughly 30 and −30). We should rather look for a model that generalizes better than it adapts to this particular dataset.
(b) The same model, this time learned with ridge regression (2.28) for a certain value of γ. The model no longer fits the training data perfectly, but it strikes a more sensible balance between fitting the data and avoiding overfitting, and the parameter values \( \hat{\beta} \) are now all in the range −0.5 to 0.5.
(c) The same model again, this time learned with LASSO (2.31) for a certain value of γ. Like (b), this model compromises between fitting the training data and avoiding overfitting, and is probably more useful than (a) in most situations. Here, three of the nine parameters are exactly zero, and the rest are in the range −1 to 1.
General cost function regularization
Ridge regression and LASSO are the two most popular ways to regularize the linear regression problem, and both are special cases of a more general regularization scheme: minimize, with respect to β, a criterion of the form

\( V(\beta, X, y) + \gamma R(\beta). \)  (2.32)
The general cost function (2.32) contains three elements: (i) a term V(β, X, y) measuring how well the model fits the training data, (ii) a penalty term R(β) that penalizes model complexity, in particular large parameter values, and (iii) a trade-off parameter γ balancing the two.
2.7 Further reading
Linear regression has been used for over 200 years; the least squares method was introduced independently by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809. Its importance is reflected by its presence in most statistics and machine learning textbooks, including Bishop (2006), Gelman et al. (2013), and Hastie, Tibshirani, and Friedman (2009).
While the basic least squares technique has a long history, its regularized versions are more recent. Ridge regression was introduced independently in statistics by Hoerl and Kennard (1970), and in numerical analysis under the name Tikhonov regularization. The LASSO was introduced by Tibshirani (1996) and has attracted much attention in recent years; a comprehensive treatment of sparse models and the LASSO is given in the monograph by Hastie, Tibshirani, and Wainwright (2015).
2.A Derivation of the normal equations
The normal equations (2.17),

\( X^T X \hat{\beta} = X^T y, \)

can be derived from (2.16),

\( \hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2, \)

in different ways. We will present one derivation based on (matrix) calculus and one based on geometry and linear algebra.
No matter how (2.17) is derived, if \( X^T X \) is invertible it (uniquely) gives

\( \hat{\beta} = (X^T X)^{-1} X^T y. \)
If \( X^T X \) is not invertible, then (2.17) has infinitely many solutions \( \hat{\beta} \), which are all equally good solutions to the problem (2.16).
For the calculus derivation, write the least squares cost as

\( V(\beta) = \|X\beta - y\|_2^2 = (X\beta - y)^T (X\beta - y) = y^T y - 2 y^T X \beta + \beta^T X^T X \beta, \)  (2.33)

and differentiate V(β) with respect to the vector β,

\( \frac{\partial}{\partial \beta} V(\beta) = -2 X^T y + 2 X^T X \beta. \)  (2.34)

Since V(β) is a positive quadratic form, its minimum must be attained at \( \frac{\partial}{\partial \beta} V(\beta) = 0 \), which characterizes the solution \( \hat{\beta} \) as

\( X^T X \hat{\beta} = X^T y, \)  (2.35)

i.e., the normal equations (2.17).
For the geometric derivation: to minimize \( \|X\beta - y\|_2^2 \), we should choose β such that Xβ is the orthogonal projection of y onto the subspace of \( \mathbb{R}^n \) spanned by the columns \( c_j \) of X, and this orthogonal projection is characterized by the normal equations.
To see this, decompose y as \( y = y_\perp + y_\parallel \), where \( y_\perp \) is orthogonal to the subspace spanned by all columns \( c_j \), and \( y_\parallel \) lies in that subspace. Since \( y_\perp \) is orthogonal to both \( y_\parallel \) and Xβ, the Pythagorean theorem gives

\( \|X\beta - y\|_2^2 = \|X\beta - y_\parallel\|_2^2 + \|y_\perp\|_2^2 \geq \|y_\perp\|_2^2. \)
The minimum of the criterion \( \|X\beta - y\|_2^2 \) is thus attained when \( X\beta = y_\parallel \). Consequently, the optimal \( \hat{\beta} \) is such that the residual \( X\hat{\beta} - y = -y_\perp \) is orthogonal to the subspace spanned by all columns \( c_j \)
(remember that two vectors u, v are, by definition, orthogonal if their scalar product \( u^T v \) is 0). Since the columns \( c_j \) together form the matrix X, we can write this compactly as
\( (y - X\hat{\beta})^T X = 0, \)  (2.39)

where the right hand side is the (p+1)-dimensional zero vector. This can equivalently be written as

\( X^T X \hat{\beta} = X^T y, \)  (2.40)

i.e., the normal equations (2.17).
3 The classification problem and three parametric classifiers
In this chapter we study the classification problem, in which the output is qualitative, in contrast to the quantitative output of the regression problem. A method for classification is called a classifier. Our first classifier is logistic regression; we also introduce linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) in this chapter. More advanced classifiers, including classification trees, boosting, and deep learning, follow in later chapters.
3.1 The classification problem
Classification is about predicting a qualitative output from inputs of any type, quantitative or qualitative. The output can only take values in a finite set; we let K denote the number of possible output classes (or labels), for instance K = 2 for {false, true} or K = 4 for {Sweden, Norway, Finland, Denmark}. We assume throughout that K is known, and for notational convenience we use the integers 1, 2, …, K to denote the classes 1. The integer labeling does not imply any order among the classes.
An important special case is binary classification, where there are only two classes, K = 2. In that case we instead use the labels 0 and 1, speaking of class k = 1 as the positive class and class k = 0 as the negative class. This choice of labels is made purely for mathematical convenience.
In classification we predict the output from the input by modeling the class probabilities p(y | x), where y is the output class (1, 2, …, or K) and x is the input. Here p(y | x) is a probability mass function over the qualitative y, conditional on the (possibly quantitative) x: it describes the probability of the class label y given the input x. Treating the class label y as a random variable in this way captures the randomness present in the real-world data from which our models are learned.
1 In Chapter 6 we will use k = 1 and k = −1 instead.
Example 3.1: Modeling voting behavior — randomness in the class label y
Voting preferences differ within any population group: not all individuals in a specific demographic vote for the same political party. Mathematically, we can describe the vote as a random variable with a certain probability distribution. Say that, among 45 year old women, 13% vote for the cerise party, 39% for the turquoise party and 48% for the purple party. We write this as p(y = cerise party | 45 year old woman) = 0.13, p(y = turquoise party | 45 year old woman) = 0.39 and p(y = purple party | 45 year old woman) = 0.48.
In this way, we use probabilities p(y | x) to describe the non-trivial fact that
(a) not all 45 year old women vote for the same party, but
(b) the purple party is the most popular and the cerise party the least popular among 45 year old women.
The number of output classes in this example is K = 3.
3.2 Logistic regression
Learning the logistic regression model from training data
The logistic function is what turns linear regression into logistic regression, i.e., from a regression method into a classifier. A consequence, however, is that the convenient normal equations no longer apply for estimating β in logistic regression. As in linear regression, we learn β from the training data \( \mathcal{T} = \{(x_i, y_i)\}_{i=1}^n \) by the maximum likelihood method,

\( \hat{\beta} = \arg\max_{\beta} \; \ell(\beta). \)
Let us now work out a detailed expression for the likelihood function 2,
To maximize with respect to β it is, for numerical reasons, preferable to work with the logarithm of ℓ(β). Since the logarithm is a monotone function, the maximizing argument is unchanged, and we maximize

\( \log \ell(\beta) = \sum_{i: y_i = 1} \beta^T x_i - \sum_{i=1}^{n} \log\!\left( 1 + e^{\beta^T x_i} \right). \)
The simplification in the second equality relies on the chosen labeling, that \( y_i = 0 \) or \( y_i = 1 \), which is indeed the reason why this labeling is convenient.
A necessary condition for the maximum of log ℓ(β) is that its gradient is zero,

\( \nabla_{\beta} \log \ell(\beta) = \sum_{i=1}^{n} \left( y_i - \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} \right) x_i = 0. \)
2 We now add β to the expression p(y | x), writing p(y | x; β), to explicitly show its dependence also on β.
This is a vector-valued system of p + 1 equations in the p + 1 unknown elements of β. Unlike the linear regression model with Gaussian noise in Section 2.3.1, this maximum likelihood problem yields a nonlinear system of equations with no closed-form solution in general, so a numerical solver is needed, as outlined in Appendix B. The standard choice is the Newton–Raphson algorithm, which for this problem is equivalent to the iteratively reweighted least squares algorithm (Hastie, Tibshirani, and Friedman 2009, Chapter 4.4).
Algorithm 1: Logistic regression for binary classification
Data: Training data \( \{x_i, y_i\}_{i=1}^n \) (with output classes y = 0, 1) and test input \( x_\star \)
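A minimal sketch of how the learning step of Algorithm 1 might look with Newton–Raphson (IRLS), assuming labels 0/1 and a design matrix X whose first column is the constant 1; a practical implementation would add step-size control and a convergence check.

```python
import numpy as np

def logistic_fit(X, y, n_iter=20):
    """Maximum likelihood for binary logistic regression via Newton-Raphson (IRLS).
    X is n x (p+1) including a constant column; y contains labels 0/1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))    # modeled p(y = 1 | x_i; beta) for all i
        grad = X.T @ (y - mu)                   # gradient of log-likelihood (zero at optimum)
        W = mu * (1.0 - mu)                     # weights of the reweighted least squares
        H = -(X * W[:, None]).T @ X             # Hessian of log-likelihood
        beta = beta - np.linalg.solve(H, grad)  # Newton update
    return beta

# Prediction for a test input x_star (with leading 1):
# p1 = 1 / (1 + np.exp(-x_star @ beta)); predict class 1 if p1 > 0.5.
```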
Decision boundaries for logistic regression
Logistic regression models the class probabilities p(y = 0 | x) and p(y = 1 | x). To make a prediction for a test input \( x_\star \), we proceed in three steps: learn β from the training data; compute \( p(y = 0 \mid x_\star) \) and \( p(y = 1 \mid x_\star) \); and let the prediction be the most probable class,

\( \hat{y}_\star = \arg\max_{k = 0, 1} \; p(y = k \mid x_\star). \)
This is illustrated in Figure 3.2 for a one-dimensional input x.
A classifier maps every possible test input x to a prediction \( \hat{y} \). It thereby partitions the input space into regions with the same prediction; the curve that separates regions with different class predictions is called the decision boundary. The decision boundary of logistic regression is shown in Figure 3.2 for one-dimensional input and in Figure 3.3 for two-dimensional input.
We can find the decision boundary by solving the equation

\( p(y = 1 \mid x) = p(y = 0 \mid x), \)  (3.11)

which with logistic regression gives

\( \frac{e^{\beta^T x}}{1 + e^{\beta^T x}} = \frac{1}{1 + e^{\beta^T x}} \;\Leftrightarrow\; e^{\beta^T x} = 1 \;\Leftrightarrow\; \beta^T x = 0. \)  (3.12)

The equation \( \beta^T x = 0 \) parameterizes a (linear) hyperplane. Hence, the decision boundaries in logistic regression always have the shape of a (linear) hyperplane.
We distinguish between different types of classifiers by the shape of their decision boundary: since logistic regression only has linear decision boundaries, it is consequently called a linear classifier.
Figure 3.2: Logistic regression for binary classification (output 0 or 1) with scalar input x. Once β is learned from training data, the model gives p(y = 1 | x) and p(y = 0 | x) for any test input. The prediction is the class with the highest probability, and the decision boundary is the point where the prediction changes from one class to the other.
(a) Logistic regression for K = 2 classes always gives a linear decision boundary. The red dots and green circles are training data from the two classes, and the border between the red and green regions is the decision boundary learned by the logistic regression classifier from the training data.
(b) Logistic regression for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of two classes is still linear.
Figure 3.3: Examples of decision boundaries for logistic regression.
Logistic regression for more than two classes
Logistic regression can be generalized to the multi-class problem, where the number of classes K is greater than two. The generalization involves two steps: one-hot encoding of the classes, and replacing the logistic function with the softmax function. Both constructions reappear later in deep learning.
One-hot encoding is a way to represent a categorical output. Instead of letting y be an integer, one-hot encoding represents y by a K-dimensional vector with a 1 in the position corresponding to the class and 0 in all other positions. For K = 3 classes, the encoding is:
Vanilla encoding        One-hot encoding
y_i = 1                 y_i = [1 0 0]^T
y_i = 2                 y_i = [0 1 0]^T
y_i = 3                 y_i = [0 0 1]^T
Since we now have a vector-valued output y, we also need a vector-valued alternative to the logistic function. To this end, we introduce the vector-valued softmax function

\( \operatorname{softmax}(z) = \frac{1}{\sum_{j=1}^{K} e^{z_j}} \left[ e^{z_1},\; e^{z_2},\; \ldots,\; e^{z_K} \right]^T. \)
Here z = [z1, z2, …, zK]^T is the K-dimensional input vector. The softmax function has two key properties: its output vector always sums to 1, and each element is in the interval [0, 1]. In analogy with combining linear regression and the logistic function for binary classification, we now combine linear regression with the softmax function to model the class probabilities.
In multi-class logistic regression we use K parameter vectors β1, β2, …, βK, one per class (the number of parameters to learn thus grows with K), which we collect in θ. As in binary logistic regression, we learn the parameters by maximum likelihood. With one-hot encoding, the log-likelihood can be written

\( \log \ell(\theta) = \log p(y \mid X; \theta) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log p(k \mid x_i; \theta), \)  (3.16)

where the \( y_{ik} \) are the elements of the one-hot encoding vectors. As in the binary case, this can be used as an objective function for numerical optimization. The particular form of (3.16) is often referred to as the cross-entropy, and is commonly encountered when one-hot encoding is used.
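A small sketch of one-hot encoding, the softmax function, and the (negated) summand of the likelihood (3.16), i.e., the cross-entropy loss for one data point; the logits below are hypothetical.

```python
import numpy as np

def softmax(z):
    """Vector-valued softmax: elements in [0, 1] that sum to one."""
    e = np.exp(z - np.max(z))   # subtracting the max improves numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, probs):
    """Negated summand of (3.16) for a single data point."""
    return -np.sum(y_onehot * np.log(probs))

# Hypothetical logits z_k = beta_k^T x_i for K = 3 classes:
z = np.array([2.0, 0.5, -1.0])
probs = softmax(z)              # modeled p(k | x_i; theta), k = 1, 2, 3
y = np.array([1, 0, 0])         # one-hot encoding of y_i = 1
loss = cross_entropy(y, probs)  # contribution of this data point to -log l(theta)
```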
3.3 Linear and quadratic discriminant analysis (LDA & QDA)
Using Gaussian approximations in Bayes’ theorem
From probability theory, Bayes' theorem might be familiar, which says that

\( p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}. \)  (3.17)
In classification our interest is in p(y | x). In a machine learning problem, however, we do not have explicit expressions for either p(y | x) or its counterpart p(x | y); we only have the training data. Whereas logistic regression models p(y | x) directly, LDA 3 and QDA take a detour via Bayes' theorem and assume that p(x | y) is a Gaussian distribution 4, whether or not the data actually supports this. Since p(x | y) is a distribution over the input x, which typically has more than one dimension, it is modeled as a multivariate Gaussian, characterized by a mean vector and a covariance matrix.
To turn this into a classifier, we learn the parameters of the Gaussian distributions, the mean vector μ and the covariance matrix Σ, from the training data. In LDA, the mean vector is different for each class but the covariance matrix is common to all classes; in QDA, both the mean vector and the covariance matrix are specific to each class. We write \( \hat{\mu} \) and \( \hat{\Sigma} \) (with hats) for the values learned from data. Bayes' theorem also involves p(y), the probability of a random data point belonging to class y before its input x is seen. We estimate p(y = k) by the frequency \( \hat{\pi}_k \) of class k in the training data; for example, if 22% of the training data is labeled as class 1, we take \( \hat{\pi}_1 = 0.22 \).
Thus, in LDA p(y | x) is modeled as

\( p(y = k \mid x_\star) = \frac{\hat{\pi}_k \, \mathcal{N}\!\left( x_\star \mid \hat{\mu}_k, \hat{\Sigma} \right)}{\sum_{j=1}^{K} \hat{\pi}_j \, \mathcal{N}\!\left( x_\star \mid \hat{\mu}_j, \hat{\Sigma} \right)} \)  (3.18)

for k = 1, 2, …, K. This is what is shown in Figure 3.4.
3 Not to be confused with Latent Dirichlet Allocation, which is a completely different machine learning method.
4 TODO: This actually assumes that x is quantitative. How are qualitative inputs handled in LDA/QDA?
(a) In LDA, the input x is assumed, for each output class y, to have a Gaussian distribution with a class-specific mean but a covariance common to all classes. The plot shows the level curves of these distributions together with the training data, i.e., how we assume the training data to be distributed when deriving LDA.
(b) In QDA, the input x is likewise assumed Gaussian for each class, but now both the mean and the covariance differ between the classes. The plot shows the corresponding assumption on the training data used when deriving QDA.
Figure 3.4: LDA and QDA rest on the assumption that p(x | y) is Gaussian, i.e., the input x is viewed as a random variable with a certain distribution for each class. In LDA the covariance of the input distribution is the same for all classes, which differ only in the location of their means; in QDA the covariances also differ between the classes, so the level curves have different shapes for each class.
In practical applications of LDA and QDA, these assumptions on the input distribution are rarely met exactly, but they motivate the methods.
In full analogy, for QDA we have (the only difference is the covariance matrix \( \hat{\Sigma}_k \))

\( p(y = k \mid x_\star) = \frac{\hat{\pi}_k \, \mathcal{N}\!\left( x_\star \mid \hat{\mu}_k, \hat{\Sigma}_k \right)}{\sum_{j=1}^{K} \hat{\pi}_j \, \mathcal{N}\!\left( x_\star \mid \hat{\mu}_j, \hat{\Sigma}_j \right)}. \)  (3.19)
Note that we have not imposed any restrictions on K, so LDA and QDA can be used for binary (K = 2) as well as multi-class (K > 2) classification. In the following sections, we explore various aspects of these methods in greater detail.
Using LDA and QDA in practice
We motivated LDA and QDA via Bayes' theorem and the assumption that p(x | y) is Gaussian. Even though this assumption rarely holds exactly in practice, LDA and QDA are useful classifiers also when the Gaussian condition is not met.
To learn an LDA or QDA classifier from training data {x_i, y_i}, i = 1, …, n, without knowing the true p(x | y), we first estimate the required parameters from the training set: the class frequencies and the mean and covariance of each class. With these estimates, (3.18) or (3.19) gives class probabilities for any new observation, and we predict the class with the highest probability.
Learning LDA or QDA thus amounts to estimating \( \hat{\pi}_k \), \( \hat{\mu}_k \), and \( \hat{\Sigma} \) (for LDA) or \( \hat{\Sigma}_k \) (for QDA), for each class k = 1, …, K. The relative class frequency \( \hat{\pi}_k \) is particularly simple to estimate:

\( \hat{\pi}_k = \frac{n_k}{n}, \)  (3.20a)
where \( n_k \) is the number of training data samples in class k. Consequently, all \( n_k \) must sum to n, and thereby \( \sum_k \hat{\pi}_k = 1 \). Further, the mean vector \( \mu_k \) of each class is learned as

\( \hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i, \)  (3.20b)

the empirical mean among all training samples of class k. For LDA, the common covariance matrix Σ for all classes is usually learned as

\( \hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T, \)  (3.20c)

which can be shown to be an unbiased estimate of the covariance matrix 5. For QDA, one covariance matrix \( \Sigma_k \) has to be learned for each class k = 1, …, K, usually as

\( \hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T, \)  (3.20d)

which similarly can be shown to be an unbiased estimate.
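A sketch of the estimation step (3.20), under the assumption that the class labels are stored as integers 0, …, K−1 in an array y (the text uses 1, …, K):

```python
import numpy as np

def learn_gaussian_params(X, y, K):
    """Estimate pi_k (3.20a), mu_k (3.20b) and the per-class covariances
    Sigma_k (3.20d); the shared LDA covariance (3.20c) is formed from the
    same per-class sums. X is n x p; y holds integer labels 0, ..., K-1."""
    n, p = X.shape
    pi, mu, Sigma = [], [], []
    for k in range(K):
        Xk = X[y == k]
        nk = len(Xk)
        pi.append(nk / n)                 # (3.20a): relative class frequency
        mu.append(Xk.mean(axis=0))        # (3.20b): empirical class mean
        D = Xk - mu[-1]
        Sigma.append(D.T @ D / (nk - 1))  # (3.20d): per-class covariance for QDA
    # (3.20c): shared covariance for LDA, a weighted combination of the class sums.
    Sigma_lda = sum((np.sum(y == k) - 1) * Sigma[k] for k in range(K)) / (n - K)
    return np.array(pi), np.array(mu), np.array(Sigma), Sigma_lda
```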
Remark 3.3 To derive the learning of LDA and QDA, we did not make use of the maximum likelihood idea, in contrast to linear and logistic regression. Furthermore, learning LDA and QDA amounts to inserting the training data into the closed-form expressions (3.20), similar to linear regression (the normal equations), but different from logistic regression (which requires numerical optimization).
Once the parameters \( \hat{\pi}_k, \hat{\mu}_k, \hat{\Sigma} \) (or \( \hat{\Sigma}_k \)) are learned for all classes k = 1, …, K, we have a model for p(y | x) and can use it to predict the output for a test input \( x_\star \). As for logistic regression, we turn \( p(y \mid x_\star) \) into a concrete prediction by taking the most probable class,

\( \hat{y}_\star = \arg\max_{k} \; p(y = k \mid x_\star). \)
We summarize this in Algorithms 2 and 3, and illustrate it in Figures 3.5 and 3.6.
Algorithm 2: Linear Discriminant Analysis, LDA
Data: Training data \( \{x_i, y_i\}_{i=1}^n \) (with output classes k = 1, …, K) and test input \( x_\star \)
…
8 Find the largest \( p(y = k \mid x_\star) \) and set \( \hat{y}_\star \) to that k
5 This means that if we were to estimate \( \hat{\Sigma} \) like this for new training data over and over again, the average would be the true covariance matrix of p(x).
Algorithm 3: Quadratic Discriminant Analysis, QDA
Data: Training data \( \{x_i, y_i\}_{i=1}^n \) (with output classes k = 1, …, K) and test input \( x_\star \)
…
7 Find the largest \( p(y = k \mid x_\star) \) and set \( \hat{y}_\star \) to that k
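And a sketch of the prediction step of Algorithms 2 and 3, evaluating (3.18) or (3.19) with scipy's multivariate normal density; `predict_qda` is a hypothetical helper for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_qda(x_star, pi, mu, Sigma):
    """Class probabilities (3.19) and prediction for one test input x_star.
    For LDA (3.18), pass the same shared covariance matrix for every class."""
    K = len(pi)
    unnorm = np.array([
        pi[k] * multivariate_normal.pdf(x_star, mean=mu[k], cov=Sigma[k])
        for k in range(K)
    ])
    probs = unnorm / unnorm.sum()       # normalization, as in Bayes' theorem (3.17)
    return probs, np.argmax(probs)      # predicted class index (0-based)
```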
Figure 3.5: LDA for K = 3 classes with one-dimensional input (p = 1). The upper left panel shows the learned Gaussian model of p(x | k), with parameters \( \hat{\mu}_k \) and \( \hat{\Sigma} \) estimated from training data; with p = 1, \( \hat{\Sigma} \) is a scalar variance, common to all classes. The upper right panel shows \( \hat{\pi}_k \), the approximation of p(k). Via Bayes' theorem these are combined into p(k | x), shown in the bottom panel; the final prediction is the class with the highest probability, i.e., the top solid colored line. The decision boundaries, vertical dotted lines in the bottom plot, are where the solid colored lines intersect.
Figure 3.6: QDA for K = 3 classes, analogously to Figure 3.5. The key difference from LDA is that in QDA the variance \( \hat{\Sigma}_k \) of p(x | k) is different for each class k, which gives more complicated decision boundaries; note in particular the narrow region of \( \hat{y} = 3 \) (blue) between \( \hat{y} = 1 \) (red) and \( \hat{y} = 2 \) (green) around −0.5.
Decision boundaries for LDA and QDA
With the parameters learned from training data, we compute the prediction for a test input x by evaluating (3.18) or (3.19) for each class k and taking the class for which p(y | x) is maximal. These expressions are simple enough that we can work out with pen and paper where the decision boundary lies, i.e., the set in the input space where the prediction changes from one class to another.
For LDA, the prediction is the maximizing argument

\( \hat{y}_{\mathrm{LDA}} = \arg\max_{k} \; p(y = k \mid x) = \arg\max_{k} \; \log p(y = k \mid x) = \arg\max_{k} \left[ \log \hat{\pi}_k + \log \mathcal{N}\!\left( x \mid \hat{\mu}_k, \hat{\Sigma} \right) \right], \)

where we use that neither the logarithm (a monotone function) nor terms independent of k change the location of the maximizing argument. Expanding the Gaussian log-density and dropping all terms independent of k gives

\( \hat{y}_{\mathrm{LDA}} = \arg\max_{k} \left[ \log \hat{\pi}_k - \tfrac{1}{2} (x - \hat{\mu}_k)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_k) \right] = \arg\max_{k} \; \underbrace{\left[ \log \hat{\pi}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \tfrac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k \right]}_{\delta_k^{\mathrm{LDA}}(x)}, \)

where the quadratic term \( x^T \hat{\Sigma}^{-1} x \), being independent of k, has also been dropped.
The function \( \delta_k^{\mathrm{LDA}}(x) \) is called the discriminant function. The decision boundary between two class predictions, say k = 0 and k = 1, is the set of points x for which \( \delta_0^{\mathrm{LDA}}(x) = \delta_1^{\mathrm{LDA}}(x) \), i.e.,

\( \log \hat{\pi}_0 + x^T \hat{\Sigma}^{-1} \hat{\mu}_0 - \tfrac{1}{2} \hat{\mu}_0^T \hat{\Sigma}^{-1} \hat{\mu}_0 = \log \hat{\pi}_1 + x^T \hat{\Sigma}^{-1} \hat{\mu}_1 - \tfrac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 \;\Leftrightarrow\; x^T \hat{\Sigma}^{-1} (\hat{\mu}_0 - \hat{\mu}_1) = \log \hat{\pi}_1 - \log \hat{\pi}_0 - \tfrac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \tfrac{1}{2} \hat{\mu}_0^T \hat{\Sigma}^{-1} \hat{\mu}_0. \)
From linear algebra we recognize the set \( \{x : x^T A = c\} \), for a constant vector A and a constant c, as a hyperplane in the x-space. The decision boundaries in LDA are thus always linear, hence the name linear discriminant analysis.
For QDA we can do a similar derivation,

\( \hat{y}_{\mathrm{QDA}} = \arg\max_{k} \; \underbrace{\left[ \log \hat{\pi}_k - \tfrac{1}{2} \log \det \hat{\Sigma}_k - \tfrac{1}{2} (x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) \right]}_{\delta_k^{\mathrm{QDA}}(x)}, \)  (3.25)

and set \( \delta_0^{\mathrm{QDA}}(x) = \delta_1^{\mathrm{QDA}}(x) \) to find the decision boundary as the set of points x for which

\( \log \hat{\pi}_0 - \tfrac{1}{2} \log \det \hat{\Sigma}_0 - \tfrac{1}{2} (x - \hat{\mu}_0)^T \hat{\Sigma}_0^{-1} (x - \hat{\mu}_0) = \log \hat{\pi}_1 - \tfrac{1}{2} \log \det \hat{\Sigma}_1 - \tfrac{1}{2} (x - \hat{\mu}_1)^T \hat{\Sigma}_1^{-1} (x - \hat{\mu}_1). \)

Since the covariances differ between the classes, the quadratic terms no longer cancel: this is now on the format \( \{x : x^T A + x^T B x = c\} \), a quadratic form, and the decision boundary for QDA is thus always quadratic (and thereby also nonlinear!), which is the reason for the name quadratic discriminant analysis.
(a) LDA with K = 2 classes always gives a linear decision boundary. The red dots and green circles are training data from the two classes, and the border between the red and green regions is the decision boundary learned by the LDA classifier from the training data.
(b) LDA for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of classes is still linear.
(c) QDA has quadratic (i.e., nonlinear) decision boundaries, as in this example where a QDA classifier is learned from the shown training data.
(d) With K = 3 classes, the decision boundaries for QDA are possibly more complex than with LDA, as in this case (cf. (b)).
Figure 3.7: Decision boundaries for LDA and QDA, which can be compared to the logistic regression decision boundaries in Figure 3.3. Both LDA and logistic regression have linear decision boundaries, but they are in general not identical.
(a) With K = 2 classes, Bayes' classifier tells us to take the class which has probability > 0.5 as the prediction \( \hat{y} \). Here, the prediction would therefore be \( \hat{y} = 1 \).

(b) With K = 4 classes, Bayes' classifier takes the class with the highest probability as the prediction \( \hat{y} \); here \( \hat{y} = 4 \). In contrast to K = 2 classes, it may now happen that no class has probability > 0.5.

Figure 3.8: Bayes' classifier predicts the most probable class under p(y | x).
3.4 Bayes' classifier — a theoretical justification for turning p(y | x) into \( \hat{y} \)
Bayes’ classifier
When designing a classifier, a natural goal is to make as few misclassifications as possible, i.e., to have the predicted label \( \hat{y} \) equal the true label y for as many test data points as possible. If we knew the probabilities p(y | x) exactly, the optimal classifier would be to predict the most probable label,

\( \hat{y} = \arg\max_{k} \; p(y = k \mid x). \)  (3.27)

In practice, classifiers such as logistic regression, LDA, and QDA only provide approximations of p(y | x), not the exact values.
The classifier that predicts the label with the highest probability under p(y | x) is known as the Bayes' classifier, as illustrated in Figure 3.8. It is optimal in the sense of minimizing misclassifications; we first show this optimality, and then explore its relationship with other classifiers.
Optimality of Bayes’ classifier
We want to choose the prediction ŷ such that the probability of it being equal to the random variable y is maximized. That is, we want to maximize the expected value over the distribution of y, E_{y∼p(y|x)}[I{ŷ = y}], where the indicator function I{·} is one when its argument is true and zero otherwise. By the definition of expected value,

    E_{y∼p(y|x)}[I{ŷ = y}] = Σ_{k=1}^{K} I{ŷ = k} p(y = k | x) = p(ŷ | x),

where the last equality holds since all terms but one are zero. To maximize this expression we should thus choose ŷ = arg max_k p(y = k | x), in agreement with our earlier claim (3.27).
Bayes’ classifier in practice: useless, but a source of inspiration
Bayes' classifier makes use of the conditional distribution p(y | x). In a scenario where p(y | x) is known, Bayes' classifier is the optimal choice and no other methods are needed. In most machine learning applications, however, p(y | x) is unknown. Indeed, this is the very essence of machine learning: our only knowledge of how y relates to x comes from the training data.
Nevertheless, Bayes' classifier remains a valuable concept, since many classifiers can be viewed as different approximations of it: in essence, these methods estimate the conditional probability p(y | x) from the training data.⁶ Although not all methods have been introduced yet, we give a brief overview of how some classifiers connect to this idea:
• In binary logistic regression, p(y | x) is modeled as p(y = 1 | x) = e^{θ^T x} / (1 + e^{θ^T x}).
• In linear and quadratic discriminant analysis (LDA and QDA), p(y | x) is computed via Bayes’ theorem, where p(x | y) is modeled as a Gaussian distribution whose mean and covariance are learned from the training data, and p(y) is the empirical class distribution in the training data.
• In k-nearest neighbors (k-NN), p(y | x) is modeled as the empirical distribution among the k nearest samples in the training data.
• In tree-based methods, p(y | x) is modeled as the empirical distribution among the training data samples in the same leaf node.
• In deep learning, p(y | x) is modeled using a deep neural network and a softmax function.
These classifiers thus model and approximate p(y | x) in different ways. The standard approach is then to predict as Bayes’ classifier would: take the class y with the highest probability p(y | x). For binary classification this amounts to selecting the class with probability greater than 0.5.
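As a concrete illustration, here is a minimal Python sketch of this argmax step; the probability matrix probs is hypothetical and stands in for the output of any of the classifiers above.

    import numpy as np

    # Hypothetical estimates of p(y = k | x) for five test points and K = 3
    # classes; in practice these would come from, e.g., logistic regression,
    # LDA/QDA, k-NN or a neural network with a softmax output.
    probs = np.array([[0.2, 0.5, 0.3],
                      [0.1, 0.1, 0.8],
                      [0.4, 0.35, 0.25],
                      [0.33, 0.33, 0.34],
                      [0.6, 0.2, 0.2]])

    # Predict as Bayes' classifier would: take the most probable class.
    y_hat = np.argmax(probs, axis=1) + 1  # +1 so classes are labeled 1, ..., K
    print(y_hat)  # [2 3 1 3 1]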
Is it always good to predict according to Bayes’ classifier?
Even though Bayes' classifier is usually not available in practice (we only have data, not explicit knowledge of p(y | x)), it can still guide our predictions by inspiring how we approximate p(y | x), and throughout this chapter we have implicitly relied on it. This does not necessarily mean, however, that we should always take the prediction ŷ to be the class with the highest assigned probability. It is a reasonable starting point, but there are additional aspects to consider before settling on a final decision.
Bayes’ classifier is optimal when the goal is to minimize the number of misclassifications. In some scenarios this goal is too simplistic, for example when predicting a patient's health status: incorrectly classifying a sick patient as 'well' may have far more severe consequences than incorrectly classifying a healthy patient as 'sick', or vice versa. The classification goal is then asymmetric, and Bayes’ classifier is no longer the right tool for the task.
Furthermore, Bayes' classifier is optimal only if the conditional probability p(y | x) is known exactly. When we only have an approximation of p(y | x), predicting according to (3.27) is no longer guaranteed to be the best thing to do.
6 Sometimes this is not very explicit in the method, but if you look carefully, you will find it.
More on classification and classifiers
Linear and nonlinear classifiers
Recall that linear regression refers to a regression model that is linear in its parameters. For classification, 'linear' instead refers to classifiers with linear decision boundaries, while nonlinear classifiers can have more complicated boundaries. Among the classifiers discussed so far, logistic regression and LDA are linear classifiers, whereas QDA is a nonlinear classifier. Note that even though logistic regression and LDA are both linear classifiers, their decision boundaries are not identical, since the models are learned differently. The classifiers introduced in later chapters are nonlinear, with the exception of the decision stump.
As with linear regression, a linear classifier can be extended with nonlinear transformations of the inputs, which allows it to realize complicated decision boundaries. This, however, requires the transformations to be manually crafted and chosen. A more systematic and automatic way of obtaining a sophisticated classifier from a simple one is boosting, discussed in Chapter 6.
Regularization
Just like linear regression, classification models can suffer from overfitting, especially when the number of training samples is not much larger than the number of inputs. Regularization can then be used: for logistic regression, a ridge-regression-like penalty can be added to the coefficients, and for LDA and QDA the estimation of the covariance matrix can be regularized. Overfitting is discussed further in Chapter 5.
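As a sketch of how such regularization can look in practice, the following uses scikit-learn's ridge-penalized logistic regression on synthetic data; the data and the penalty strength C = 0.1 are arbitrary choices for illustration only.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 50))  # n barely larger than p: overfitting risk
    y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

    # penalty="l2" adds a ridge-like penalty on the coefficients; C is the
    # inverse penalty strength, so a smaller C means stronger regularization.
    model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
    print(np.abs(model.coef_).max())  # coefficients are shrunk toward zero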
Evaluating binary classifiers
An important special case of classification is binary classification (K = 2) used for detecting the presence of something: a disease, an object on a radar, and so on. By convention, y = 1 indicates presence ('positive') and y = 0 absence ('negative'). Such detection problems have some characteristic traits.
Often, the vast majority of the data points belong to class y = 0. A classifier that always predicts 0 can then achieve high accuracy: a medical support system that always declares the patient 'healthy' may look accurate simply because most patients are healthy. Despite its high accuracy, such a system is useless and provides no support for medical decision-making.
Moreover, a missed detection (predicting ŷ = 0 when in fact y = 1) might have much more severe consequences than a false detection (predicting ŷ = 1 when in fact y = 0).
For such classification problems, there is a set of analysis tools and terminology which we will introduce now.
    Ratio    Common names
    FP/N     False positive rate, fall-out, probability of false alarm
    TN/N     True negative rate, specificity, selectivity
    TP/P     True positive rate, sensitivity, power, recall, probability of detection
    FN/P     False negative rate, miss rate
    TP/P*    Positive predictive value, precision
Table 3.1: Common terminology related to the quantities (TN, FN, FP, TP) in the confusion matrix. Here P = TP + FN is the number of positive test data points, N = TN + FP the number of negative test data points, and P* = TP + FP the number of positive predictions.
A confusion matrix is a simple way to visualize the performance of a binary classifier on a test data set. It splits the test data points into four groups, depending on the actual output y and the predicted output ŷ, and arranges their counts as

                       y = 0                     y = 1
    ŷ = 0              True negatives (TN)       False negatives (FN)
    ŷ = 1              False positives (FP)      True positives (TP)
In the example below, TN, FN, FP and TP are replaced by the actual counts from a test data set. A list of common terminology associated with the confusion matrix is given in Table 3.1.
The confusion matrix gives a concise and informative summary of a classifier's performance. It is important to distinguish between false positives (FP, also called type I errors) and false negatives (FN, type II errors), since their consequences can be very different depending on the application. Ideally both FP and FN are zero, but this is rarely achieved in practice.
Inspired by Bayes’ classifier, we typically turn the estimated probability p(y = 1 | x) into a prediction using a threshold t,

    ŷ = 1 if p(y = 1 | x) ≥ t,  and  ŷ = 0 if p(y = 1 | x) < t,    (3.30)

with the default choice t = 0.5. If we want fewer false positives we can increase t, accepting that this may increase the number of false negatives; conversely, lowering t decreases the false negatives at the cost of more false positives.
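A minimal sketch of this thresholding rule (3.30), for made-up probability values:

    import numpy as np

    def predict_with_threshold(p1, t=0.5):
        """Turn estimated probabilities p(y = 1 | x) into 0/1 predictions."""
        return (np.asarray(p1) >= t).astype(int)

    p1 = np.array([0.05, 0.40, 0.55, 0.90])  # hypothetical p(y = 1 | x) values
    print(predict_with_threshold(p1, t=0.5))   # [0 0 1 1]
    print(predict_with_threshold(p1, t=0.15))  # [0 1 1 1]: fewer FN, more FP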
Tuning the threshold t in (3.30) is thus important in binary classification problems. To compare classifiers, such as logistic regression and QDA, across all choices of t, the ROC curve is a useful tool. The abbreviation ROC stands for 'receiver operating characteristics', a name stemming from the method's origins in communications theory.
An ROC curve is drawn by plotting the true positive rate TP/P against the false positive rate FP/N for all values of the threshold t ∈ [0, 1]. A typical ROC curve is shown in Figure 3.9. A perfect classifier, which predicts everything correctly with full certainty, touches the upper left corner of the plot, whereas a classifier that guesses at random traces the diagonal straight line.
The area under the ROC curve, abbreviated AUC, is a concise summary of the ROC curve. A perfect classifier has AUC 1, while a classifier that makes random guesses has AUC 0.5.
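As an illustration, an ROC curve and its AUC can be computed as below; the labels and scores are synthetic stand-ins for test data and estimated probabilities p(y = 1 | x).

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=200)          # synthetic test labels
    scores = 0.3 * y_true + 0.7 * rng.random(200)  # noisy stand-in for p(y=1|x)

    # roc_curve sweeps the threshold t over all relevant values and returns
    # the false positive rate FP/N and true positive rate TP/P at each t.
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    print(f"AUC = {auc(fpr, tpr):.2f}")  # 1.0 = perfect, 0.5 = random guessing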
Example 3.2: Confusion matrix in thyroid disease detection
The thyroid is an endocrine gland that regulates the metabolic rate and protein synthesis, and thyroid disorders can cause serious health problems. In this example we consider detection of thyroid disease, using a data set from the UCI Machine Learning Repository (Dheeru and Karra Taniskidou 2017) with 7200 data points, each with 21 medical indicators as inputs. The qualitative diagnosis takes the values normal, hyperthyroid and hypothyroid, which we simplify into the binary classification normal versus not normal. The data set is split into a training set (3772 samples) and a test set (3428 samples). A logistic regression classifier is trained on the training data and applied to the test data, with the default threshold t = 0.5, which gives the confusion matrix

                       y = normal    y = not normal
    ŷ = normal         3177          237
    ŷ = not normal     1             13
Most test data points are correctly predicted as normal, but a large share of the not normal data points is also falsely predicted as normal, which may well be undesirable in this application.
To change the picture, we lower the threshold to t = 0.15 and obtain new predictions with the following confusion matrix instead:

                       y = normal    y = not normal
    ŷ = normal         3067          165
    ŷ = not normal     111           85
With the lower threshold, many more of the not normal patients are correctly detected: 85, compared to only 13 before. The price is a much higher number of false detections: 111 patients are now incorrectly predicted as not normal, compared to just 1 before. Whether this trade-off is an improvement depends on the application and on how severe the consequences of each type of error are.
This example also shows that total accuracy (one minus the misclassification rate) alone can be misleading. A predictor that always predicts normal achieves an accuracy of nearly 93% on this test data, whereas the classifier with threshold t = 0.15 has an accuracy of about 92% and yet is considerably more useful in practice.
Figure 3.9: An ROC curve for a typical classifier, together with the curves for a perfect classifier and for random guessing.
4 Non-parametric methods for regression and classification: k-NN and trees
The methods we have explored so far, including linear regression, logistic regression, LDA and QDA, all learn a fixed set of parameters from the training data, after which the training data can be discarded. More training data allows the parameters to be estimated more accurately (with less variance), but it does not make the model itself more flexible or expressive: logistic regression, for instance, is restricted to linear decision boundaries no matter how much training data is available.
Another category of methods instead adapts its complexity to the training data, rather than relying on a fixed structure with a fixed number of parameters. Two notable techniques in this category are k-nearest neighbors (k-NN) and tree-based methods. Both can be used for classification as well as regression, but the presentation here focuses on classification.
k-NN
Decision boundaries for k-NN
In Example 4.1 we computed a prediction for a single test data point x⋆. When this point is shifted to the left, to x⋆_alt = [0 2]^T, two of the three nearest training data points remain the same, but the data point i = 2 is replaced by i = 1. For k = 3 this gives p(Red | x⋆_alt) = 2/3 and hence the prediction ŷ = Red. The point [0.5 2]^T, which is equidistant from the data points i = 1 and i = 2, lies exactly on the decision boundary between the two classes. By reasoning in this way we can sketch the complete decision boundaries shown in Figure 4.1. Note that k-NN is a nonlinear classification method, since its decision boundaries are in general not linear.
Figure 4.1: Decision boundaries for the problem in Example 4.1 for the two choices of the parameter k.
Choosing k
The user has to decide which k to use in k-NN, and this decision has a big impact on the final classifier.
Figure 4.2 shows a problem with two input variables (p = 2) and three classes (K = 3), together with a fairly large number of training data points. The decision boundaries of a k-NN classifier are shown in two subfigures, for k = 1 and k = 11 respectively.
With k = 1, all training data points are classified correctly and the decision boundaries adapt closely to the training data, including its noise. With k = 11, the averaging means that some training data points are misclassified, and the decision boundaries are less adapted to the particular training data set. Even though k-NN with k = 1 fits the training data perfectly, k = 11 is often preferred, since it is less prone to overfitting and can be expected to perform better on unseen test data. A systematic way of choosing k is cross-validation, which is discussed in Chapter 5.
(a) Decision boundary for k-NN with k = 1 for a 3-class problem: a complex, but possibly also overfitted, decision boundary.
(b) Decision boundary for k-NN with k = 11: a more rigid and less flexible decision boundary.
Figure 4.2: Decision boundaries for k-NN.
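A sketch of how k could be chosen by cross-validation (Chapter 5), here with scikit-learn on synthetic data; the candidate values of k are arbitrary.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                               n_redundant=0, n_classes=3,
                               n_clusters_per_class=1, random_state=0)

    # Estimate out-of-sample accuracy for each candidate k with 5-fold
    # cross-validation, and pick the k with the best estimate.
    for k in [1, 3, 11, 31]:
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(f"k = {k:2d}: estimated accuracy = {acc.mean():.2f}")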
Normalization
One important practical aspect of k-NN is normalization of the input data. k-NN uses Euclidean distances to decide which training data points are closest to a test point, so these distances must meaningfully reflect similarity. Consider a training data set with two input variables, where the first takes values between 0 and 100 and the second between 0 and 1; this can easily happen when the inputs represent different physical quantities in different units. Without normalization, the Euclidean distance between a test point x⋆ and a training point x_i,

    sqrt((x_{i1} − x_{⋆1})² + (x_{i2} − x_{⋆2})²),

would almost entirely be determined by the first term (x_{i1} − x_{⋆1})², and the second component x_{i2} would have only a small impact on which neighbors are selected.
One way to normalize is to rescale the first component, x_{i1}^new = x_{i1}/100, so that both components take values in [0, 1]. More generally, this min-max normalization can be written as

    x_{ij}^new = (x_{ij} − min_ℓ x_{ℓj}) / (max_ℓ x_{ℓj} − min_ℓ x_{ℓj}),  for all j = 1, …, p and i = 1, …, n.

Another common normalization uses the mean and standard deviation of the training data,

    x_{ij}^new = (x_{ij} − x̄_j) / σ_j,

where x̄_j and σ_j are the mean and standard deviation of input variable j, respectively.
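Both normalizations are one-liners with numpy; the two-column data below is synthetic, with the ranges from the example above.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([100 * rng.random(50),  # first input in [0, 100]
                         rng.random(50)])       # second input in [0, 1]

    # Min-max normalization: every column is mapped to [0, 1].
    X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Standardization: subtract the column mean and divide by the column
    # standard deviation, both computed over the training data.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # [0. 0.] [1. 1.]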
Trees
Basics
A classification tree models p(y | x) via a series of rules based on the input variables x_1, …, x_p, which can be represented as a binary tree. The tree partitions the input space into disjoint regions, and within each region the predicted class probability p(y | x) is constant. An example makes this concrete.
Example 4.2: Predicting colors with a classification tree
Consider a classification problem with two input variables x_1 and x_2 and a qualitative output y with the two classes Red and Blue. To classify a new point x⋆ = [x⋆_1, x⋆_2]^T with a classification tree, we start at the top of the tree and traverse downwards until a terminal branch is reached. Each terminal branch carries a constant predicted class probability p(Red | x⋆), and the rules at the internal nodes, such as x_2 < 3.0 and x_1 < 5.0, determine the path taken.
A classification tree consists of internal nodes and leaf nodes. Each internal node applies a rule of the form x_j < s_k, where the left branch is taken if the rule is fulfilled and the right branch otherwise (x_j ≥ s_k). This particular tree has two internal nodes and three leaf nodes.
The tree partitions the input space into regions, one per leaf node, and the borders between the regions correspond to the splits in the tree. Each region is colored according to the class with the highest predicted probability.
A pseudo code for classifying a test input with the tree above would look like

    if x_2 < 3.0 then
        return p(Red|x) = 0
    else
        if x_1 < 5.0 then
            return p(Red|x) = 1/3
        else
            return p(Red|x) = 1
        end
    end
For the test point x⋆ = [2.5, 3.5]^T, the first rule sends us to the right branch, since x⋆_2 = 3.5 ≥ 3.0, and the second rule to the left branch, since x⋆_1 = 2.5 < 5.0. We obtain p(Red | x⋆) = 1/3 and p(Blue | x⋆) = 2/3. In this manner the input space is partitioned into rectangular regions, as shown in the accompanying figure.
The endpoints of the branches, here labeled R_1, R_2 and R_3, are called leaf nodes, and the conditions for the internal splits, such as x_2 < 3.0 and x_1 < 5.0, are called internal nodes. The links between the nodes are called branches. With more than two input variables the region partition is hard to visualize, but the tree itself looks the same: each internal node has exactly two branches.
This example illustrates how a classification tree is used to make predictions. How the tree is learned from training data is the topic of the next section.
Training a classification tree
The classification tree models the class probability p(y = k | x) as a constant c_mk within each region R_m, for each class k = 1, 2, …, K:

    p(y = k | x) = Σ_{m=1}^{M} c_mk I{x ∈ R_m},    (4.4)

where M is the total number of regions (leaf nodes) in the tree, and the indicator function I{x ∈ R_m} is 1 if x ∈ R_m and 0 otherwise. Since the probabilities must sum to one within each region, we also require Σ_{k=1}^{K} c_mk = 1 for each m.
To build a classification tree from training data {x_i, y_i}_{i=1}^{n}, we seek the tree that describes the observed training data as well as possible. As when we solved the logistic regression problem, we use the maximum likelihood approach, and maximizing the likelihood amounts to minimizing the negative log-likelihood. We thus seek the tree T that minimizes

    − Σ_{i=1}^{n} log p(y_i | x_i, T).    (4.5)
By inserting the model stated in (4.4) into (4.5), we get

    − Σ_{m=1}^{M} Σ_{k=1}^{K} n_m π̂_mk log c_mk,    (4.6)

where π̂_mk is the proportion of training data points in region R_m that are from class k, and n_m is the total number of training data points in region m. We can show¹ that

    − Σ_{k=1}^{K} π̂_mk log c_mk ≥ − Σ_{k=1}^{K} π̂_mk log π̂_mk,

with equality if and only if c_mk = π̂_mk. Minimizing (4.6) with respect to c_mk therefore gives c_mk = π̂_mk. It remains to find the regions R_m, which we want to choose to minimize

    Σ_{m=1}^{M} n_m Q_m(T),    (4.8)

where

    Q_m(T) = − Σ_{k=1}^{K} π̂_mk log π̂_mk

is known as the entropy² for region m.

¹ We use the so-called log sum inequality and the two constraints Σ_{k=1}^{K} c_mk = 1 and Σ_{k=1}^{K} π̂_mk = 1 for all m = 1, …, M.
Finding the tree that globally minimizes (4.8) is a combinatorial problem and computationally infeasible. Instead we use a greedy algorithm known as recursive binary splitting, which optimizes one split at a time rather than the whole tree at once. The algorithm starts at the top of the tree and successively splits the input space, each split creating two new branches. Since each split is chosen without regard to the splits that will follow further down the tree, the algorithm is greedy.
Consider the setting when we are about to do the first split. The split is a pair of half-planes,

    R_1(j, s) = {x : x_j < s}  and  R_2(j, s) = {x : x_j ≥ s}.
The split thus depends on the index j of the input variable at which the split is performed, and on the cutpoint s. The corresponding proportions π̂_mk will also depend on j and s,

    π̂_1k(j, s) = (1/n_1) Σ_{i: x_i ∈ R_1(j,s)} I{y_i = k},

and analogously for π̂_2k(j, s).
We seek the splitting variable j and cutpoint s that solve

    min_{j,s} [ n_1 Q_1(j, s) + n_2 Q_2(j, s) ].
For each input variable j we can scan through the finitely many splits that give distinct partitions of the data, and pick the pair (j, s) that minimizes the objective. The same procedure is then repeated to split each of the two resulting branches, and so on, until a predetermined stopping criterion is met, for example that no region contains more than five training data points.
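The following is a minimal sketch of one step of recursive binary splitting: it evaluates all distinct splits (j, s) by the entropy criterion and returns the best one. The toy data is synthetic; growing a full tree would apply best_split recursively to each resulting region.

    import numpy as np

    def entropy(labels):
        """Q_m(T) = -sum_k pi_mk log(pi_mk); absent classes contribute 0."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    def best_split(X, y):
        """Search all (j, s) and minimize n_1 Q_1 + n_2 Q_2."""
        best_j, best_s, best_cost = None, None, np.inf
        for j in range(X.shape[1]):
            # Only cutpoints between distinct observed values give
            # distinct partitions of the data points.
            v = np.unique(X[:, j])
            for s in (v[:-1] + v[1:]) / 2:
                left = X[:, j] < s
                cost = (left.sum() * entropy(y[left])
                        + (~left).sum() * entropy(y[~left]))
                if cost < best_cost:
                    best_j, best_s, best_cost = j, s, cost
        return best_j, best_s, best_cost

    rng = np.random.default_rng(0)
    X = 10 * rng.random((20, 2))              # synthetic toy data
    y = np.where(X[:, 1] < 4, "Blue", "Red")
    j, s, cost = best_split(X, y)
    print(f"split on x_{j + 1} < {s:.2f}, weighted entropy {cost:.2f}")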
The tree in Example 4.2 has been constructed based on the methodology outlined above, which we will illustrate in the example below.
Example 4.3: Learning a classification tree (continuation of Example 4.2)
We consider the same setup as in Example 4.2, with the following data set:

    x_1    x_2    y
    9.0    2.0    Blue
    1.0    4.0    Blue
    4.0    6.0    Blue
    4.0    1.0    Blue
    1.0    2.0    Blue
    1.0    8.0    Red
    6.0    4.0    Red
    7.0    9.0    Red
    9.0    8.0    Red
We want to learn a classification tree, by using the entropy criteria in (4.8) and growing the tree until there are no regions with more than five data points left.
There are infinitely many possible cutpoints, but two splits that partition the data points in the same way are equivalent. In practice we therefore only need to consider the nine distinct splits of this data set.
² If some π̂_mk = 0, we define 0 log 0 = 0, in accordance with the limit lim_{r→0+} r log r = 0.

The nine distinct splits are illustrated with dashed lines in the figure above.
We analyze all nine splits, beginning with the split at x_1 = 2.5, which divides the input space into the two regions R_1 (x_1 < 2.5) and R_2 (x_1 ≥ 2.5). Region R_1 contains two blue data points and one red, n_1 = 3 in total, so the class proportions are π̂_1B = 2/3 and π̂_1R = 1/3. The entropy of region R_1 is

    Q_1(T) = −(2/3 log(2/3) + 1/3 log(1/3)) = 0.64.
In region R_2 we have n_2 = 7 data points, with proportions π̂_2B = 3/7 and π̂_2R = 4/7. The entropy of this region is

    Q_2(T) = −(3/7 log(3/7) + 4/7 log(4/7)) = 0.68,    (4.11)

and the total weighted entropy for this split becomes

    n_1 Q_1(T) + n_2 Q_2(T) = 3 · 0.64 + 7 · 0.68 = 6.69.    (4.12)
We compute the cost for all other splits in the same manner, and summarize it in the table below.
From the table we can read off which split gives the smallest total weighted entropy; that split becomes the first split of the tree, and the procedure is then repeated within each of the resulting regions until no region contains more than five data points.

In practice, the learning rate is often instead kept constant at some value γ > 0. This approach has proven to be more effective in many scenarios, even though it sacrifices the theoretical convergence guarantees of the algorithm.
Figure 7.12 illustrates the minimization of a cost function J(θ) with a scalar parameter θ using gradient descent, for three choices of learning rate: (a) too low, (b) too high, and (c) well chosen.
Dropout
Neural network models, like all models discussed in this course, can suffer from overfitting if the model's flexibility exceeds the complexity of the data. One way to reduce variance and counteract overfitting is bagging (James et al. 2013, Chapter 8.2). In bagging, an ensemble of models is trained, each model being fitted to a different bootstrapped subset of the original training data. The final prediction is obtained by averaging the individual models' predictions.
Bagging can in principle be applied to neural networks, but it comes with practical challenges: training a single large neural network already takes considerable time and involves a large number of parameters to store, so training and storing multiple large networks can be prohibitively expensive in both runtime and memory. The dropout technique (Srivastava et al. 2014) offers a bagging-like solution that combines an ensemble of networks without training them separately: the different models share parameters, which significantly reduces both the computational cost and the memory footprint.
Dropout creates sub-networks by randomly removing hidden units from the parent network, which effectively forms an ensemble of models. Each unit is dropped with a predefined probability, and the units dropped in one sub-network are selected independently of those dropped in another. When a unit is removed, all of its incoming and outgoing connections are removed with it. Dropout can also be applied to the input variables, not only to the hidden units.
All sub-networks originate from the same parent network, so they share parameters; for example, the parameter β_55^(1) is present in both sub-networks in Figure 7.13b. This parameter sharing is what allows the ensemble of sub-networks to be trained efficiently.
To train with dropout, we use the mini-batch gradient descent algorithm (Algorithm 8). In each gradient update a mini-batch of data is used to approximate the gradient, but instead of computing the gradient for the entire network, we sample a random sub-network and compute the gradient for it.
For example, in a network with two hidden layers, we randomly drop units to obtain a sub-network, compute the gradient with the dropped units excluded, and take a gradient step that updates only the parameters present in that sub-network; the remaining parameters are left unchanged. For the next mini-batch a new set of randomly selected units is dropped, and this is repeated until some terminal condition is met.
This procedure to generate an ensemble of models differs from bagging in a few ways:
• In bagging all models are independent, in the sense that each model has its own parameters. In dropout, the different models (the sub-networks) share parameters.
• In bagging each model is trained until convergence, whereas in dropout each sub-network is trained for only a single gradient step. However, since the parameters are shared, every sub-network benefits from the gradient steps taken by the others.
• Like in bagging, each model is trained on a randomly selected subset of the training data; but where bagging uses a bootstrapped version of the entire data set for each model, dropout trains each sub-network on a single randomly chosen mini-batch of data.
Dropout, while differing from bagging in certain ways, has been empirically demonstrated to share similar benefits, particularly in mitigating overfitting and decreasing model variance.
Once the network has been trained with dropout, we want to make predictions for an unseen input x⋆. In bagging we would evaluate all models in the ensemble and average their results, but in dropout this is infeasible, since the number of possible sub-networks is astronomically large.
Figure 7.14: The network used for prediction after training with dropout. All units and links are kept, but the weights out of each unit are multiplied by the probability r of that unit being kept during training, to compensate for the units that were randomly dropped. During training, each unit was kept with probability r and dropped with probability 1 − r.
Instead of evaluating all possible sub-networks, we evaluate the full network once, with each estimated parameter multiplied by the probability that the corresponding unit was kept during training. This compensates for the fact that only a fraction of the incoming links were active during training, so that the expected input to each unit is the same at training and test time. That is, if a unit was kept with probability r during training, we multiply all parameters out of that unit by r before making predictions. This weight-scaling approximation of the average over all ensemble members has been found to work well in practice, even though there is no solid theoretical argument for its accuracy.
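A numpy sketch of the two phases for a single layer: a random mask during training, and weight scaling by the keep probability r at prediction time. The sizes and values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    r = 0.8                       # probability of keeping a unit in training
    W = rng.normal(size=(5, 3))   # weights from 5 hidden units to 3 outputs
    h = rng.normal(size=5)        # activations of the 5 hidden units

    # Training: sample a sub-network by dropping each unit with prob. 1 - r;
    # dropped units (mask == False) contribute nothing to the next layer.
    mask = rng.random(5) < r
    out_train = (h * mask) @ W

    # Prediction: keep all units but scale the outgoing weights by r, so the
    # expected input to the next layer matches what was seen in training.
    out_test = h @ (r * W)
    print(out_train, out_test)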
Dropout is thus a regularization method: it reduces variance and counteracts overfitting. Other regularization techniques include parameter penalties (as in ridge regression and LASSO), early stopping, and various sparse representations, such as CNNs, which force many parameters to be zero. Since its introduction, dropout has become popular thanks to its simplicity, its low computational cost, and its good performance. A common piece of design advice is to extend the network until it overfits, then extend it a bit further and add dropout to counteract the overfitting.
Perspective and further reading
Although the first conceptual ideas of neural networks date back to the 1940s (McCulloch and Pitts, 1943), interest in them has waxed and waned over the decades.
In the late 1980s and early 1990s, neural networks gained significant traction with the introduction of the back-propagation algorithm, enabling tasks such as classifying handwritten digits from low-resolution images. By the late 1990s, however, interest waned, as neural networks struggled with more complex problems in computer vision and speech recognition, where hand-crafted solutions based on domain-specific knowledge were considered more effective.
Since the late 2000s the picture has changed dramatically, driven by advances in software, hardware, and the parallelization of algorithms. Problems that seemed out of reach only a couple of decades ago, particularly in image processing, have since been tackled successfully.
Deep learning models now achieve near-human performance on a range of tasks (LeCun, Bengio, and Hinton, 2015). Notable examples include algorithms that learn to play video games from raw pixels (Mnih et al., 2015) and systems that automatically generate captions for images (Xu et al., 2015).
A fairly recent and accessible introduction to and overview of deep learning is provided by LeCun, Bengio, and Hinton (2015), and a recent textbook is Goodfellow, Bengio, and Courville (2016).
Random variables
Marginalization
A multivariate random variable z can be partitioned into two parts, z = [z_1^T, z_2^T]^T, where z_1 and z_2 may themselves be scalars or vectors. If the joint probability density function p(z) = p(z_1, z_2) is known, the marginal distribution of z_1 is obtained by marginalization,

    p(z_1) = ∫ p(z_1, z_2) dz_2.
The other marginal p(z_2) is obtained analogously, by integrating over z_1 instead. Figure A.1 illustrates a joint two-dimensional density p(z_1, z_2) together with its marginal densities p(z_1) and p(z_2).
Conditioning
Consider again the multivariate random variable z, which can be partitioned into two parts z = [z_1^T, z_2^T]^T.
We can now define the conditional distribution of z_1, conditioned on having observed a value z_2 = z_2, as

    p(z_1 | z_2) = p(z_1, z_2) / p(z_2).    (A.7)
The conditional distribution of z_2 given an observed value z_1 = z_1 is defined analogously. Figure A.1 also illustrates the joint two-dimensional density p(z_1, z_2) together with the conditional density p(z_1 | z_2).
From (A.7) it follows that the joint probability density function p(z_1, z_2) can be factorized into the product of a marginal and a conditional,

    p(z_1, z_2) = p(z_2 | z_1) p(z_1) = p(z_1 | z_2) p(z_2).    (A.8)
If we use this factorization for the numerator of the right-hand side of (A.7), we end up with the relationship

    p(z_1 | z_2) = p(z_2 | z_1) p(z_1) / p(z_2).    (A.9)
This equation is often referred to as Bayes’ rule.
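For a discrete distribution the integrals become sums, and marginalization, conditioning and Bayes' rule take only a few lines of numpy; the joint pmf below is made up for illustration.

    import numpy as np

    # Hypothetical joint pmf p(z1, z2) on a 2 x 3 grid; rows index z1,
    # columns index z2. The entries sum to one.
    p_joint = np.array([[0.10, 0.20, 0.10],
                        [0.25, 0.05, 0.30]])

    p_z1 = p_joint.sum(axis=1)      # marginalization over z2
    p_z2 = p_joint.sum(axis=0)      # marginalization over z1

    p_z1_given_z2 = p_joint / p_z2              # conditioning, as in (A.7)
    p_z2_given_z1 = p_joint / p_z1[:, None]

    # Bayes' rule (A.9) recovers p(z1 | z2) from the other factorization.
    bayes = p_z2_given_z1 * p_z1[:, None] / p_z2
    print(np.allclose(bayes, p_z1_given_z2))    # True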
Approximating an integral with a sum
An integral of a smooth function h(z) against a probability density p(z) can be approximated with a sum over M samples in the following fashion:

    ∫ h(z) p(z) dz ≈ (1/M) Σ_{i=1}^{M} h(z_i),  z_i ∼ p(z),

where each sample z_i is drawn independently from p(z). This is known as Monte Carlo integration. As the number of samples M tends to infinity, the approximation converges to the exact value of the integral with probability one.
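A minimal Monte Carlo sketch: with p(z) a standard Gaussian and h(z) = z², the integral is the variance of z, which is 1.

    import numpy as np

    rng = np.random.default_rng(0)
    M = 100_000
    z = rng.normal(size=M)   # z_i drawn independently from p(z) = N(0, 1)

    # (1/M) * sum_i h(z_i) approximates the integral of h(z) p(z) dz.
    print(np.mean(z ** 2))   # close to the exact value 1 for large M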
The unconstrained optimization problem is to find the value θ̂ of a variable θ that minimizes¹ a cost function² L(θ),

    θ̂ = arg min_θ L(θ).    (B.1)

Problems of this form arise throughout science and engineering. In this course, one example is linear regression, where maximizing the likelihood amounts to a least squares problem whose solution is available explicitly via the normal equations. Many other optimization problems have no explicit solution, for example those arising in deep learning and logistic regression, and must instead be handled with approximate numerical methods. This appendix is a brief introduction to the practical side of unconstrained numerical optimization.
Most numerical optimization algorithms are built around a simple local model of the complicated cost function L(θ) around the current value of θ; the model is typically valid only in a neighborhood of that value. The model is used to find a new θ with a lower cost, and repeating this yields the iterative procedures common to numerical optimization algorithms. Different methods differ in the details, but they share these fundamental components. For a thorough treatment of practical unconstrained optimization we refer to the many textbooks on the topic.
A general iterative solution
A solution to the unconstrained minimization problem (B.1) is a global minimizer θ̂, a point such that L(θ̂) ≤ L(θ) for all θ ∈ R^n. Finding a global minimizer is generally hard, and in practice we often have to settle for a local minimizer instead: a point θ̂ is a local minimizer if there exists a neighborhood M of θ̂ such that L(θ̂) ≤ L(θ) for all θ ∈ M.
To find a local minimizer we begin from an initial point θ_0. If θ_0 is not a local minimizer of L(θ), there is an increment d_0 such that L(θ_0 + d_0) < L(θ_0). Similarly, if θ_1 = θ_0 + d_0 is not a local minimizer,
¹ Note that it is sufficient to consider minimization problems, since any maximization problem can be recast as a minimization problem simply by changing the sign of the cost function.
² Throughout the course we have discussed various loss functions, which are all examples of cost functions.

we can find an increment d_1 such that L(θ_1 + d_1) < L(θ_1), and so on. The procedure is repeated until no increment decreasing the objective can be found, at which point we have arrived at a local minimizer. Most algorithms for solving this problem are iterative procedures of this kind. The increment d is often resolved into two parts,

    d = γ p.    (B.2)
Here the scalar γ > 0 is called the step length and the vector p ∈ R^n is called the search direction. The algorithm seeks the solution by moving in the search direction, and the step length determines how far it moves. This raises several questions:
1. How can we compute a useful search direction p?
2. How big steps should we take, i.e., what is a good value of the step length γ?
3. How do we determine when we have reached a local minimizer, so that we can stop searching for new directions?
Below we address these questions, and finally present a general algorithm commonly used for unconstrained minimization.
A straightforward way of finding a general characterization of all search directions p resulting in a decrease of the cost function, i.e., directions p such that

    L(θ + p) < L(θ),    (B.3)

is to construct a local model of the cost function around the point θ. By Taylor's theorem, which provides polynomial approximations of a function around a given point, a linear approximation of L(θ) around θ is

    L(θ + p) ≈ L(θ) + p^T ∇L(θ).

Replacing the left-hand side of (B.3) by this approximation, a useful search direction p is one that satisfies

    L(θ) + p^T ∇L(θ) < L(θ)  ⇔  p^T ∇L(θ) < 0.
To add flexibility, we define the search direction as p = −V∇L(θ), where V is a positive definite scaling matrix. Inserting this expression into the condition above gives

    p^T ∇L(θ) = −∇L(θ)^T V^T ∇L(θ) = −‖∇L(θ)‖²_V < 0,

where the inequality holds because a squared weighted two-norm is positive (whenever ∇L(θ) ≠ 0). Hence, any search direction of the form p = −V∇L(θ) decreases the value of the cost function; such a direction is called a descent direction.
Algorithm 9 outlines the resulting strategy; the subscript t emphasizes its iterative nature. The algorithm searches along the ray defined by the current iterate θ_t and the direction p_t, and decides how far to move by minimizing the cost function along that line,

    min_γ L(θ_t + γ p_t).

This procedure is known as line search.
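A sketch of the resulting iterative scheme, using the steepest descent direction (introduced below) and a crude backtracking line search in place of the exact minimization over γ; the quadratic test problem is made up.

    import numpy as np

    def descent(L, grad, theta0, gamma0=1.0, tol=1e-6, max_iter=1000):
        """Iterate theta <- theta + gamma * p with p = -grad L(theta)."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad(theta)
            if np.linalg.norm(g) < tol:   # stop: (almost) a stationary point
                break
            p, gamma = -g, gamma0
            while L(theta + gamma * p) >= L(theta):  # backtracking line search
                gamma /= 2
            theta = theta + gamma * p
        return theta

    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    L = lambda th: 0.5 * th @ A @ th - b @ th   # convex quadratic cost
    grad = lambda th: A @ th - b
    print(descent(L, grad, [0.0, 0.0]))          # close to solve(A, b)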
Commonly used search directions
Steepest descent direction
For the scalar product³ in the descent condition we have p^T ∇L(θ_t) = ‖p‖ ‖∇L(θ_t)‖ cos φ < 0, where φ is the angle between p and ∇L(θ_t). If we fix the length of p, the scalar product is minimized by choosing φ = π, i.e., by taking

    p = −∇L(θ_t).    (B.10)
The gradient vector at a specific point indicates the direction of the steepest ascent of the function, which is why the suggested search direction in (B.10) is known as the steepest descent direction.
³ The scalar (or dot) product of two vectors a and b is defined as a^T b = ‖a‖ ‖b‖ cos φ, where ‖a‖ denotes the length (magnitude) of the vector a and φ denotes the angle between a and b.
The steepest descent method can be slow, since it uses only limited information about the cost function. Newton and quasi-Newton methods use additional information about the local geometry of the cost function to build a more detailed local model and improve convergence.
Newton direction
We can improve the model of the objective function by also including the quadratic term of the Taylor expansion. This gives a quadratic approximation m(θ_t, p_t) of the cost function around the current iterate θ_t,

    m(θ_t, p_t) = L(θ_t) + p_t^T g_t + (1/2) p_t^T H_t p_t,    (B.11)
where g_t = ∇L(θ)|_{θ=θ_t} denotes the gradient and H_t = ∇²L(θ)|_{θ=θ_t} the Hessian of the cost function, both evaluated at the current iterate θ_t. The Newton direction is the search direction that minimizes the quadratic model (B.11), found by setting its derivative

    ∂m(θ_t, p_t)/∂p_t = g_t + H_t p_t    (B.12)

to zero, resulting in

    p_t = −H_t^{−1} g_t.    (B.13)
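In code, the Newton direction is best obtained by solving the linear system H_t p_t = −g_t rather than forming the inverse; the gradient and Hessian below are placeholder values.

    import numpy as np

    def newton_direction(g, H):
        """Solve H p = -g for the Newton direction p = -H^{-1} g."""
        return np.linalg.solve(H, -g)

    H = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian at the current iterate
    g = np.array([1.0, -1.0])                # gradient at the current iterate
    print(newton_direction(g, H))

    # For a quadratic cost, a single full Newton step (gamma = 1) lands
    # exactly at the stationary point of the model (B.11).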
Computing the Hessian can be both difficult and costly, which has led to search directions that make use of approximations of the Hessian. Such directions are commonly referred to as quasi-Newton directions.
Quasi-Newton
The quasi-Newton direction makes use of a local quadratic model of the cost function, just as the Newton direction does. However, rather than computing the Hessian, it learns an approximation of the Hessian from the information available in the cost function values and their gradients.
Let us first denote the line segment connecting two adjacent iterates θ_t and θ_{t+1} by

    r_t(τ) = θ_t + τ(θ_{t+1} − θ_t),  τ ∈ [0, 1].    (B.14)
From the fundamental theorem of calculus we know that

    ∫_0^1 (∂/∂τ) ∇L(r_t(τ)) dτ = ∇L(r_t(1)) − ∇L(r_t(0)) = ∇L(θ_{t+1}) − ∇L(θ_t) = g_{t+1} − g_t,    (B.15)

and from the chain rule we have that

    (∂/∂τ) ∇L(r_t(τ)) = ∇²L(r_t(τ)) (θ_{t+1} − θ_t).    (B.16)

Hence, combining (B.15) and (B.16) we obtain

    y_t = ∫_0^1 ∇²L(r_t(τ)) s_t dτ,    (B.17)

where we have defined y_t = g_{t+1} − g_t and s_t = θ_{t+1} − θ_t. An interpretation of this equation is that the difference y_t between two consecutive gradients is given by integrating the Hessian times s_t along the line segment between θ_t and θ_{t+1}.
The idea behind quasi-Newton methods is to approximate the integral in (B.17) by a constant matrix B_{t+1}, which gives

    y_t = B_{t+1} s_t,

known as the secant condition. The secant condition alone does not determine B_{t+1} uniquely; a symmetric matrix has more degrees of freedom than the secant condition constrains. Quasi-Newton methods therefore regularize the choice, selecting B_{t+1} as the solution to

    min_B ‖B − B_t‖_W  subject to  B = B^T,  y_t = B s_t,

where different choices of the weighting matrix W give rise to different algorithms, the most common being BFGS, DFP and Broyden's method. The resulting approximation B_{t+1} is updated iteratively and used in place of the Hessian when computing the search direction (B.13).
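In practice one rarely implements these updates by hand; for instance, SciPy exposes BFGS directly. A minimal usage sketch on the Rosenbrock function, a standard non-convex test problem:

    import numpy as np
    from scipy.optimize import minimize

    def L(theta):  # Rosenbrock function, minimized at theta = (1, 1)
        return (1 - theta[0]) ** 2 + 100 * (theta[1] - theta[0] ** 2) ** 2

    # method="BFGS" builds the Hessian approximation from gradients alone;
    # here the gradient itself is approximated numerically by SciPy.
    res = minimize(L, x0=np.array([-1.2, 1.0]), method="BFGS")
    print(res.x)  # approximately [1, 1]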
Further reading

This appendix is inspired by Nocedal and Wright (2006) and Wills (2017), which offer a much deeper treatment of the numerical solution of optimization problems. An important first step when facing an optimization problem is to determine whether it is convex or non-convex; the account above is concerned mainly with the non-convex case, and for convex problems Boyd and Vandenberghe (2004) provide an excellent engineering perspective. Bottou, Curtis, and Nocedal (2017) give a thorough introduction to numerical optimization in machine learning, with emphasis on the large-scale problems that naturally lead to the stochastic optimization methods discussed in the deep learning chapter.
Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin (2012). Learning From Data: A short course. AMLbook.com.
Barber, David (2012). Bayesian reasoning and machine learning. Cambridge University Press.
Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.
Bottou, L., F. E. Curtis, and J. Nocedal (2017). Optimization methods for large-scale machine learning.
Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge, UK: Cambridge University Press.
Breiman, Leo (Oct. 2001). "Random Forests". In: Machine Learning 45.1, pp. 5–32.
Deisenroth, M. P., A. Faisal, and C. O. Ong (2019). Mathematics for machine learning. Cambridge University Press.
Dheeru, Dua and Efi Karra Taniskidou (2017). UCI Machine Learning Repository. url: http://archive.ics.uci.edu/ml.
Efron, Bradley and Trevor Hastie (2016). Computer age statistical inference. Cambridge University Press.
Ezekiel, Mordecai and Karl A. Fox (1959). Methods of Correlation and Regression Analysis. John Wiley & Sons.
Freund, Yoav and Robert E. Schapire (1996). "Experiments with a new boosting algorithm". In: Proceedings of the 13th International Conference on Machine Learning (ICML).
Friedman, Jerome (2001). "Greedy function approximation: A gradient boosting machine". In: The Annals of Statistics 29.5, pp. 1189–1232.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2000). "Additive logistic regression: a statistical view of boosting (with discussion)". In: The Annals of Statistics 28.2, pp. 337–407.
Gelman, Andrew et al. (2013). Bayesian data analysis. 3rd ed. CRC Press.
Ghahramani, Zoubin (May 2015). "Probabilistic machine learning and artificial intelligence". In: Nature 521, pp. 452–459.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. http://www.deeplearningbook.org. MIT Press.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009). The elements of statistical learning: Data mining, inference, and prediction. 2nd ed. Springer.
Hastie, Trevor, Robert Tibshirani, and Martin J. Wainwright (2015). Statistical learning with sparsity: the Lasso and generalizations. CRC Press.
Hoerl, Arthur E. and Robert W. Kennard (1970). "Ridge regression: biased estimation for nonorthogonal problems". In: Technometrics 12.1, pp. 55–67.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013). An introduction to statistical learning: With applications in R. Springer.
Jordan, M. I. and T. M. Mitchell (2015). "Machine learning: trends, perspectives, and prospects". In: Science 349.6245, pp. 255–260.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). "Deep learning". In: Nature 521, pp. 436–444.
LeCun, Yann, Bernhard Boser, et al. (1990). "Handwritten Digit Recognition with a Back-Propagation Network". In: Advances in Neural Information Processing Systems (NIPS), pp. 396–404.
MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge University Press.