
Statistical foundations of machine learning




DOCUMENT INFORMATION

Basic information

Title: Statistical Foundations Of Machine Learning
Author: Gianluca Bontempi
Institution: Universite Libre de Bruxelles
Field: Computer Science
Type: Handbook
Year: 2017
City: Bruxelles
Pages: 200
Size: 5.25 MB

Structure

  • 1.1 Notations
  • 2.1 The random model of uncertainty
    • 2.1.1 Axiomatic definition of probability
    • 2.1.2 Symmetrical definition of probability
    • 2.1.3 Frequentist definition of probability
    • 2.1.4 The Law of Large Numbers
    • 2.1.5 Independence and conditional probability
    • 2.1.6 Combined experiments
    • 2.1.7 The law of total probability and the Bayes’ theorem
    • 2.1.8 Array of joint/marginal probabilities
  • 2.2 Random variables
  • 2.3 Discrete random variables
    • 2.3.1 Parametric probability function
    • 2.3.2 Expected value, variance and standard deviation of a discrete r.v.
    • 2.3.3 Moments of a discrete r.v.
    • 2.3.4 Entropy and relative entropy
  • 2.4 Continuous random variable
    • 2.4.1 Mean, variance, moments of a continuous r.v.
  • 2.5 Joint probability
    • 2.5.1 Marginal and conditional probability
    • 2.5.2 Chain rule
    • 2.5.3 Independence
    • 2.5.4 Conditional independence
    • 2.5.5 Entropy in the continuous case
  • 2.6 Common univariate discrete probability functions
    • 2.6.1 The Bernoulli trial
    • 2.6.2 The Binomial probability function
    • 2.6.3 The Geometric probability function
    • 2.6.4 The Poisson probability function
  • 2.7 Common univariate continuous distributions
    • 2.7.1 Uniform distribution
    • 2.7.2 Exponential distribution
    • 2.7.3 The Gamma distribution
    • 2.7.4 Normal distribution: the scalar case
    • 2.7.5 The chi-squared distribution
    • 2.7.6 Student’s t-distribution
    • 2.7.7 F-distribution
  • 2.8 Bivariate continuous distribution
    • 2.8.1 Correlation
    • 2.8.2 Mutual information
  • 2.9 Normal distribution: the multivariate case
    • 2.9.1 Bivariate normal distribution
  • 2.10 Linear combinations of r.v.
    • 2.10.1 The sum of i.i.d. random variables
  • 2.11 Transformation of random variables
  • 2.12 The central limit theorem
  • 2.13 The Chebyshev’s inequality
  • 3.1 Classical approach
    • 3.1.1 Point estimation
  • 3.2 Empirical distributions
  • 3.3 Plug-in principle to define an estimator
    • 3.3.1 Sample average
    • 3.3.2 Sample variance
  • 3.4 Sampling distribution
  • 3.5 The assessment of an estimator
    • 3.5.1 Bias and variance
    • 3.5.2 Bias and variance of μ̂
    • 3.5.3 Bias of the estimator σ̂²
    • 3.5.4 Bias/variance decomposition of MSE
    • 3.5.5 Consistency
    • 3.5.6 Efficiency
    • 3.5.7 Sufficiency
  • 3.6 The Hoeffding’s inequality
  • 3.7 Sampling distributions for Gaussian r.v.s
  • 3.8 The principle of maximum likelihood
    • 3.8.1 Maximum likelihood computation
    • 3.8.2 Properties of m.l. estimators
    • 3.8.3 Cramer-Rao lower bound
  • 3.9 Interval estimation
    • 3.9.1 Confidence interval of μ
  • 3.10 Combination of two estimators
    • 3.10.1 Combination of m estimators
  • 3.11 Testing hypothesis
    • 3.11.1 Types of hypothesis
    • 3.11.2 Types of statistical test
    • 3.11.3 Pure significance test
    • 3.11.4 Tests of significance
    • 3.11.5 Hypothesis testing
    • 3.11.6 Choice of test
    • 3.11.7 UMP level-α test
    • 3.11.8 Likelihood ratio test
  • 3.12 Parametric tests
    • 3.12.1 z-test (single and one-sided)
    • 3.12.2 t-test: single sample and two-sided
    • 3.12.3 χ²-test: single sample and two-sided
    • 3.12.4 t-test: two samples, two sided
    • 3.12.5 F-test: two samples, two sided
  • 3.13 A posteriori assessment of a test
    • 3.13.1 Receiver Operating Characteristic curve
  • 4.1 Nonparametric methods
  • 4.2 Estimation of arbitrary statistics
  • 4.3 Jacknife
    • 4.3.1 Jacknife estimation
  • 4.4 Bootstrap
    • 4.4.1 Bootstrap sampling
    • 4.4.2 Bootstrap estimate of the variance
    • 4.4.3 Bootstrap estimate of bias
  • 4.5 Bootstrap confidence interval
    • 4.5.1 The bootstrap principle
  • 4.6 Randomization tests
    • 4.6.1 Randomization and bootstrap
  • 4.7 Permutation test
  • 4.8 Considerations on nonparametric tests
  • 5.1 Introduction
  • 5.2 Estimating dependencies
  • 5.3 The problem of classification
    • 5.3.1 Inverse conditional distribution
  • 5.4 The problem of regression estimation
    • 5.4.1 An illustrative example
  • 5.5 Generalization error
    • 5.5.1 The decomposition of the generalization error in regression
    • 5.5.2 The decomposition of the generalization error in classification
  • 5.6 The supervised learning procedure
  • 5.7 Validation techniques
    • 5.7.1 The resampling methods
  • 5.8 Concluding remarks
  • 6.1 Introduction
  • 6.2 Problem formulation
  • 6.3 Experimental design
  • 6.4 Data pre-processing
  • 6.5 The dataset
  • 6.6 Parametric identification
    • 6.6.1 Error functions
    • 6.6.2 Parameter estimation
  • 6.7 Structural identification
    • 6.7.1 Model generation
    • 6.7.2 Validation
    • 6.7.3 Model selection criteria
  • 6.8 Concluding remarks
  • 7.1 Linear regression
    • 7.1.1 The univariate linear model
    • 7.1.2 Least-squares estimation
    • 7.1.3 Maximum likelihood estimation
    • 7.1.4 Partitioning the variability
    • 7.1.5 Test of hypotheses on the regression model
    • 7.1.6 Interval of confidence
    • 7.1.7 Variance of the response
    • 7.1.8 Coefficient of determination
    • 7.1.9 Multiple linear dependence
    • 7.1.10 The multiple linear regression model
    • 7.1.11 The least-squares solution
    • 7.1.12 Variance of the prediction
    • 7.1.13 The HAT matrix
    • 7.1.14 Generalization error of the linear model
    • 7.1.15 The expected empirical error
    • 7.1.16 The PSE and the FPE
  • 7.2 The PRESS statistic
  • 7.3 The weighted least-squares
    • 7.3.1 Recursive least-squares
  • 7.4 Discriminant functions for classification
    • 7.4.1 Perceptrons
    • 7.4.2 Support vector machines
  • 8.1 Nonlinear regression
    • 8.1.1 Artificial neural networks
    • 8.1.2 From global modeling to divide-and-conquer
    • 8.1.3 Classification and Regression Trees
    • 8.1.4 Basis Function Networks
    • 8.1.5 Radial Basis Functions
    • 8.1.6 Local Model Networks
    • 8.1.7 Neuro-Fuzzy Inference Systems
    • 8.1.8 Learning in Basis Function Networks
    • 8.1.9 From modular techniques to local modeling
    • 8.1.10 Local modeling
  • 8.2 Nonlinear classification
    • 8.2.1 Naive Bayes classifier
    • 8.2.2 SVM for nonlinear classification
  • 9.1 Stacked regression
  • 9.2 Bagging
  • 9.3 Boosting
    • 9.3.1 The AdaBoost algorithm
    • 9.3.2 The arcing algorithm
    • 9.3.3 Bagging and boosting
  • 10.1 Curse of dimensionality
  • 10.2 Approaches to feature selection
  • 10.3 Filter methods
    • 10.3.1 Principal component analysis
    • 10.3.2 Clustering
    • 10.3.3 Ranking methods
  • 10.4 Wrapping methods
    • 10.4.1 Wrapping search strategies
  • 10.5 Embedded methods
    • 10.5.1 Shrinkage methods
  • 10.6 Averaging and feature selection
  • 10.7 Feature selection from an information-theoretic perspective
    • 10.7.1 Relevance, redundancy and interaction
    • 10.7.2 Information theoretic filters
  • 10.8 Conclusion
  • 11.1 Causality and dependencies
  • A.1 Probability density estimation
    • A.1.1 Nonparametric density estimation
    • A.1.2 Semi-parametric density estimation
  • A.2 K-means clustering
  • A.3 Fuzzy clustering
  • A.4 Fuzzy c-elliptotypes
  • B.1 Useful relations
  • B.2 Convergence of random variables
  • B.3 Limits and probability
  • B.4 Expected value of a quadratic form
  • B.5 The matrix inversion formula
  • B.6 Proof of Eq. (5.4.22)
  • B.7 Biasedness of the quadratic empirical risk
  • D.1 USPS dataset
  • D.2 Golub dataset

Content

Handbook: Statistical foundations of machine learning
Gianluca Bontempi
Machine Learning Group, Computer Science Department
Universite Libre de Bruxelles, ULB, Belgique
June 2, 2017

Notations

The first chapters introduce the probabilistic and statistical notions underlying most of the machine learning techniques.

Chapter 7 presents conventional linear approaches to regression and classification.

Chapter 8 introduces some machine learning techniques which deal with nonlinear regression and classification tasks.

Chapter 9 presents the model averaging approach, a recent and powerful way for obtaining improved generalization accuracy by combining several learning machines.

Although the book focuses on supervised learning, some related notions of unsupervised learning and density estimation are given in Appendix A.

In this manuscript, we use boldface to represent random variables, while instances or realizations of these variables are written in normal font. Although it is essential, in terms of notation, to differentiate between a random variable and its realization, we will apply this distinction only when the context does not make the meaning clear.

As for variables, lowercase letters represent scalars or vectors of observables, while Greek letters denote parameter vectors. Uppercase letters denote matrices, and italicized uppercase letters indicate generic sets. Uppercase Greek letters are used to represent sets of parameters.

Generic notation

θ : parameter vector.
θ (boldface) : random parameter vector.
[N × n] : dimensionality of a matrix with N rows and n columns.
M^T : transpose of the matrix M.
diag[m1, ..., mN] : diagonal matrix with diagonal m1, ..., mN.
M (boldface) : random matrix.
θ̂ : estimate of θ.
θ̂ (boldface) : estimator of θ.
τ : index in an iterative algorithm.
Ω : set of possible outcomes.
ω : outcome (or elementary event).
Prob{E} : probability of the event E.
(Ω, {E}, Prob{·}) : probabilistic model of an experiment.
Z : domain of the random variable z.
P(z) : probability distribution of a discrete random variable z; also P_z(z).
F(z) = Prob{z ≤ z} : distribution function of a continuous random variable z; also F_z(z).
p(z) : probability density of a continuous r.v.; also p_z(z).
E[z] : expected value of the random variable z.
E_x[z] = ∫_X z(x, y) p(x) dx : expected value of the random variable z averaged over x.
Var[z] : variance of the random variable z.
L(θ) : likelihood of a parameter θ.
l(θ) : log-likelihood of a parameter θ.
l_emp(θ) : empirical log-likelihood of a parameter θ.

Learning Theory notation

x : multidimensional input variable.
X ⊂ R^n : input space.
y : multidimensional output variable.
Y ⊂ R : output space.
x_i : i-th realization of the random vector x.
f(x) : target regression function.
w : random noise variable.
z_i = <x_i, y_i> : input-output sample, the i-th case in the training set.
D_N = {z_1, z_2, ..., z_N} : training set.
Λ : class of hypotheses.
α : hypothesis parameter vector.
h(x, α) : hypothesis function.
Λ_s : hypothesis class of complexity s.
R_emp(α) : empirical functional risk.
α_N : arg min R_emp(α).
G_N : mean integrated squared error (MISE).
l : number of folds in cross-validation.
Ĝ_cv : cross-validation estimate of G_N.
Ĝ_loo : leave-one-out estimate of G_N.
N_tr : number of samples used for training in cross-validation.
N_ts : number of samples used for test in cross-validation.
α^i_{N_tr}, i = 1, ..., l : parameter which minimizes the empirical risk of D^i_{N_tr}.
D^(i) : training set with the i-th sample set aside.
α_N^(i) : parameter which minimizes the empirical risk of D^(i).
D*_N : bootstrap training set of size N generated from D_N with replacement.
N_b : number of bootstrap samples.

Data analysis notation

x_ij : j-th element of the vector x_i.
X_ij : (i, j)-th element of the matrix X.
q : query point, i.e. the input value where a prediction is required.
ŷ_q : prediction at the query point q.
ŷ_i^{-j} : prediction for the i-th input with the j-th sample set aside.
e_j^loo = y_j − ŷ_j^{-j} : leave-one-out error with the j-th sample set aside.
h_j(x, α) : j-th local model in a modular architecture.
β : vector of linear coefficients.
β̂ : least-squares estimate of the linear coefficients.
β̂^{-j} : least-squares estimate with the j-th sample set aside.
ρ_j : activation or basis function.
η_j : set of parameters of the activation function ρ_j.
B : bandwidth.

Probability theory studies uncertain phenomena, using mathematical language to quantify uncertainty. Although unpredictable in a deterministic sense, these phenomena exhibit regularities that can be modeled by idealized probabilistic frameworks specifying all possible outcomes and their associated probabilities. This foundational chapter introduces the essential probability concepts needed to understand the statistical elements of machine learning. The reader is encouraged to focus on the random variable as a compact representation of uncertainty and on probability as a formal tool for manipulating uncertain information. Special emphasis is placed on conditional and joint probability, since these notions are vital for statistical modeling and machine learning: they define the relationships and dependencies between random variables.

The random model of uncertainty

Axiomatic definition of probability

Probability quantifies uncertainty and is defined with reference to a random experiment. To any event E we associate a probability, denoted Prob{E}, which is a real number in the interval [0, 1]. The function Prob{·} : {E} → [0, 1] is called the probability measure or probability distribution and must satisfy the following three axioms:

1. Prob{E} ≥ 0 for every event E.

2. Prob{Ω} = 1.

3. Prob{E1 + E2} = Prob{E1} + Prob{E2} if E1 and E2 are mutually exclusive.

The axioms of probability theory establish fundamental principles: the first axiom asserts that probabilities are nonnegative real numbers; the second axiom assigns a probability of one to the universal event Ω, ensuring the normalization of the probability measure; and the third axiom emphasizes that the probability function must be additive, aligning with our intuitive understanding of probability behavior.

All probabilistic results are based directly or indirectly on the axioms and only the axioms; for instance, the relation Prob{E^c} = 1 − Prob{E} for the complementary event E^c follows from the axioms alone.

Different interpretations of the probability axioms are possible; the frequentist and the Bayesian perspectives are discussed in Section 2.1.3. In any case, the probability function is a formal way of representing uncertainty, and it agrees reasonably well with human perception of uncertainty.

So from a mathematician's point of view, probability is easy to define: it is a countably additive set function defined on a Borel field, with a total mass of one.

A key challenge in probabilistic modeling is accurately computing the probability value Prob{E} for a generic event E The assignment of probabilities is often the most complex aspect of developing these models While probability theory itself is neutral and can draw inferences independent of specific probability assignments, the outcomes are significantly influenced by the chosen assignments Inaccurate probability assignments can lead to misleading predictions that do not accurately represent the behavior of the modeled phenomenon The following sections will outline commonly used procedures in practice to address these challenges.

Symmetrical definition of probability

Consider a random experiment characterized by M symmetric outcomes, each equally likely to occur, and let M_E denote the number of outcomes favourable to the event E, i.e. the outcomes whose occurrence implies the occurrence of E.

An intuitive definition of the probability (also known as the classical definition) of the event E that adheres to the axioms is

Prob{E} = M_E / M

In other words, the probability of an event equals the ratio of its favourable outcomes to the total number of outcomes, provided that all outcomes are equally likely [91].

The symmetric hypothesis underlies many typical probability computations, such as rolling a fair die or selecting a ball from a bowl of W white and B black balls, where the probability of selecting a white ball is W/(W + B). This value is derived from symmetry assumptions rather than from experimentation. One must question, however, the validity of such assumptions in other scenarios, such as determining the probability of a newborn being a boy. How should the probability of an event be defined when the symmetrical hypothesis does not apply?
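As a quick sanity check (an illustration added here, not part of the book; the counts W and B are arbitrary), the symmetrical value W/(W + B) can be verified by simulation in R:

set.seed(0)
W <- 3; B <- 7                                        ## assumed bowl composition
bowl <- c(rep("white", W), rep("black", B))
draws <- sample(bowl, size = 100000, replace = TRUE)  ## repeated random draws
mean(draws == "white")                                ## close to W/(W+B) = 0.3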

Frequentist definition of probability

Let us consider a random experiment and an event E. Suppose we repeat the experiment N times and record the number of times N_E that the event E occurs. The quantity

N_E / N   (2.1.5)

is the relative frequency of E and ranges between 0 and 1. When the experiment is conducted repeatedly under identical conditions, the frequency approaches a stable value as N increases. This insight led von Mises to put frequency at the core of his definition of probability.

Definition 1.1 (von Mises). The probability Prob{E} of an event E is the limit

Prob{E} = lim_{N→∞} N_E / N

where N is the number of observations and N_E is the number of times that E occurred.

This definition is reasonable and in accordance with the axioms of Section 2.1.1. In practice, however, the number N is finite, so the limit has to be regarded as a hypothesis rather than as an experimentally determinable quantity. Despite this limitation, the frequentist interpretation is crucial for linking the theory to real-world applications. It falls short, however, when probability is used to model a degree of belief, such as the likelihood of your professor winning a Nobel Prize, where it is problematic to define the number N of repetitions.

An important alternative is the Bayesian interpretation of probability as a degree of belief: Prob{E} quantifies an observer's confidence that the event E is true or will occur. The Bayesian approach to statistics and data analysis will not be covered here; the interested reader is referred to [51].

The Law of Large Numbers

A well-known justification of the frequentist approach is provided by the Weak Law of Large Numbers, proposed by Bernoulli.

Theorem 1.2. Let Prob{E} = p and suppose that the event E occurs N_E times in N trials. Then N_E/N converges to p in probability; that is, for any ε > 0,

lim_{N→∞} Prob{|N_E/N − p| ≤ ε} = 1

1. As Keynes said, “In the long run we are all dead”.


Figure 2.1: Fair coin tossing random experiment: evolution of the relative frequency (left) and of the absolute difference between the number of heads and tails (right) (R script freq.R).

In other words, the relative frequency N_E/N converges in probability to p: for any ε > 0, the probability that |N_E/N − p| ≤ ε tends to one as N goes to infinity. This result justifies the common use of Monte Carlo simulation to solve numerically probability problems.

Note that this does NOT mean that N_E will be close to N·p. In a fair coin-tossing game, for instance, the law does not guarantee that the absolute difference between the numbers of heads and tails stays near zero: this difference can keep growing, albeit more slowly than the total number of tosses.

The script freq.R simulates 600,000 tosses of a fair coin, recording each outcome as heads (H) or tails (T), and produces two plots: the first shows the relative frequency of heads, which converges towards 0.5 as the number of trials increases; the second shows the absolute difference between the numbers of heads and tails, which keeps fluctuating and growing throughout the trials (Figure 2.1).
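A plausible reconstruction of the script (the original listing was lost in extraction, so treat the details as an approximation):

### script freq.R ###
set.seed(1)                          ## initialization of the random seed
R <- 600000                          ## number of coin tosses
tosses <- sample(c("H", "T"), R, replace = TRUE)
heads <- cumsum(tosses == "H")       ## running count of heads
diffs <- abs(2 * heads - (1:R))      ## |#heads - #tails| after each toss
par(mfrow = c(1, 2))
plot(heads / (1:R), type = "l", xlab = "N",
     ylab = "Relative frequency of heads")   ## converges to 0.5
abline(h = 0.5, lty = 2)
plot(diffs, type = "l", xlab = "N",
     ylab = "|#heads - #tails|")             ## fluctuates and tends to grow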

Independence and conditional probability

Let us introduce the definition of independent events.

Definition 1.3 (Independent events). Two events E1 and E2 are independent if and only if

Prob{E1 ∩ E2} = Prob{E1}Prob{E2}   (2.1.7)

and we write E1 ⊥⊥ E2.

Note that the quantity Prob{E1 ∩ E2} is often referred to as the joint probability and denoted by Prob{E1, E2}.

As an example of two independent events, think of two outcomes of a roulette wheel or of two coins tossed simultaneously.

When rolling a fair die, let x represent the outcome. Define the events E1: x is an even number; E2: x ≥ 3; E3: x ∈ {4, 5, 6}.

Are the events E1 and E2 independent? Are the events E1 and E3 independent?
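A short R sketch (added here for illustration) answers both questions by enumerating the six equally likely outcomes of the die:

outcomes <- 1:6
E1 <- outcomes %% 2 == 0       ## x is even
E2 <- outcomes >= 3            ## x greater than or equal to 3
E3 <- outcomes >= 4            ## x equal to 4, 5 or 6
p <- function(E) mean(E)       ## symmetrical definition: favourable/total
c(p(E1 & E2), p(E1) * p(E2))   ## 1/3 and 1/3: E1 and E2 are independent
c(p(E1 & E3), p(E1) * p(E3))   ## 1/3 versus 1/4: E1 and E3 are dependent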

Let E1 and E2 be two disjoint events with positive probability. Can they be independent? The answer is no, since

Prob{E1 ∩ E2} = 0 ≠ Prob{E1}Prob{E2} > 0

Let E1 be an event such that Prob{E1} > 0 and E2 another event. We define the conditional probability of E2 given that E1 has occurred as follows.

Definition 1.4 (Conditional probability). If Prob{E1} > 0, then the conditional probability of E2 given E1 is

Prob{E2|E1} = Prob{E1 ∩ E2} / Prob{E1}   (2.1.8)


Note that, for any fixed E1, the quantity Prob{·|E1} satisfies the axioms of probability. For instance, if E2, E3 and E4 are disjoint events, we have

Prob{E2 ∪ E3 ∪ E4 | E1} = Prob{E2|E1} + Prob{E3|E1} + Prob{E4|E1}

However, this does NOT generally hold for Prob{E1|·}, that is, when we fix the term E1 on the left of the conditional bar: for two disjoint events E2 and E3, in general

Prob{E1 | E2 ∪ E3} ≠ Prob{E1|E2} + Prob{E1|E3}

Also, it is generally NOT the case that Prob{E2|E1} = Prob{E1|E2}.

The following result derives from the definition of conditional probability.

Lemma 1. If E1 and E2 are independent events, then

Prob{E2|E1} = Prob{E1 ∩ E2} / Prob{E1} = Prob{E1}Prob{E2} / Prob{E1} = Prob{E2}

In qualitative terms, the independence of two events indicates that knowing one event (E1) has occurred does not alter the probability of the other event (E2) occurring.

Combined experiments

So far we have considered only events defined on a single sample space. The most interesting applications of probability, however, concern combined random experiments, whose sample space

Ω = Ω1 × Ω2 × ··· × Ωn

is the Cartesian product of several spaces Ωi, i = 1, ..., n.

For instance, if we want to study the probabilistic dependence between the height and the weight of a child, we have to define a joint sample space

Ω = {(w, h) : w ∈ Ω_w, h ∈ Ω_h}

made of all pairs (w, h), where Ω_w is the sample space of the random experiment describing the weight and Ω_h is the sample space of the random experiment describing the height.

All the properties discussed previously also apply to events that do not share the same sample space. For example, in a combined experiment Ω = Ω1 × Ω2, two events E1 ⊂ Ω1 and E2 ⊂ Ω2 are independent if

Prob{E1|E2} = Prob{E1}   (2.1.11)

Some examples of real problems modelled by random combined experiments are presented in the following.

In a fair coin-tossing game, each toss is independent, meaning past outcomes do not influence future results For example, after observing 10 consecutive tails, one might mistakenly believe that the likelihood of the next toss being heads has increased, a misconception known as the gambler’s fallacy However, the occurrence of such a rare event does not alter the probability of the next toss, which remains unaffected by previous outcomes.

Consider a medical study of the relation between the outcome of a medical test and the presence of a disease. We model the study as the combination of two random experiments:

1. the random experiment which models the state of the patient; its sample space is Ω_s = {H, S}, where H and S stand for healthy and sick patient, respectively;

2. the random experiment which models the outcome of the medical test; its sample space is Ω_o = {+, −}, where + and − stand for positive and negative outcome of the test, respectively.

The dependency between the state of the patient and the outcome of the test can be studied in terms of conditional probability.

Out of 1,000 patients tested, 108 showed a positive response, and among them 9 were actually affected by the disease; among the 892 patients with a negative response, only 1 was sick. According to the frequentist interpretation (2.1.5), the probabilities of the joint events can be approximated by the relative frequencies, e.g. Prob{S, +} ≈ 9/1000, Prob{H, +} ≈ 99/1000, Prob{S, −} ≈ 1/1000 and Prob{H, −} ≈ 891/1000.

A doctor is typically interested in the conditional probability Prob{E_o | E_s} of a positive or negative outcome given that the patient is sick or healthy, and in the conditional probability Prob{E_s | E_o} that the patient is sick or healthy given the outcome of the test. Both can be obtained from the joint probabilities by means of conditional probability.

The test looks accurate: the probability of a positive outcome for a sick patient is Prob{+|S} = 9/10 = 0.9. Nevertheless, a positive result does not imply a high probability of being ill: the probability of being sick given a positive outcome is only Prob{S|+} = 9/108 ≈ 0.083.

This example shows that humans sometimes confound Prob{E_s | E_o} with Prob{E_o | E_s}, and that the most intuitive answer is not always the right one.
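The following R sketch (not in the original text) reproduces the computation from the counts reported above:

counts <- matrix(c(9, 99, 1, 891), nrow = 2, byrow = TRUE,
                 dimnames = list(test = c("+", "-"), state = c("S", "H")))
P <- counts / sum(counts)      ## frequentist estimate of the joint probabilities
P["+", "S"] / sum(P[, "S"])    ## Prob{+|S} = 9/10 = 0.9: the test looks accurate
P["+", "S"] / sum(P["+", ])    ## Prob{S|+} = 9/108 ~ 0.083: illness is still unlikely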


The law of total probability and the Bayes’ theorem

Let us consider a practical situation where a set of events E1, E2, ..., Ek can occur, such that no two of them can occur simultaneously while at least one of them must occur. The events are then mutually exclusive and collectively exhaustive, and they form a partition of the sample space Ω. Two important theorems follow.

Theorem 1.5 (Law of total probability). Let Prob{Ei}, i = 1, ..., k denote the probabilities of the events Ei, and Prob{E|Ei}, i = 1, ..., k the conditional probabilities of a generic event E given that Ei has occurred. Then

Prob{E} = Σ_{i=1}^{k} Prob{E|Ei} Prob{Ei}   (2.1.12)

In this case the quantity Prob{E} is referred to as a marginal probability.

Theorem 1.6 (Bayes’ theorem). The conditional (“inverse”) probability of any Ei, i = 1, ..., k given that E has occurred is

Prob{Ei|E} = Prob{E|Ei} Prob{Ei} / Σ_{j=1}^{k} Prob{E|Ej} Prob{Ej} = Prob{E, Ei} / Prob{E}

For instance, consider the events:

  • E1: “Tomorrow is going to rain”;
  • E2: “Tomorrow is not going to rain”;
  • E: “Tonight is chilly and windy”.

The knowledge of Prob{E1}, Prob{E2} and Prob{E|Ek}, k = 1, 2, makes possible the computation of Prob{Ek|E}.
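As a hedged numerical illustration (the three probability values below are invented, not the book's), Bayes’ theorem can be applied in R as follows:

p.E1 <- 0.3                       ## Prob{E1}: tomorrow it rains (assumed)
p.E2 <- 1 - p.E1                  ## Prob{E2}: tomorrow it does not rain
p.E.given.E1 <- 0.8               ## Prob{E|E1} (assumed)
p.E.given.E2 <- 0.2               ## Prob{E|E2} (assumed)
## Bayes' theorem: Prob{E1|E} = Prob{E|E1}Prob{E1} / sum_k Prob{E|Ek}Prob{Ek}
p.E1.given.E <- p.E.given.E1 * p.E1 /
  (p.E.given.E1 * p.E1 + p.E.given.E2 * p.E2)
p.E1.given.E                      ## ~ 0.63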

Array of joint/marginal probabilities

Consider the combination of two random experiments with sample spaces Ω_A = {A1, A2, ..., An} and Ω_B = {B1, B2, ..., Bm}, and the joint probability values Prob{Ai, Bj} for each pair of events (Ai, Bj), i = 1, ..., n, j = 1, ..., m. The array of joint probabilities provides all the information necessary to compute all marginal and conditional probabilities by means of expressions (2.1.12) and (2.1.8):

           B1              B2              ···   Bm              Marginal
A1         Prob{A1, B1}    Prob{A1, B2}    ···   Prob{A1, Bm}    Prob{A1}
A2         Prob{A2, B1}    Prob{A2, B2}    ···   Prob{A2, Bm}    Prob{A2}
...        ...             ...                   ...             ...
An         Prob{An, B1}    Prob{An, B2}    ···   Prob{An, Bm}    Prob{An}
Marginal   Prob{B1}        Prob{B2}        ···   Prob{Bm}        1

The marginal probabilities are obtained by summing the rows and the columns of the array:

Prob{Ai} = Σ_{j=1}^{m} Prob{Ai, Bj},   Prob{Bj} = Σ_{i=1}^{n} Prob{Ai, Bj}

and they sum to one: Σ_j Prob{Bj} = Σ_i Prob{Ai} = 1. From an entry of the joint probability array and the corresponding row or column sum we can compute conditional probabilities, e.g. Prob{Ai|Bj} = Prob{Ai, Bj} / Prob{Bj}, as illustrated in the following example.

Let us model the commute time to go back home of an ULB student living in St. Gilles as a random experiment with sample space Ω_c = {LOW, MEDIUM, HIGH}, and the weather in Brussels as a second random experiment with sample space Ω_w = {G = GOOD, B = BAD}. The two experiments are linked by an array of joint probabilities with columns G (in Bxl), B (in Bxl) and Marginal (values omitted).

According to that probability function, is the commute time dependent on the weather in Brussels? Computing the conditional probabilities for good and for bad weather yields Prob{·|G} ≠ Prob{·|B}: the probability of having a certain commute time changes according to the weather, so the independence relation (2.1.11) is not satisfied.

Consider now the dependency between an event representing the commute time and an event describing the weather in Rome. The array of joint probabilities is

           G (in Rome)   B (in Rome)   Marginal
LOW        0.18          0.02          0.2
MEDIUM     0.45          0.05          0.5
HIGH       0.27          0.03          0.3
Marginal   0.9           0.1           1

Is the commute time dependent on the weather in Rome? If the weather in Rome is good we obtain

Prob{·|G} = {0.18/0.9, 0.45/0.9, 0.27/0.9} = {0.2, 0.5, 0.3}

while if the weather in Rome is bad

Prob{·|B} = {0.02/0.1, 0.05/0.1, 0.03/0.1} = {0.2, 0.5, 0.3}

Note that the probability of a commute-time event does NOT change according to the weather in Rome, e.g. Prob{LOW|B} = Prob{LOW}.

In other words, to predict the commute time in Brussels, knowing the weather in Brussels is informative, while knowing the weather in Rome is not.
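The Rome computation can be replayed in R (a sketch added here; the joint values are those reconstructed in the table above):

J <- cbind(G = c(LOW = 0.18, MEDIUM = 0.45, HIGH = 0.27),
           B = c(LOW = 0.02, MEDIUM = 0.05, HIGH = 0.03))
colSums(J)                 ## weather marginals: 0.9 and 0.1
J[, "G"] / sum(J[, "G"])   ## Prob{.|G} = 0.2 0.5 0.3
J[, "B"] / sum(J[, "B"])   ## Prob{.|B} = 0.2 0.5 0.3: same, hence independence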

Random variables

Example: Marginal/conditional probability function

Consider a probabilistic model of the day’s weather based on the combination of the following random descriptors:

1. the first represents the sky condition and its sample space is Ω = {CLEAR, CLOUDY};

2. the second represents the barometer trend and its sample space is Ω = {RISING, FALLING};

3. the third represents the humidity in the afternoon.

Let the joint probability values be given by the table

From the joint values we can calculate, for example, the probabilities P(CLEAR, RISING) = 0.47 and P(CLOUDY) = 0.35, as well as conditional probability values.

Machine learning and statistics focus on data analysis, and the connection between random experiments and data is established through the concept of random variables.

In a random experiment described by the triple (Ω, {E}, Prob{·}), a mapping rule Ω → Z ⊂ R assigns a real value z(ω) to each outcome ω: the random variable z takes the value z(ω) when the outcome of the experiment is ω. To distinguish a random variable from its observed value, we denote the random variable in boldface (z) and its value in normal font (e.g. z = 11). Since each event has an associated probability and there is a mapping from events to real values, we can associate a probability distribution with the random variable z.

Definition 2.1 (Random variable). Given a random experiment (Ω, {E}, Prob{·}), a random variable z is the result of a mapping Ω → Z that assigns a number z to every outcome ω. This mapping must satisfy the following two conditions:

  • the set {z ≤ z} is an event for every z;
  • the probabilities Prob{z = +∞} = Prob{z = −∞} = 0.

Given a random variable z ∈ Z and a subset I ⊂ Z, we define the inverse mapping

z^{-1}(I) = {ω ∈ Ω | z(ω) ∈ I}   (2.2.14)

where z^{-1}(I) ∈ {E} is an event. On the basis of this relation we can associate a probability measure to z according to

Prob{z ∈ I} = Prob{z^{-1}(I)} = Prob{ω ∈ Ω | z(ω) ∈ I}   (2.2.15)

Prob{z = z} = Prob{z^{-1}(z)} = Prob{ω ∈ Ω | z(ω) = z}   (2.2.16)

In other words, a random variable is a numerical quantity, linked to some experiment involving some degree of randomness, which takes its value from some set Z. For example, in the experiment of rolling two six-sided dice, the random variable can be defined as the sum of the two numbers shown (so that Z = {2, 3, ..., 12}) or as their maximum (so that Z = {1, 2, ..., 6}).

Consider the time it takes to commute back home from ULB, e.g. in order to decide when to leave if we do not want to miss the beginning of a Fiorentina match in the Champions League. Our experience is that this time is not constant, e.g. z1 = 10 minutes, z2 = 23 minutes, z3 = 17 minutes: its variability is the effect of a complex random process (weather, day of the week, local events, ...). The probabilistic approach models this uncertainty with a random variable z and treats each observed commute time zi as the realization z(ωi) of a random outcome ωi. The random variable gives a compact representation of the many causes underlying the variability and lets us compute, for instance, the time we should leave ULB so that the probability of missing the start of the match is less than 5 percent.
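A rough R sketch of this reasoning (added here; the commute-time data are synthetic):

set.seed(1)
commute <- rgamma(200, shape = 16, rate = 1)  ## synthetic past commute times (minutes)
quantile(commute, 0.95)  ## leaving this many minutes early keeps Prob{late} below 5%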

Discrete random variables

Parametric probability function

In some cases the probability function is not completely known but can be expressed as a function of the value z and a parameter θ. For example, consider a discrete random variable z taking values in Z = {1, 2, 3} with probability function

P_z(z) = θ^{2z} / (θ² + θ⁴ + θ⁶)

where θ is some fixed nonzero real number. Whatever the value of θ, P_z(z) > 0 for z = 1, 2, 3 and P_z(1) + P_z(2) + P_z(3) = 1. Therefore z is a well-defined random variable, even if the value of θ is unknown.

A parameter is a constant, typically unknown, that appears in the analytical expression of a probability function. The parametric form is an effective way to define a family of probabilistic models, and the estimation problem can then be seen as the problem of identifying the right parameter value.
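For instance, the parametric probability function above can be coded and checked in R (an added sketch):

P <- function(z, theta) theta^(2 * z) / (theta^2 + theta^4 + theta^6)
sum(P(1:3, theta = 0.7))   ## equals 1 for any nonzero theta
sum(P(1:3, theta = 2.5))   ## still 1: the parametric family is well defined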

Expected value, variance and standard deviation of a discrete r.v

Although the probability function P_z fully describes the uncertainty of z, working with it directly requires storing a number of values proportional to the size of Z. It is therefore often more convenient to use a compact representation of P_z obtained by computing a functional of it.

The most common single-number summary of the distribution P_z is the expected value, which is a measure of central tendency.

Definition 3.1 (Expected value). The expected value of a discrete random variable z is defined to be

E[z] = μ = Σ_{z∈Z} z P_z(z)   (2.3.18)

assuming that the sum is well defined.

Note that the expected value of a random variable does not always belong to its domain Z. Although “mean” is often used as a synonym of “expected value”, it must be kept distinct from the (sample) “average”; the difference is discussed in Section 3.3.2.

The concept of expected value was first introduced in the 17th century by C. Huygens in order to study the games of chance.

Consider European roulette, where a player bets $1 on a single number chosen among 0 to 36, the 0 being a win for the house. The player's gain is a random variable z taking values in Z = {−1, 35}: a loss of $1 with probability 36/37, or a win of $35 with probability 1/37. The expected gain is

E[z] = −1 · (36/37) + 35 · (1/37) = −1/37 ≈ −0.027

In other words, while casinos gain on average 2.7 cents for every staked dollar, players on average give away 2.7 cents, however sophisticated their betting strategy is.
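The computation can be checked in a few lines of R (added here):

z <- c(-1, 35)            ## possible gains
P <- c(36/37, 1/37)       ## their probabilities
sum(z * P)                ## expected gain: -1/37 ~ -0.027 dollars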

A common way to summarize the spread of a distribution is provided by the variance.

Definition 3.2 (Variance). The variance of a discrete random variable z is defined as

Var[z] = σ² = Σ_{z∈Z} (z − μ)² P_z(z)

The variance is a measure of the dispersion of the probability function of the random variable around its mean. Note that the following relation holds:

σ² = E[(z − E[z])²]   (2.3.22)
   = E[z²] − μ²       (2.3.25)

whatever the probability function of z. Figure 2.2 illustrates two example discrete r.v. probability functions which have the same mean but different variance.

Note that an alternative measure of spread could be E[|z − μ|]; however, this quantity is much more difficult to manipulate analytically than the variance.

The variance Var[z] does not have the same dimension as the values of z: for example, if z is measured in meters, Var[z] is expressed in square meters. The standard deviation, by contrast, is a measure of spread that has the same dimension as the random variable z.

Definition 3.3 (Standard deviation). The standard deviation of a discrete random variable z is defined as the positive square root of the variance:

σ = √(Var[z])

Let us consider a binary random variable z ∈ Z = {0, 1} with P_z(1) = p and P_z(0) = 1 − p. Then E[z] = p, Var[z] = p(1 − p), and the standard deviation is σ = √(p(1 − p)).

Moments of a discrete r.v

Definition 3.4 (Moment). For any positive integer r, the r-th moment of the probability function is

μ_r = E[z^r] = Σ_{z∈Z} z^r P_z(z)   (2.3.30)

Note that the first moment coincides with the mean μ, while the second moment is related to the variance through (2.3.25). Higher-order moments provide additional information, beyond the mean and variance, about the shape of the probability function.

Definition 3.5 (Skewness). The skewness of a discrete random variable z is defined as

γ = E[(z − μ)³] / σ³   (2.3.31)

Skewness is a key parameter that indicates the asymmetry of a random variable's probability function Probability functions exhibiting positive skewness feature long tails extending to the right, while those with negative skewness display long tails that extend to the left.

Entropy and relative entropy

Definition 3.6 (Entropy). Given a discrete r.v. z, the entropy of the probability function P_z(z) is defined by

H(z) = −Σ_{z∈Z} P_z(z) log P_z(z)

H(z) is a measure of the unpredictability of the r.v. z. Suppose that there are M possible values for z. The entropy is maximal and equal to log M when P_z(z) is uniform, i.e. each value has probability 1/M; it is minimal and equal to zero when all the probability is concentrated on a single value z̄, i.e. P(z̄) = 1 and P(z) = 0 elsewhere.
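A small R sketch (an addition to the text) contrasting the two extreme cases, here with M = 4 values:

H <- function(P) { P <- P[P > 0]; -sum(P * log(P)) }  ## entropy, with 0 log 0 = 0
H(rep(1/4, 4))    ## uniform probability: log(4) ~ 1.386, the maximum
H(c(1, 0, 0, 0))  ## all mass on one value: 0, the minimum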

Figure 2.3: A discrete probability function with positive skewness (left) and one with a negative skewness (right).

Entropy, like variance, quantifies the uncertainty of a random variable (r.v.), but it uniquely relies solely on the probabilities of the various outcomes rather than the outcomes themselves Thus, entropy can be viewed as a function of the probability distribution P(z) rather than the values z.

Let us now consider two different discrete probability functions on the same set of values,

P0 = P_{z0}(z),  P1 = P_{z1}(z)

where P0(z) > 0 if and only if P1(z) > 0.

The relative entropies (or Kullback-Leibler divergences) associated with these two functions are

KL(P0, P1) = Σ_z P0(z) log(P0(z)/P1(z)),  KL(P1, P0) = Σ_z P1(z) log(P1(z)/P0(z))

These asymmetric quantities measure the dissimilarity between the two functions. A symmetric formulation of the dissimilarity is provided by the divergence quantity

D = KL(P0, P1) + KL(P1, P0)
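In R (an illustrative sketch; the two probability functions are arbitrary):

P0 <- c(0.5, 0.3, 0.2)
P1 <- c(0.1, 0.4, 0.5)
KL <- function(P, Q) sum(P * log(P / Q))  ## relative entropy (Kullback-Leibler)
KL(P0, P1)               ## ~ 0.54
KL(P1, P0)               ## ~ 0.41: asymmetric in general
KL(P0, P1) + KL(P1, P0)  ## symmetric divergence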

Continuous random variable

Mean, variance, moments of a continuous r.v

Consider a continuous scalar r.v. whose range is Z = (l, h) and whose density function is p(z). We have the following definitions.

Definition 4.3 (Expectation or mean). The mean of a continuous scalar r.v. z is the scalar value

μ = E[z] = ∫_l^h z p(z) dz   (2.4.34)

Definition 4.4 (Variance). The variance of a continuous scalar r.v. z is the scalar value

σ² = E[(z − μ)²] = ∫_l^h (z − μ)² p(z) dz   (2.4.35)

Definition 4.5 (Moments). The r-th moment of a continuous scalar r.v. z is the scalar value

μ_r = E[z^r] = ∫_l^h z^r p(z) dz   (2.4.36)

Note that the moment of order r = 1 coincides with the mean of z.

Definition 4.6 (Upper critical point). For a given 0 ≤ α ≤ 1, the upper critical point of a continuous r.v. z is the number z_α such that

1 − α = Prob{z ≤ z_α} = F(z_α)

Figure 2.4 shows an example of cumulative distribution together with the upper critical point.

Joint probability

Marginal and conditional probability

Consider a set of n discrete random variables with a joint probability function, and let {z1, ..., zm} be a subset of size m < n. The marginal probabilities of the subset are obtained by summing over all possible values of the remaining variables:

P(z1, ..., zm) = Σ_{z_{m+1}} ··· Σ_{z_n} P(z1, ..., zm, z_{m+1}, ..., zn)   (2.5.37)

Compute the marginal probabilities P(z1 = 0) and P(z1 = 1) from the joint probability of the spam mail example.

For continuous random variables the marginal density is

p(z1, ..., zm) = ∫ ··· ∫ p(z1, ..., zm, z_{m+1}, ..., zn) dz_{m+1} ··· dz_n   (2.5.38)

The following definition for r.v.s derives directly from Equation (2.1.8).

Definition 5.1 (Conditional probability function). The conditional probability function for one subset of discrete variables {z_i : i ∈ S1}, given values for another disjoint subset {z_j : j ∈ S2} where S1 ∩ S2 = ∅, is defined as the ratio

P({z_i : i ∈ S1} | {z_j : j ∈ S2}) = P({z_i : i ∈ S1}, {z_j : j ∈ S2}) / P({z_j : j ∈ S2})

Note that if x and y are independent then P_x(x | y = y) = P_x(x). Also, since independence is symmetric, this is equivalent to saying that P_y(y | x = x) = P_y(y).

Definition 5.2 (Conditional density function). The conditional density function for one subset of continuous variables {z_i : i ∈ S1}, given values for another disjoint subset {z_j : j ∈ S2} where S1 ∩ S2 = ∅, is defined as the ratio

p({z_i : i ∈ S1} | {z_j : j ∈ S2}) = p({z_i : i ∈ S1}, {z_j : j ∈ S2}) / p({z_j : j ∈ S2})

where p({z_j : j ∈ S2}) is the marginal density of the set S2 of variables.

Chain rule

The chain rule (also known as the general product rule) computes the joint distribution of a set of n random variables by means of conditional probabilities:

p(z_n, ..., z_1) = p(z_n | z_{n−1}, ..., z_1) p(z_{n−1} | z_{n−2}, ..., z_1) ··· p(z_2 | z_1) p(z_1)

This rule is useful because it allows a complex joint distribution to be represented in terms of simpler conditional distributions.

Independence

Having defined the joint and the conditional probability, we can now define when two random variables are independent.

Definition 5.3 (Independent discrete random variables). Let x and y be two discrete random variables. The variables x and y are defined to be statistically independent (written as x ⊥⊥ y) if the joint probability

Prob{x = x, y = y} = Prob{x = x} Prob{y = y}   (2.5.41)

The definition can be easily extended to the continuous case.

Definition 5.4 (Independent continuous random variables). Two continuous variables x and y are defined to be statistically independent (written as x ⊥⊥ y) if the joint density

p(x = x, y = y) = p(x = x) p(y = y)   (2.5.42)

From the definition of independence and conditional density it follows that

x ⊥⊥ y ⇔ p(x|y) = p(x)   (2.5.43)

Note that in mathematical terms an independence assumption implies that the bivariate density function can be written in a simpler form, as the product of two univariate densities.

Independence between variables implies that the outcome of one variable does not influence the other. Note that independence is not reflexive (a variable is in general not independent of itself), nor transitive (the independence of two pairs of variables does not guarantee the independence of the first and the last variable). It is, however, symmetric: the independence of x from y is equivalent to the independence of y from x.

If we consider three variables instead of two, they are said to be mutually independent if and only if each pair of r.v.s is independent and

p(x, y, z) = p(x) p(y) p(z)

Note also that x ⊥⊥ (y, z) ⇒ x ⊥⊥ z and x ⊥⊥ y, but the opposite does not hold.

Check whether the variables z1 and z2 of the spam mail example are independent.

Conditional independence

Independence is not a stable relation: a random variable x may be independent of y (x ⊥⊥ y), yet conditioning on a third variable z may make x and y dependent; conversely, x and y may be dependent but become independent once z is given. This motivates the notion of conditional independence.

Definition 5.5 (Conditional independence). Two r.v.s x and y are conditionally independent given z = z (written x ⊥⊥ y | z = z) iff

p(x, y | z = z) = p(x | z = z) p(y | z = z)

Two r.v.s x and y are conditionally independent given z (written x ⊥⊥ y | z) iff they are conditionally independent for all values of z.

The notation x ⊥⊥ y | z = z indicates that x and y are independent when the condition z = z holds; it does not imply that they remain independent when z ≠ z. Consequently, two variables can be independent without being conditionally independent, and vice versa.

It can be shown that the following two assertions are equivalent:

p(x, y | z) = p(x | z) p(y | z)  and  p(x | y, z) = p(x | z)

If (x ⊥⊥ y | z), (z ⊥⊥ y | x) and (z ⊥⊥ x | y), then x, y and z are mutually independent.

If z is a random vector, the order of the conditional independence is equal to the number of variables in z.

Entropy in the continuous case

Consider a continuous r.v. y. The (differential) entropy of y is defined by

H(y) = −∫ log(p(y)) p(y) dy = E_y[−log p(y)] = E_y[log(1/p(y))]

with the convention that 0 log 0 = 0.

Entropy is a functional of the distribution of y and is a measure of the predictability of the r.v. y: the higher the entropy, the less reliable are our predictions about y.

In the case of a normal random vector Y = {y1, ..., yn} ∼ N(0, Σ), the entropy is

H(Y) = (1/2) log((2πe)^n det(Σ))

Consider two continuous r.v.s x and y and their joint density p(x, y). The joint entropy of x and y is defined by

H(x, y) = −∫∫ log(p(x, y)) p(x, y) dx dy = E_{x,y}[−log p(x, y)]

The conditional entropy is defined as

H(y|x) = −∫∫ log(p(y|x)) p(x, y) dx dy = E_{x,y}[−log p(y|x)]

This quantity quantifies the remaining uncertainty ofy oncexis known.

Note that in general H(y|x) ≠ H(x|y), that H(y) − H(y|x) = H(x) − H(x|y), and that the chain rule

H(x, y) = H(x) + H(y|x)

holds.

Another interesting property is the independence bound

H(x, y) ≤ H(x) + H(y)

with equality holding if and only if x and y are independent.

Common univariate discrete probability functions

The Bernoulli trial

A Bernoulli trial is a random experiment with two possible outcomes, often called “success” and “failure”, occurring with probability p and (1 − p), respectively. A Bernoulli random variable z is a binary discrete r.v. associated with the Bernoulli trial: it takes the value z = 0 with probability (1 − p) and z = 1 with probability p.

The probability function of z can be written in the form

Prob{z = z} = P_z(z) = p^z (1 − p)^{1−z},  z = 0, 1

Note that E[z] = p and Var[z] = p(1 − p).

The Binomial probability function

A binomial random variable represents the number of successes in a fixed number N of independent Bernoulli trials with the same probability of success p for each trial. A typical example is the number z of heads in N tosses of a coin.

The probability function of z ∼ Bin(N, p) is given by

Prob{z = z} = P_z(z) = (N choose z) p^z (1 − p)^{N−z},  z = 0, 1, ..., N   (2.6.44)

The mean of the probability function is μ = Np. Note that:

  • the Bernoulli probability function is a special case (N = 1) of the binomial function;
  • for small p, the probability of having at least one success in N trials is proportional to N, as long as Np is small;
  • if z1 ∼ Bin(N1, p) and z2 ∼ Bin(N2, p) are independent, then z1 + z2 ∼ Bin(N1 + N2, p).

The Geometric probability function

A r.v. z has a geometric probability function if it represents the number of successes before the first failure in a sequence of independent Bernoulli trials with probability of success p. Its probability function is

P_z(z) = p^z (1 − p),  z = 0, 1, 2, ...

The geometric probability function has an important property, known as the memoryless or Markov property: for any two integers z1 ≥ 0, z2 ≥ 0,

Prob{z ≥ z1 + z2 | z ≥ z1} = Prob{z ≥ z2}

Note that it is the only discrete probability function with this property.

A r.v. z has a generalized geometric probability function if it represents the number of Bernoulli trials preceding, but not including, the (k + 1)-th failure, where k is supposed to be given. The reader may verify that these functions are indeed probability functions, i.e. that Σ_{z∈Z} P_z(z) = 1, by using the formula for the sum of a geometric series.

The geometric (and generalized geometric) probability functions are encountered in some problems of queuing theory and risk analysis.

The Poisson probability function

A r.v. z has a Poisson probability function with parameter λ > 0, denoted z ∼ Poisson(λ), if

P_z(z) = e^{−λ} λ^z / z!,  z = 0, 1, 2, ...   (2.6.45)

The Poisson probability function is a limiting case of the binomial function: when the number of trials N is large and the probability p of success of each trial is small, while the product Np = λ stays moderate, the probability of z successes of a binomial r.v. is well approximated by the probability that a Poisson r.v. with parameter λ takes the value z.

• if z1 ∼ Poisson(λ1) and z2 ∼ Poisson(λ2) are independent, then z1 + z2 ∼ Poisson(λ1 + λ2).

The French football player Trezeguet has a one percent probability of missing a penalty shot. Let us compute the probability that he misses no penalty in 100 shots.

We have to compute the probability of z = 0 realizations of an event with probability p = 0.01 out of N = 100 trials. By using the binomial function (2.6.44), we obtain

Prob{z = 0} = (1 − 0.01)^100 ≈ 0.366

Let us now consider the Poisson approximation. Since λ = Np = 1, according to (2.6.45) we have

Prob{z = 0} = e^{−1} ≈ 0.368
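Both values can be checked with the built-in R probability functions:

dbinom(0, size = 100, prob = 0.01)  ## binomial: (1 - 0.01)^100 ~ 0.366
dpois(0, lambda = 1)                ## Poisson approximation: exp(-1) ~ 0.368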

Table 2.1 summarizes the expression of mean and variance for the above-mentioned probability functions.

Table 2.1: Common discrete probability functions

Common univariate continuous distributions

Uniform distribution

A random variable z is said to be uniformly distributed on the interval (a, b) (written as z ∼ U(a, b)) if its probability density function is given by

p(z) = 1/(b − a) if a < z < b, and p(z) = 0 otherwise.

Exponential distribution

A continuous random variable z is said to be exponentially distributed with rate λ > 0 (written as z ∼ E(λ)) if its probability density function is given by

p_z(z) = λ e^{−λz} if z ≥ 0, and p_z(z) = 0 otherwise.

This distribution is commonly used to describe physical phenomena (e.g. radioactive decay time or failure time) and can be considered as a continuous approximation to the geometric distribution (Section 2.6.3). Note that:

• the mean of z is 1/λ and its variance is 1/λ²;

• just like the geometric distribution, it satisfies the memoryless property.

The Gamma distribution

We say that z has a gamma distribution with parameters (n, λ) if its density function is

p_z(z) = λ^n z^{n−1} e^{−λz} / (n − 1)! if z > 0, and p_z(z) = 0 otherwise.

Note that:

• it is the distribution of the sum of n i.i.d.² r.v.s having exponential distribution with rate λ;

2. i.i.d. stands for identically and independently distributed (see also Section 3.1).

Table 2.2: Upper critical points for a standard distribution

• the exponential distribution is a special case (n = 1) of the gamma distribution.

Normal distribution: the scalar case

A continuous scalar random variable x is said to be normally distributed with parameters μ and σ² (written as x ∼ N(μ, σ²)) if its probability density function is given by

p(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

Note that:

• the normal distribution is also widely known as the Gaussian distribution;

• the mean of x is μ; the variance of x is σ²;

• the coefficient in front of the exponential ensures that ∫ p(x) dx = 1;

• the probability that an observation x from a normal r.v. falls within 2 standard deviations from the mean is 0.95.

If μ = 0 and σ² = 1, the distribution is called standard normal, and its distribution function is denoted F_z(z) = Φ(z). Given a normal r.v. x ∼ N(μ, σ²), the r.v.

z = (x − μ)/σ   (2.7.46)

has a standard normal distribution. Also, if z ∼ N(0, 1), then x = μ + σz ∼ N(μ, σ²).

If we denote by z_α the upper critical point (Definition 4.6) of the standard normal density, i.e.

1 − α = Prob{z ≤ z_α} = F(z_α) = Φ(z_α)

then the following relations hold:

z_α = Φ^{−1}(1 − α),  z_{1−α} = −z_α

Table 2.2 lists the most commonly used values of z_α.
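The tabulated values can be reproduced in R with the quantile function of the standard normal (an added check):

alpha <- c(0.1, 0.05, 0.025, 0.01, 0.005)
z.alpha <- qnorm(1 - alpha)   ## z_alpha = Phi^{-1}(1 - alpha)
rbind(alpha, z.alpha)         ## e.g. z_0.05 ~ 1.645, z_0.025 ~ 1.960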

In Table 2.3 we list some important relationships holding in the normal case

The following R code can be used to test the first and the third relation in Table 2.3 by using random sampling and simulation.

The number of distinct bootstrap sets that can be generated from a dataset of N points is (2N − 1 choose N). For example, if N = 3 and D_N = {a, b, c}, we have 10 different bootstrap sets: {a,b,c}, {a,a,b}, {a,a,c}, {b,b,a}, {b,b,c}, {c,c,a}, {c,c,b}, {a,a,a}, {b,b,b}, {c,c,c}.

Under balanced bootstrap sampling, the B bootstrap sets are generated in such a way that each original data point is present exactly B times in the entire collection of bootstrap samples.

Bootstrap estimate of the variance

For each bootstrap dataset D(b), b = 1, ..., B, we define the bootstrap replication

θ̂(b) = t(D(b)),  b = 1, ..., B

i.e. the value of the statistic for the specific bootstrap sample. The bootstrap approach then estimates the variance of the estimator θ̂ by the sample variance of the set of replications:

Var_bs[θ̂] = Σ_{b=1}^{B} (θ̂(b) − θ̂(·))² / (B − 1),  where θ̂(·) = Σ_{b=1}^{B} θ̂(b) / B

The term "bootstrap" differs in meaning from its usage in computer operating systems, where it refers to initiating a computer using a foundational set of instructions.

Figure 4.1: Bootstrap replications of a dataset and bootstrap statistic computation.

It can be shown that if θ̂ = μ̂ then, for B → ∞, the bootstrap estimate Var_bs[θ̂] converges to the variance Var[μ̂].

Bootstrap estimate of bias

Let θ̂ be the estimator based on the original sample D_N and

θ̂(·) = Σ_{b=1}^{B} θ̂(b) / B

the average of the bootstrap replications. Since Bias[θ̂] = E[θ̂] − θ, the bootstrap estimate of the bias of the estimator θ̂ is obtained by replacing E[θ̂] with θ̂(·) and θ with θ̂:

Bias_bs[θ̂] = θ̂(·) − θ̂

Then, since θ = E[θ̂] − Bias[θ̂], the bootstrap bias-corrected estimate is

θ̂_bs = θ̂ − Bias_bs[θ̂] = θ̂ − (θ̂(·) − θ̂) = 2θ̂ − θ̂(·)
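A minimal R sketch of these bootstrap estimates for the sample average (an added illustration, not the book's patch.R script):

set.seed(1)
DN <- rnorm(50, mean = 2)     ## original dataset (synthetic)
B <- 200
theta.hat <- mean(DN)         ## estimate on the original sample
theta.b <- replicate(B, mean(sample(DN, replace = TRUE)))  ## bootstrap replications
var(theta.b)                  ## bootstrap estimate of Var[theta.hat]
mean(theta.b) - theta.hat     ## bootstrap estimate of the bias
2 * theta.hat - mean(theta.b) ## bias-corrected estimate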

See the R file patch.R for the estimation of bias and variance in the case of the patch data example.

Bootstrap confidence interval

The bootstrap principle

Any estimation procedure for an unknown parameter θ of a distribution F_z with an estimator θ̂ aims at determining, or at least approximating, the distribution of θ̂ − θ. Computing the variance of θ̂ requires the knowledge of F_z, since it amounts to computing E_{D_N}[(θ̂ − E[θ̂])²]. In the practical scenario where F_z is unknown, this calculation cannot be performed analytically. The bootstrap method addresses the problem by (i) replacing F_z with its empirical counterpart F̂_z and (ii) using Monte Carlo simulation to estimate E_{D_N}[(θ̂ − E[θ̂])²] by resampling from the dataset.

The bootstrap technique yields a Monte Carlo approximation of the distribution of θ̂(b) − θ̂. The rationale is that the variability of the bootstrap replication θ̂(b) around the estimate θ̂ is expected to be close to the variability of θ̂ around the true parameter θ.

The bootstrap principle rests on two properties: first, as the sample size N increases, the empirical distribution F̂_z(·) converges (almost surely) to the true distribution F_z(·); second, as the number B of bootstrap samples increases, the bootstrap variance of θ̂ computed from the replications converges in probability to Ê_{D_N}[(θ̂ − E[θ̂])²], the plug-in estimate of the variance of θ̂ based on the empirical distribution.

Bootstrap estimation with a finite sample size N inherently involves two sources of error: a statistical error and a simulation error. The statistical error stems from the discrepancy between the true underlying distribution F_z(·) and the empirical distribution F̂_z(·); its magnitude depends on the choice of the estimator θ̂(D_N) and decreases as the number N of observations increases.

The simulation error stems from using the empirical (Monte Carlo) properties of θ̂(D_N) instead of its exact properties; it decreases by increasing the number B of bootstrap replications.

Unlike the jacknife method, in the bootstrap the number of replications B can be adjusted to the available computing resources. In practice two “rules of thumb” are typically used:

1. Even a small number of bootstrap replications, e.g. B = 25, is usually informative; B = 50 is often enough to give a good estimate of Var[θ̂].

2. Very seldom are more than B = 200 replications needed for estimating Var[θ̂]. Much bigger values of B are required for bootstrap confidence intervals.

Note that the use of rough statistics θ̂ (e.g. unsmooth or unstable ones) can make the resampling approach behave wildly. Examples of nonsmooth statistics are sample quantiles and the median.

In general terms, for i.i.d. observations, the following conditions are required for the convergence of the bootstrap estimate.

Randomization tests

Randomization and bootstrap

Bootstrap and randomization are both resampling techniques, but they have distinct characteristics. Randomized samples are created by scrambling the existing data without replacement, whereas bootstrap samples are formed by sampling with replacement from the original dataset. Randomization tests are particularly useful when the order or association within the data is believed to carry information, since they evaluate the null hypothesis that this order or association is purely random.

Bootstrap sampling, instead, is designed to characterize the sampling distribution of statistics, such as the mean, whose value does not depend on the order of the data. In such order-invariant scenarios randomization is ineffective, since the statistic computed on a sample scrambled without replacement is identical to the one computed on the original sample.

Permutation test


A permutation test is a nonparametric method used to compare two samples, {z_1, ..., z_M} and {y_1, ..., y_N}, drawn from unknown distributions F_z and F_y, respectively. The null hypothesis is that the two distributions are identical, whatever their analytical form. This allows statistical inference without distributional assumptions.

The permutation test relies on an order-independent test statistic t(D_N, D_M), which is compared with its distribution under the null hypothesis. To construct this distribution, all possible partitionings of the N + M observations into two subsets of sizes N and M are considered: there are R = (M+N)!/(M! N!) such configurations and, under the null hypothesis, each of them is equally probable. For the ith partitioning (i = 1, ..., R) the test computes the corresponding statistic t(i). The observed statistic t(D_N, D_M) is then compared with the set of values t(i): if it falls in the α/2 tails of their distribution, the null hypothesis is rejected with a type I error α.

The permutation procedure involves substantial computation unless M and N are small. When the number of permutations is too large, a random sample of a large number R of permutations can be taken instead.

Note that when the observations are drawn according to a normal distribution, it can be shown that the permutation test gives results close to those obtained using the t test.

Let us consider D_M = [74, 86, 98, 102, 89] (M = 5) and D_N = [10, 25, 80] (N = 3). We run a permutation test (R = 8!/(5! 3!) = 56 partitions) to test the hypothesis that the two sets belong to the same distribution (R script perm.R).

Let t(D_M, D_N) = μ̂(D_M) − μ̂(D_N) = 51.46. Figure 4.2 shows the position of the observed statistic with respect to the null sampling distribution.
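The following R sketch reproduces the logic of the test (it is not the perm.R script); for these small sample sizes all R = 56 partitions can be enumerated with combn.

## Sketch: permutation test for the difference of two sample averages
DM <- c(74, 86, 98, 102, 89)
DN <- c(10, 25, 80)
z <- c(DM, DN)
t.obs <- mean(DM) - mean(DN)                   # observed statistic (51.46)
idx <- combn(length(z), length(DM))            # all partitions of size M = 5
t.perm <- apply(idx, 2, function(i) mean(z[i]) - mean(z[-i]))
p.value <- mean(abs(t.perm) >= abs(t.obs))     # two-sided p-value
print(c(t.obs, p.value))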

The points with slack variable ξ_i > 0 are also the points for which α_i = γ.

The R script svm.R addresses a non-separable problem (boolean variable separable set to FALSE). In Figure 7.12 the training points of the two classes are plotted in red and blue, together with the separating hyperplane and the margin boundaries. The support vectors are marked in black; the points of the red class with positive slack variable are shown in yellow, those of the blue class with positive slack variable in green.

Once the value of γ is set, the SVM learning task reduces to a quadratic optimization problem, for which many methods and numerical software packages are available. The parameter γ acts as a complexity parameter, bounding the total amount by which classifications are allowed to fall on the wrong side of the margin. In practice it is selected by a structural identification procedure, where γ is varied over a wide range of values and each setting is assessed by a validation strategy.
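As an illustration, the sketch below selects the complexity parameter by 10-fold cross-validation with the e1071 R library; note that in this implementation the complexity parameter corresponds to the cost argument of svm, and the simulated data and the grid of values are arbitrary choices.

## Sketch: validation-based selection of the SVM complexity parameter
library(e1071)
set.seed(0)
N <- 100
X <- rbind(matrix(rnorm(2 * N), ncol = 2),               # class -1
           matrix(rnorm(2 * N, mean = 1.5), ncol = 2))   # class +1
Y <- factor(rep(c(-1, 1), each = N))
tuned <- tune.svm(X, Y, kernel = "linear",
                  cost = 10^seq(-2, 2))      # CV over a grid of cost values
print(tuned$best.parameters)
model <- svm(X, Y, kernel = "linear", cost = tuned$best.parameters$cost)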

Figure 7.12: Maximal margin hyperplane for a non-separable binary classification task, for different values of C. The support vectors are highlighted in black, the slack points of the red class in yellow and the slack points of the blue class in green.

In the previous chapter, we examined input/output regression problems characterized by linear relationships between inputs and outputs, as well as classification problems where the optimal decision boundaries are also linear.

The advantages of linear models are numerous:

• the least-squares estimate β̂ can be expressed in an analytical form and can be easily calculated through matrix computation.

• statistical properties of the estimator can be easily defined.

• recursive formulations for sequential updating are available.

Nonlinear regression

Artificial neural networks

Artificial neural networks (ANN), also known as neural nets, are computational models for parallel and distributed information processing, inspired by the neurons of the brain. In recent years, neural computing has moved towards a more statistically grounded interpretation, relying less on biological analogies and more on insights from statistical pattern recognition theory.

The feed-forward neural network, also known as the multi-layer perceptron (MLP), is the type of neural network most commonly used in supervised learning for classification and regression. Feed-forward artificial neural networks (FNN) have been applied successfully in many prediction tasks, including speech recognition, financial forecasting, image compression and adaptive industrial control.

Figure 8.5: Two-layer feed-forward NN

Feed-forward neural networks have a layered architecture, composed of processing units known as artificial neurons or nodes. Each node is connected to others through real-valued weights (the parameters), but is not connected to nodes of the same layer. FNNs have an input layer and an output layer, and typically include a bias unit in all layers except the output layer, playing a role analogous to the intercept term in linear models.

For simplicity, henceforth, we will consider only FNN with one single output. Let

• n be the number of inputs,

• H^(l) the number of hidden units of the lth layer (l = 1, ..., L) of the FNN,

• w_kv^(l) denote the weight of the link connecting the kth node in the (l−1)th layer and the vth node in the lth layer,

• z_v^(l), v = 1, ..., H^(l), the output of the vth hidden node of the lth layer,

• z_0^(l) denote the bias for the lth layer, l = 1, ..., L.

For l ≥ 1 the output of the vth (v = 1, ..., H^(l)) hidden unit of the lth layer is obtained by first forming a weighted linear combination of the H^(l−1) outputs of the lower layer,

a_v^(l) = Σ_{k=1}^{H^(l−1)} w_kv^(l) z_k^(l−1) + w_0v^(l) z_0^(l−1),    v = 1, ..., H^(l),

and then by transforming the sum using an activation function to give

z_v^(l) = g^(l)(a_v^(l)),    v = 1, ..., H^(l).

The activation function g^(l)(·) is typically a nonlinear transformation like the logistic or sigmoid function

g^(l)(z) = 1 / (1 + e^(−z)).

For L = 2 (i.e. a single hidden layer, or two-layer feed-forward NN), the input/output relation is given by

ŷ = h(x, α_N) = g^(2)(a_1^(2)) = g^(2)( Σ_{v=1}^{H^(1)} w_v1^(2) z_v^(1) + w_01^(2) z_0^(1) ),

where z_v^(1) = g^(1)( Σ_{k=1}^{n} w_kv^(1) x_k + w_0v^(1) z_0^(0) ).

Note that if g^(1)(·) and g^(2)(·) are linear mappings, this functional form becomes linear.
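The input/output relation above translates directly into R; the following is a minimal sketch (arbitrary random weights, sigmoid hidden units and an identity output unit, a common choice for regression).

## Sketch: forward pass of a two-layer feed-forward NN with a single output
sigmoid <- function(z) 1 / (1 + exp(-z))
fnn.predict <- function(x, W1, w2, g1 = sigmoid, g2 = identity) {
  a1 <- t(W1) %*% c(1, x)          # hidden activations a_v^(1) (bias z_0 = 1)
  z1 <- g1(a1)                     # hidden outputs z_v^(1)
  a2 <- sum(w2 * c(1, z1))         # output activation a_1^(2)
  g2(a2)                           # y.hat = g^(2)(a_1^(2))
}
set.seed(0)
n <- 2; H <- 3                                    # inputs and hidden nodes
W1 <- matrix(rnorm((n + 1) * H), nrow = n + 1)    # layer-1 weights (incl. bias)
w2 <- rnorm(H + 1)                                # layer-2 weights (incl. bias)
fnn.predict(c(0.5, -1), W1, w2)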

In an FNN, once the number of inputs and the form of the activation functions g(·) are fixed, two choices remain: the weights w^(l), l = 1, 2, and the number H of hidden nodes. The weights of the FNN play the role of the parameters α_N introduced in Section 5.1, where the hypothesis function h(·) was defined; calibrating them on a training dataset is the parametric identification step of neural networks, typically carried out by the backpropagation algorithm described in the following section.

The number H of hidden nodes represents the complexity s in Equation (5.6.43).

Increasing the value of H enlarges the class of input/output functions that an FNN can represent. In other words, the number of hidden nodes determines the representation capability of the FNN approximator and plays a crucial role in the structural identification of FNNs.

Backpropagation is an algorithm which, once the number of hidden nodes H is given, estimates the weights α_N = {w^(l), l = 1, 2} on the basis of the training set D_N. It is a gradient-based algorithm which aims to minimize the cost function

SSE_emp(α_N) = Σ_{i=1}^{N} (y_i − h(x_i, α_N))²,

where α_N = {w^(l), l = 1, 2} is the set of weights.

The backpropagation algorithm exploits the network structure in order to com- pute recursively the gradient.

The simplest (and least effective) backprop algorithm is an iterative gradient descent based on the iterative formula

α_N(k+1) = α_N(k) − η ∂SSE_emp(α_N(k)) / ∂α_N(k),    (8.1.2)

where α_N(k) is the weight vector at the kth iteration and η is the learning rate, which indicates the relative size of the change in weights.

Weights are initially set to random values and then iteratively adjusted so as to reduce the error; a convergence criterion decides when the algorithm stops. This simple procedure is often inefficient, requiring many iterations to reach a stationary point, and does not even guarantee a monotone decrease of the sum of squared errors (SSE).

Effective implementations of this algorithm utilize the Levenberg-Marquardt method, as detailed in Section 6.6.2.6.

To illustrate the computation of the derivatives, consider a single-input single-output network with one hidden layer, two hidden nodes and no bias units. The predictor takes the form

ŷ(x) = h(x, α_N) = g(a_1^(2)(x)) = g( w_11^(2) g(w_11^(1) x) + w_21^(2) g(w_12^(1) x) ),

where a_1^(2)(x) = w_11^(2) z_1^(1) + w_21^(2) z_2^(1) and α_N = [w_11^(1), w_12^(1), w_11^(2), w_21^(2)]. The backpropagation algorithm requires the derivatives of SSE_emp with respect to each weight w ∈ α_N.

Since

∂SSE_emp / ∂w = −2 Σ_{i=1}^{N} (y_i − ŷ(x_i)) ∂ŷ(x_i)/∂w

and the terms (y_i − ŷ(x_i)) are easy to compute, we focus on the computation of ∂ŷ/∂w.

As far as the weights {w_11^(2), w_21^(2)} of the hidden/output layer are concerned, we have

∂ŷ/∂w_v1^(2) = g′(a_1^(2)(x)) z_v^(1),    v = 1, 2.

Figure 8.7: Neural network fitting with s = 2 hidden nodes. The red continuous line represents the neural network estimation of the Doppler function.

As far as the weights {w_11^(1), w_12^(1)} of the input/hidden layer are concerned, we have

∂ŷ/∂w_1v^(1) = g′(a_1^(2)(x)) w_v1^(2) g′(a_v^(1)(x)) x,    v = 1, 2.

Note that the derivatives with respect to the weights of the lower layer reuse terms already calculated for the upper layer, in particular g′(a_1^(2)(x)). In other words, there is a backpropagation of numerical values from the upper layers to the lower ones, which justifies the name of the procedure.
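For the toy network above, the whole gradient can be written in a few lines of R; the following sketch (simulated data, sigmoid activations throughout, plain gradient descent as in Equation (8.1.2)) makes the reuse of g′(a_1^(2)) explicit.

## Sketch: backpropagation for the 1-input, 2-hidden-node network (no biases)
sigmoid  <- function(z) 1 / (1 + exp(-z))
dsigmoid <- function(z) sigmoid(z) * (1 - sigmoid(z))
set.seed(0)
x <- runif(50); y <- x^2                      # toy dataset
w <- rnorm(4) * 0.5     # w[1]=w11(1), w[2]=w12(1), w[3]=w11(2), w[4]=w21(2)
eta <- 0.05                                   # learning rate
for (k in 1:5000) {
  a1 <- w[1] * x; a2 <- w[2] * x              # hidden activations
  z1 <- sigmoid(a1); z2 <- sigmoid(a2)        # hidden outputs
  ao <- w[3] * z1 + w[4] * z2                 # output activation a_1^(2)
  e  <- y - sigmoid(ao)                       # residuals y_i - y.hat(x_i)
  go <- dsigmoid(ao)                          # g'(a_1^(2)), reused below
  grad <- -2 * c(sum(e * go * w[3] * dsigmoid(a1) * x),  # dSSE/dw11(1)
                 sum(e * go * w[4] * dsigmoid(a2) * x),  # dSSE/dw12(1)
                 sum(e * go * z1),                       # dSSE/dw11(2)
                 sum(e * go * z2))                       # dSSE/dw21(2)
  w <- w - eta * grad                         # gradient-descent update (8.1.2)
}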

Note that this algorithm presents all the typical drawbacks of the gradient-based procedures discussed in Section 6.6.2.7, such as slow convergence, convergence to local minima and sensitivity to the initialization of the weights.

The FNN learning algorithm for a single-hidden-layer architecture is implemented by the R library nnet. The script nnet.R shows the prediction accuracy for different numbers of hidden nodes (Figure 8.7 and Figure 8.8).
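A sketch along the lines of nnet.R is given below; the Doppler test function is assumed to be f(x) = sqrt(x(1 − x)) sin(2.1π/(x + 0.05)), and the numbers of hidden nodes are arbitrary choices.

## Sketch: fitting the Doppler function with nnet for increasing H
library(nnet)
set.seed(0)
doppler <- function(x) sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
X <- matrix(seq(0.1, 1, length.out = 200), ncol = 1)
y <- doppler(X[, 1]) + rnorm(200, sd = 0.05)
for (H in c(2, 7, 15)) {
  model <- nnet(X, y, size = H, linout = TRUE, maxit = 500, trace = FALSE)
  y.hat <- predict(model, X)
  cat("H =", H, " empirical MSE =", mean((y - y.hat)^2), "\n")
}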

A two-layer feed-forward neural network with sigmoidal hidden units is an important model for many practical applications: such networks can approximate any continuous mapping (one-to-one or many-to-one) between finite-dimensional spaces to arbitrary accuracy, provided the number of hidden units is sufficiently large.

Figure 8.8 shows the estimate of the Doppler function returned by a neural network with seven hidden nodes (red continuous line). Results of this kind stress the importance of a sufficiently large number of hidden units, but they have limited practical value, since they give no guidance on how to choose the number of hidden nodes for a finite sample size and a generic nonlinear mapping.

Choosing the appropriate number of hidden nodes in an FNN is a structural identification problem: different architectures are assessed in order to find the one closest to the optimal configuration. The assessment typically relies on cross-validation techniques or on regularization strategies based on complexity criteria.

The following example highlights the danger of overfitting when the structural identification of a neural network relies on the empirical risk rather than on more accurate estimates of the generalization error.

Consider a dataset D_N = {⟨x_i, y_i⟩}, i = 1, ..., N, where N = 50, x is a 3-dimensional normally distributed vector and the output is given by y = x_1² + 4 log(|x_2|) + 5x_3. We fit a single-hidden-layer neural network (R package nnet) with 15 hidden neurons on the whole training set (R script cv.R) and we want to assess its prediction accuracy on a new i.i.d. dataset of 50 samples. The empirical prediction MISE error is

MISE_emp = (1/N) Σ_{i=1}^{N} (y_i − h(x_i, α_N))² = 1.6 · 10⁻⁶,

where α_N is obtained by the parametric identification step. However, if we test h(·, α_N) on the test set we obtain a much larger error.

This neural network is seriously overfitting the dataset: the empirical error is a very bad estimate of the MISE.

We now perform a K-fold cross-validation, with K = 10, in order to have a better estimate of the MISE. The K = 10 cross-validated estimate of the MISE is

This figure is a much more reliable estimation of the prediction accuracy.

The leave-one-out estimate (K = N = 50) is

It follows that the cross-validated estimate could be used to select a more appropriate number of hidden neurons.
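A minimal sketch of the K-fold procedure used in this example (not the cv.R script; the data-generating mechanism follows the description above):

## Sketch: K-fold cross-validated MISE estimate for a nnet model
library(nnet)
set.seed(0)
N <- 50; K <- 10
X <- matrix(rnorm(N * 3), ncol = 3)            # 3-dimensional normal inputs
y <- X[, 1]^2 + 4 * log(abs(X[, 2])) + 5 * X[, 3]
folds <- sample(rep(1:K, length.out = N))      # random fold assignment
cv.err <- numeric(K)
for (k in 1:K) {
  tr <- folds != k                             # training indices for fold k
  model <- nnet(X[tr, ], y[tr], size = 15, linout = TRUE,
                maxit = 300, trace = FALSE)
  pred <- predict(model, X[!tr, , drop = FALSE])
  cv.err[k] <- mean((y[!tr] - pred)^2)         # test error on fold k
}
mean(cv.err)                                   # cross-validated MISE estimate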
