HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
OFFICE FOR INTERNATIONAL STUDY PROGRAM
ASSIGNMENT PROJECT REPORT
Lecturer: PHAN THI HUONG
Semester: 231
Ho Chi Minh City, Dec 2023
Student           Student ID    Percentage of work
Tran Hoai Son     2152943
Table of Contents
2.1.1 Simple linear regression model
2.1.2 Multiple linear regression model
2.2 RANDOM FOREST (DECISION TREES)
2.2.1 Definition
2.2.2 Characterizing the Accuracy of Random Forests
5.1 MULTIPLE LINEAR REGRESSION MODEL
5.1.1 Fitting the multiple linear regression model
5.1.2 Analysis of Variance Table
5.2 RANDOM FOREST
5.2.1 Tune parameters by using grid search
5.2.2 Set the train control for the model
5.2.3 Train random forest model
6.1 MULTIPLE LINEAR REGRESSION MODEL
6.1.1 Advantages
6.1.2 Disadvantages
1 Data introduction:
Context of data: Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.
Data collection method: The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw form (not scaled). Number of observations and features: the data has 8 quantitative input variables, 1 quantitative output variable, and 1030 instances (observations).
• Cement: measured in kg in a m³ mixture
• Blast furnace slag: measured in kg in a m³ mixture
• Fly ash: measured in kg in a m³ mixture
• Water: measured in kg in a m³ mixture
• Superplasticizer: measured in kg in a m³ mixture
• Coarse aggregate: measured in kg in a m³ mixture
• Fine aggregate: measured in kg in a m³ mixture
• Age: in days (1–365)
• Concrete compressive strength: measured in MPa
Population: Concrete compressive strength
Sample: 1030 observations (mixtures)
2 Background:
2.1 Linear Regression
Linear regression models the relationship between two variables by assuming a linear connection between the independent and dependent variables. It seeks the optimal line that minimizes the sum of squared differences between predicted and actual values. Applied in various domains like economics and finance, this method analyzes and forecasts data trends.
It can extend to multiple linear regression, which involves several independent variables, and to logistic regression, which is suited to binary classification problems.
2.1.1 Simple linear regression model
In simple linear regression, we attempt to model the relationship between two variables, for example, income and number of years of education, height and weight of people, length and width of envelopes, temperature and output of an industrial process, altitude and boiling point of water, or dose of a drug and response. For a linear relationship, we can use a model of the form

$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (1.1)$$

where $y$ is the dependent or response variable and $x$ is the independent or predictor variable. The random variable $\varepsilon$ is the error term in the model. In this context, error does not mean mistake but is a statistical term representing random fluctuations, measurement errors, or the effect of factors outside of our control.

The linearity of the model is an assumption. We typically add other assumptions about the distribution of the error terms, independence of the observed values of $y$, and so on. Using observed values of $x$ and $y$, we estimate $\beta_0$ and $\beta_1$ and make inferences such as confidence intervals and tests of hypotheses for $\beta_0$ and $\beta_1$. We may also use the estimated model to forecast or predict the value of $y$ for a particular value of $x$, in which case a measure of predictive accuracy may also be of interest.
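As an illustrative sketch, not part of the original report, model (1.1) can be fit in R with lm(); the names below assume the concrete data set introduced in Section 1, with cement as the single predictor:

# Fit y = beta_0 + beta_1 * x + e with strength as y and cement as x
fit_slr <- lm(strength ~ cement, data = concrete)

summary(fit_slr)   # estimates of beta_0 and beta_1, hypothesis tests, R^2
confint(fit_slr)   # confidence intervals for beta_0 and beta_1

# Predict y at a particular x, with an interval measuring predictive accuracy
predict(fit_slr, newdata = data.frame(cement = 300), interval = "prediction")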
2.1.2 Multiple linear regression model
The response $y$ is often influenced by more than one predictor variable. For example, the yield of a crop may depend on the amount of nitrogen, potash, and phosphate fertilizers used. These variables are controlled by the experimenter, but the yield may also depend on uncontrollable variables such as those associated with weather.
A linear model relating the response $y$ to several predictors has the form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \qquad (1.2)$$

The parameters $\beta_0, \beta_1, \ldots, \beta_k$ are called regression coefficients. As in (1.1), $\varepsilon$ provides for random variation in $y$ not explained by the $x$ variables. This random variation may be due partly to other variables that affect $y$ but are not known or not observed. The model in (1.2) is linear in the $\beta$ parameters; it is not necessarily linear in the $x$ variables.
A model provides a theoretical framework for better understanding of a phenomenon of interest. Thus a model is a mathematical construct that we believe may represent the mechanism that generated the observations at hand. The postulated model may be an idealized oversimplification of the complex real-world situation, but in many such cases, empirical models provide useful approximations of the relationships among variables. These relationships may be either associative or causative.
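For completeness, a hedged sketch of fitting model (1.2) in R; the report's actual fit appears in Section 5.1, and the column names assume the concrete data set:

# Fit y = beta_0 + beta_1*x_1 + ... + beta_k*x_k + e with all eight predictors
fit_mlr <- lm(strength ~ cement + slag + ash + water + superplastic +
                coarseagg + fineagg + age,
              data = concrete)

summary(fit_mlr)   # estimated regression coefficients and their tests
anova(fit_mlr)     # analysis of variance table, as in Section 5.1.2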
2.2 Random forest (Decision Trees):
2.2.1 Definition
Random forest is a commonly used machine learning algorithm, trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.
Since the random forest model is made up of multiple decision trees, it would be helpful to start by describing the decision tree algorithm briefly. Decision trees start with a basic question, such as, "Should I surf?" From there, you can ask a series of questions to determine an answer, such as, "Is it a long period swell?" or "Is the wind blowing offshore?" These questions make up the decision nodes in the tree, acting as a means to split the data. Each question helps an individual to arrive at a final decision, which would be denoted by the leaf node. Observations that fit the criteria will follow the "Yes" branch and those that don't will follow the alternate path. Decision trees seek to find the best split to subset the data, and they are typically trained through the Classification and Regression Tree (CART) algorithm. Metrics such as Gini impurity, information gain, or mean square error (MSE) can be used to evaluate the quality of the split; a small sketch of one such metric follows below.
This decision tree is an example of a classification problem, where the class labels are "surf" and "don't surf."
While decision trees are common supervised learning algorithms, they can be prone to problems such as bias and overfitting. However, when multiple decision trees form an ensemble in the random forest algorithm, they predict more accurate results, particularly when the individual trees are uncorrelated with each other.
The random forest algorithm is an extension of the bagging method, as it utilizes both bagging and feature randomness to create an uncorrelated forest of decision trees. Feature randomness, also known as feature bagging or "the random subspace method", generates a random subset of features, which ensures low correlation among decision trees. This is a key difference between decision trees and random forests: while decision trees consider all the possible feature splits, random forests only select a subset of those features.
Random forest algorithms have three main hyperparameters, which need to be set before training: node size, the number of trees, and the number of features sampled. From there, the random forest model can be used to solve regression or classification problems.
The random forest algorithm is made up of a collection of decision trees, and each tree in the ensemble is built from a data sample drawn from the training set with replacement, called the bootstrap sample. Of that training sample, one-third is set aside as test data, known as the out-of-bag (OOB) sample, which we'll come back to later. Another instance of randomness is then injected through feature bagging, adding more diversity to the dataset and reducing the correlation among decision trees. Depending on the type of problem, the determination of the prediction will vary: for a regression task, the individual decision trees will be averaged, and for a classification task, a majority vote (i.e., the most frequent categorical variable) will yield the predicted class. Finally, the OOB sample is used for cross-validation, finalizing that prediction.
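The ensemble just described maps onto the randomForest package in R. The sketch below assumes the concrete data frame from Section 1; the hyperparameter values are illustrative, not the tuned values from Section 5.2:

library(randomForest)

# The three hyperparameters named above: ntree (number of trees),
# mtry (features sampled at each split), nodesize (minimum node size)
set.seed(42)
rf_model <- randomForest(strength ~ ., data = concrete,
                         ntree = 500, mtry = 3, nodesize = 5,
                         importance = TRUE)

rf_model               # prints the OOB error estimate described above
importance(rf_model)   # per-feature importance across the ensemble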
2.2.2 Characterizing the Accuracy of Random Forests
2.2.2.1 Random Forests Converge
Given a collection of classifiers $h_1(x), h_2(x), \ldots, h_K(x)$, and with the training set drawn at random from the distribution of the random vector $Y, X$, define the margin function as

$$mg(X, Y) = \mathrm{av}_k\, I(h_k(X) = Y) - \max_{j \neq Y} \mathrm{av}_k\, I(h_k(X) = j)$$
where $I(\cdot)$ is the indicator function. The margin measures the extent to which the average number of votes at $X, Y$ for the right class exceeds the average vote for any other class. The larger the margin, the more confidence in the classification. The generalization error is given by
$$PE^{*} = P_{X,Y}(mg(X, Y) < 0)$$
where the subscripts $X, Y$ indicate that the probability is over the $X, Y$ space. In random forests, $h_k(X) = h(X, \Theta_k)$. For a large number of trees, it follows from the Strong Law of Large Numbers and the tree structure that,
as the number of trees increases, for almost surely all sequences $\Theta_1, \Theta_2, \ldots$, $PE^{*}$ converges to

$$P_{X,Y}\left(P_{\Theta}(h(X, \Theta) = Y) - \max_{j \neq Y} P_{\Theta}(h(X, \Theta) = j) < 0\right)$$
This result explains why random forests do not overfit as more trees are added, but instead produce a limiting value of the generalization error.
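This limiting behavior can be checked empirically. As an illustrative sketch (assuming the concrete data frame and the randomForest package), plotting OOB error against the number of trees shows it level off rather than rise:

library(randomForest)

set.seed(42)
rf_conv <- randomForest(strength ~ ., data = concrete, ntree = 1000)

# rf_conv$mse[k] is the OOB mean squared error of the forest built from
# the first k trees; the curve flattens toward the limiting error
plot(rf_conv$mse, type = "l",
     xlab = "Number of trees", ylab = "OOB MSE")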
2.2.2.2 Strength and Correlation
For random forests, an upper bound can be derived for the generalization error in terms of two parameters that measure how accurate the individual classifiers are and how dependent they are on one another. The interplay between these two gives the foundation for understanding the workings of random forests. We build on the analysis in Amit and Geman.
The margin function of a random forest, writing $\hat{\jmath}(X, Y) = \arg\max_{j \neq Y} P_{\Theta}(h(X, \Theta) = j)$ for the most-voted incorrect class, is

$$mr(X, Y) = P_{\Theta}(h(X, \Theta) = Y) - P_{\Theta}(h(X, \Theta) = \hat{\jmath}(X, Y)) = E_{\Theta}\left[I(h(X, \Theta) = Y) - I(h(X, \Theta) = \hat{\jmath}(X, Y))\right]$$
The raw margin function is

$$rmg(\Theta, X, Y) = I(h(X, \Theta) = Y) - I(h(X, \Theta) = \hat{\jmath}(X, Y))$$
Thus, $mr(X, Y)$ is the expectation of $rmg(\Theta, X, Y)$ with respect to $\Theta$. For any function $f$, the identity

$$[E_{\Theta} f(\Theta)]^2 = E_{\Theta,\Theta'}\, f(\Theta) f(\Theta')$$

holds, where $\Theta, \Theta'$ are independent with the same distribution, implying that

$$mr(X, Y)^2 = E_{\Theta,\Theta'}\, rmg(\Theta, X, Y)\, rmg(\Theta', X, Y)$$
Using this gives

$$\mathrm{var}(mr) = E_{\Theta,\Theta'}\left(\mathrm{cov}_{X,Y}\, rmg(\Theta, X, Y)\, rmg(\Theta', X, Y)\right) = E_{\Theta,\Theta'}\left(\rho(\Theta, \Theta')\, sd(\Theta)\, sd(\Theta')\right)$$
where $\rho(\Theta, \Theta')$ is the correlation between $rmg(\Theta, X, Y)$ and $rmg(\Theta', X, Y)$ holding $\Theta, \Theta'$ fixed, and $sd(\Theta)$ is the standard deviation of $rmg(\Theta, X, Y)$ holding $\Theta$ fixed.
It follows that

$$\mathrm{var}(mr) = \bar{\rho}\left(E_{\Theta}\, sd(\Theta)\right)^2 \le \bar{\rho}\, E_{\Theta}\, \mathrm{var}(\Theta)$$

where $\bar{\rho}$ is the mean value of the correlation; that is,

$$\bar{\rho} = \frac{E_{\Theta,\Theta'}\left(\rho(\Theta, \Theta')\, sd(\Theta)\, sd(\Theta')\right)}{E_{\Theta,\Theta'}\left(sd(\Theta)\, sd(\Theta')\right)}$$

Write

$$E_{\Theta}\, \mathrm{var}(\Theta) \le E_{\Theta}\left(E_{X,Y}\, rmg(\Theta, X, Y)\right)^2 - s^2 \le 1 - s^2$$

where $s = E_{X,Y}\, mr(X, Y)$ is the strength of the set of classifiers. Combining these with Chebyshev's inequality yields the bound $PE^{*} \le \bar{\rho}(1 - s^2)/s^2$.
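As a quick numerical illustration with hypothetical values: for strength $s = 0.8$ and mean correlation $\bar{\rho} = 0.2$, the bound gives $PE^{*} \le 0.2 \times (1 - 0.8^2)/0.8^2 = 0.1125$, showing how weaker correlation between trees tightens the guaranteed generalization error.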
Figure 3.1: Data set
• Dealing with missing data

We implement some functions to test whether there are any missing data or related problems.

Figure 3.2: Counting the missing values

As can be seen from the figure above, there is no N/A value in our dataset, so it is suitable for the next step; however, if there were any N/A values, the command na.omit(concrete) could be used to filter them out.
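The counting commands themselves did not survive extraction; a minimal sketch of such a check in R, assuming the data frame is named concrete as elsewhere in the report:

colSums(is.na(concrete))        # missing-value count per column (cf. Figure 3.2)
anyNA(concrete)                 # TRUE if any value anywhere is missing
concrete <- na.omit(concrete)   # drop incomplete rows, as described above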
• Check incorrect values

These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate. The masses of these components must be non-negative, and the value of age must be between 1 and 365 days.
# Check for incorrect values and omit them
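# (Sketch: the original command was not captured. Assuming non-negative
# ingredient masses and age in [1, 365] define valid rows.)
concrete <- subset(concrete,
                   cement > 0 & slag >= 0 & ash >= 0 & water > 0 &
                   superplastic >= 0 & coarseagg > 0 & fineagg > 0 &
                   age >= 1 & age <= 365)
nrow(concrete)  # confirm how many observations remain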
4 Descriptive statistics:
• Data summary
After having done the cleaning process, we currently have a clear and clean data set in the data frame. Let's summarize concrete by using the function summary in R.
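The call is a one-liner (a sketch; the report names the function but the command itself was not captured):

summary(concrete)   # per-column minimum, median, mean, quartiles, and maximum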
cement          slag            ash              water           superplastic     coarseagg        fineagg
Min.   :102.0   Min.   :  0.0   Min.   :  0.00   Min.   :121.8   Min.   : 0.000   Min.   : 801.0   Min.   :594.0
Median :272.9   Median : 22.0   Median :  0.00   Median :185.0   Median : 6.400   Median : 968.0   Median :779.5
Mean   :281.2   Mean   : 73.9   Mean   : 54.19   Mean   :181.6   Mean   : 6.205   Mean   : 972.9   Mean   :773.6
3rd Qu.:350.0   3rd Qu.:142.9   3rd Qu.:118.30   3rd Qu.:192.0   3rd Qu.:10.200   3rd Qu.:1029.4   3rd Qu.:824.0
Max.   :540.0   Max.   :359.4   Max.   :200.10   Max.   :247.0   Max.   :32.200   Max.   :1145.0   Max.   :992.6
Figure 4.1: Summary data set
• Statistics and plots
Figure 4.2: Box plots for all concrete features

The use of box plots for concrete features can effectively illustrate the distribution, central tendency, and variability of the specific concrete features being analyzed. Box plots
can help in identifying outliers, comparing different feature distributions, and understanding the spread of the data.
According to the box plot chart, we can see that coarseagg accounts for a large proportion of the mass of concrete, while the substance with the smallest mass content in the mixture is superplasticizer. A notable point is the extremely large spread of the cement values, a range of about 438 kg/m³.
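A sketch of how a figure like 4.2 could be produced in base R (the report's plotting code was not captured):

# One box per column; las = 2 rotates axis labels for readability
boxplot(concrete, main = "Box Plots for Concrete Features", las = 2)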
Figure 4.3: Heatmap of relationships for concrete features

A heatmap of relationships for concrete features is a visual representation that displays the degree of correlation or association between different features or variables in a dataset. Each cell in the heatmap represents the correlation coefficient or some other measure of association between a pair of features, with the color intensity indicating the strength and direction of the relationship.
According to the heat map, we can clearly see that one feature has a high correlation with strength: cement.
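A hedged sketch of producing such a heatmap in R, here with the corrplot package; the report may have used a different plotting library:

library(corrplot)

corr_mat <- cor(concrete)   # Pearson correlations between all nine variables
corrplot(corr_mat, method = "color", addCoef.col = "black")   # cf. Figure 4.3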
Below is a pair plot to show the correlation between features more clearly.