HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
OFFICE FOR INTERNATIONAL STUDY PROGRAM
ASSIGNMENT PROJECT REPORT
Lecturer: PHAN THI HUONG
Semester: 231
Ho Chi Minh City, Dec 2023
Student           Student ID    Percentage of work
Tran Hoai Son     2152943
Table of Contents
2.1.1 Simple linear regression model
2.1.2 Multiple linear regression model
2.2 RANDOM FOREST (DECISION TREES)
2.2.1 Definition
2.2.2 Characterizing the Accuracy of Random Forests
5.1 MULTIPLE LINEAR REGRESSION MODEL
5.1.1 Fitting the multiple linear regression model
5.1.2 Analysis of Variance Table
5.2 RANDOM FOREST
5.2.1 Tune parameters by using grid search
5.2.2 Set the train control for the model
5.2.3 Train random forest model
6.1 MULTIPLE LINEAR REGRESSION MODEL
6.1.1 Advantages
6.1.2 Disadvantages
1 Data introduction:
Context of data: Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.
Data collection method: The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw form (not scaled). Number of observations and features: the data has 8 quantitative input variables, 1 quantitative output variable, and 1030 instances (observations).
• Cement: measured in kg in a m³ mixture
• Blast furnace slag: measured in kg in a m³ mixture
• Fly ash: measured in kg in a m³ mixture
• Water: measured in kg in a m³ mixture
• Superplasticizer: measured in kg in a m³ mixture
• Coarse aggregate: measured in kg in a m³ mixture
• Fine aggregate: measured in kg in a m³ mixture
• Age: in days (1–365)
• Concrete compressive strength: measured in MPa
Population: Concrete compressive strength
Sample: 1030 observations (mixtures)
2 Background:
2.1 Linear Regression
Linear regression models the relationship between two variables by assuming a linear connection between the independent and dependent variables. It seeks the optimal line that minimizes the sum of squared differences between predicted and actual values. Applied in various domains like economics and finance, this method analyzes and forecasts data trends.
It can extend to multiple linear regression, which involves several independent variables, and to logistic regression, which is suited to binary classification problems.
2.1.1 Simple linear regression model
In simple linear regression, we attempt to model the relationship between two variables, for example, income and number of years of education, height and weight of people, length and width of envelopes, temperature and output of an industrial process, altitude and boiling point of water, or dose of a drug and response. For a linear relationship, we can use a model of the form

$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (1.1)$$

where $y$ is the dependent or response variable and $x$ is the independent or predictor variable. The random variable $\varepsilon$ is the error term in the model. In this context, error does not mean mistake but is a statistical term representing random fluctuations, measurement errors, or the effect of factors outside of our control.

The linearity of the model is an assumption. We typically add other assumptions about the distribution of the error terms, independence of the observed values of $y$, and so on. Using observed values of $x$ and $y$, we estimate $\beta_0$ and $\beta_1$ and make inferences such as confidence intervals and tests of hypotheses for $\beta_0$ and $\beta_1$. We may also use the estimated model to forecast or predict the value of $y$ for a particular value of $x$, in which case a measure of predictive accuracy may also be of interest.
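As an illustrative sketch, not part of the original report, model (1.1) can be fit in R with lm(); the names below assume the concrete data set introduced in Section 1, with cement as the single predictor:

# Fit y = beta_0 + beta_1 * x + e with strength as y and cement as x
fit_slr <- lm(strength ~ cement, data = concrete)

summary(fit_slr)   # estimates of beta_0 and beta_1, hypothesis tests, R^2
confint(fit_slr)   # confidence intervals for beta_0 and beta_1

# Predict y at a particular x, with an interval measuring predictive accuracy
predict(fit_slr, newdata = data.frame(cement = 300), interval = "prediction")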
2.1.2 Multiple linear regression model
The response $y$ is often influenced by more than one predictor variable. For example, the yield of a crop may depend on the amount of nitrogen, potash, and phosphate fertilizers used. These variables are controlled by the experimenter, but the yield may also depend on uncontrollable variables such as those associated with weather.
A linear model relating the response $y$ to several predictors has the form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \qquad (1.2)$$

The parameters $\beta_0, \beta_1, \ldots, \beta_k$ are called regression coefficients. As in (1.1), $\varepsilon$ provides for random variation in $y$ not explained by the $x$ variables. This random variation may be due partly to other variables that affect $y$ but are not known or not observed. The model in (1.2) is linear in the $\beta$ parameters; it is not necessarily linear in the $x$ variables.
A model provides a theoretical framework for better understanding of a phenomenon of interest. Thus a model is a mathematical construct that we believe may represent the mechanism that generated the observations at hand. The postulated model may be an idealized oversimplification of the complex real-world situation, but in many such cases, empirical models provide useful approximations of the relationships among variables. These relationships may be either associative or causative.
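For completeness, a hedged sketch of fitting model (1.2) in R; the report's actual fit appears in Section 5.1, and the column names assume the concrete data set:

# Fit y = beta_0 + beta_1*x_1 + ... + beta_k*x_k + e with all eight predictors
fit_mlr <- lm(strength ~ cement + slag + ash + water + superplastic +
                coarseagg + fineagg + age,
              data = concrete)

summary(fit_mlr)   # estimated regression coefficients and their tests
anova(fit_mlr)     # analysis of variance table, as in Section 5.1.2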
2.2 Random forest (Decision Trees):
2.2.1 Definition
Random forest is a commonly used machine learning algorithm, trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.
Since the random forest model is made up of multiple decision trees, it would be helpful to start by describing the decision tree algorithm briefly. Decision trees start with a basic question, such as, "Should I surf?" From there, you can ask a series of questions to determine an answer, such as, "Is it a long period swell?" or "Is the wind blowing offshore?" These questions make up the decision nodes in the tree, acting as a means to split the data. Each question helps an individual to arrive at a final decision, which would be denoted by the leaf node. Observations that fit the criteria will follow the "Yes" branch and those that don't will follow the alternate path. Decision trees seek to find the best split to subset the data, and they are typically trained through the Classification and Regression Tree (CART) algorithm. Metrics such as Gini impurity, information gain, or mean square error (MSE) can be used to evaluate the quality of the split; a small sketch of one such metric follows below.
This decision tree is an example of a classification problem, where the class labels are "surf" and "don't surf."
While decision trees are common supervised learning algorithms, they can be prone to problems such as bias and overfitting. However, when multiple decision trees form an ensemble in the random forest algorithm, they predict more accurate results, particularly when the individual trees are uncorrelated with each other.
The random forest algorithm is an extension of the bagging method, as it utilizes both bagging and feature randomness to create an uncorrelated forest of decision trees. Feature randomness, also known as feature bagging or "the random subspace method", generates a random subset of features, which ensures low correlation among decision trees. This is a key difference between decision trees and random forests: while decision trees consider all the possible feature splits, random forests only select a subset of those features.
Random forest algorithms have three main hyperparameters, which need to be set before training: node size, the number of trees, and the number of features sampled. From there, the random forest model can be used to solve regression or classification problems.
The random forest algorithm is made up of a collection of decision trees, and each tree in the ensemble is built from a data sample drawn from the training set with replacement, called the bootstrap sample. Of that training sample, one-third is set aside as test data, known as the out-of-bag (OOB) sample, which we'll come back to later. Another instance of randomness is then injected through feature bagging, adding more diversity to the dataset and reducing the correlation among decision trees. Depending on the type of problem, the determination of the prediction will vary: for a regression task, the individual decision trees will be averaged, and for a classification task, a majority vote (i.e., the most frequent categorical variable) will yield the predicted class. Finally, the OOB sample is used for cross-validation, finalizing that prediction.
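The ensemble just described maps onto the randomForest package in R. The sketch below assumes the concrete data frame from Section 1; the hyperparameter values are illustrative, not the tuned values from Section 5.2:

library(randomForest)

# The three hyperparameters named above: ntree (number of trees),
# mtry (features sampled at each split), nodesize (minimum node size)
set.seed(42)
rf_model <- randomForest(strength ~ ., data = concrete,
                         ntree = 500, mtry = 3, nodesize = 5,
                         importance = TRUE)

rf_model               # prints the OOB error estimate described above
importance(rf_model)   # per-feature importance across the ensemble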
2.2.2 Characterizing the Accuracy of Random Forests
2.2.2.1 Random Forests Converge
Given a collection of classifiers $h_1(x), h_2(x), \ldots, h_K(x)$, and with the training set drawn at random from the distribution of the random vector $Y, X$, define the margin function as

$$mg(X, Y) = \mathrm{av}_k\, I(h_k(X) = Y) - \max_{j \neq Y} \mathrm{av}_k\, I(h_k(X) = j)$$
where $I(\cdot)$ is the indicator function. The margin measures the extent to which the average number of votes at $X, Y$ for the right class exceeds the average vote for any other class. The larger the margin, the more confidence in the classification. The generalization error is given by
$$PE^{*} = P_{X,Y}(mg(X, Y) < 0)$$
where the subscripts $X, Y$ indicate that the probability is over the $X, Y$ space. In random forests, $h_k(X) = h(X, \Theta_k)$. For a large number of trees, it follows from the Strong Law of Large Numbers and the tree structure that,
as the number of trees increases, for almost surely all sequences $\Theta_1, \Theta_2, \ldots$, $PE^{*}$ converges to

$$P_{X,Y}\left(P_{\Theta}(h(X, \Theta) = Y) - \max_{j \neq Y} P_{\Theta}(h(X, \Theta) = j) < 0\right)$$
This result explains why random forests do not overfit as more trees are added, but instead produce a limiting value of the generalization error.
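This limiting behavior can be checked empirically. As an illustrative sketch (assuming the concrete data frame and the randomForest package), plotting OOB error against the number of trees shows it level off rather than rise:

library(randomForest)

set.seed(42)
rf_conv <- randomForest(strength ~ ., data = concrete, ntree = 1000)

# rf_conv$mse[k] is the OOB mean squared error of the forest built from
# the first k trees; the curve flattens toward the limiting error
plot(rf_conv$mse, type = "l",
     xlab = "Number of trees", ylab = "OOB MSE")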
2.2.2.2 Strength and Correlation
For random forests, an upper bound can be derived for the generalization error in terms of two parameters that measure how accurate the individual classifiers are and how dependent they are on one another. The interplay between these two gives the foundation for understanding the workings of random forests. We build on the analysis in Amit and Geman.
The margin function of a random forest, writing $\hat{\jmath}(X, Y) = \arg\max_{j \neq Y} P_{\Theta}(h(X, \Theta) = j)$ for the most-voted incorrect class, is

$$mr(X, Y) = P_{\Theta}(h(X, \Theta) = Y) - P_{\Theta}(h(X, \Theta) = \hat{\jmath}(X, Y)) = E_{\Theta}\left[I(h(X, \Theta) = Y) - I(h(X, \Theta) = \hat{\jmath}(X, Y))\right]$$
The raw margin function is

$$rmg(\Theta, X, Y) = I(h(X, \Theta) = Y) - I(h(X, \Theta) = \hat{\jmath}(X, Y))$$
Thus, $mr(X, Y)$ is the expectation of $rmg(\Theta, X, Y)$ with respect to $\Theta$. For any function $f$, the identity

$$[E_{\Theta} f(\Theta)]^2 = E_{\Theta,\Theta'}\, f(\Theta) f(\Theta')$$

holds, where $\Theta, \Theta'$ are independent with the same distribution, implying that

$$mr(X, Y)^2 = E_{\Theta,\Theta'}\, rmg(\Theta, X, Y)\, rmg(\Theta', X, Y)$$
Using this gives

$$\mathrm{var}(mr) = E_{\Theta,\Theta'}\left(\mathrm{cov}_{X,Y}\, rmg(\Theta, X, Y)\, rmg(\Theta', X, Y)\right) = E_{\Theta,\Theta'}\left(\rho(\Theta, \Theta')\, sd(\Theta)\, sd(\Theta')\right)$$
where $\rho(\Theta, \Theta')$ is the correlation between $rmg(\Theta, X, Y)$ and $rmg(\Theta', X, Y)$ holding $\Theta, \Theta'$ fixed, and $sd(\Theta)$ is the standard deviation of $rmg(\Theta, X, Y)$ holding $\Theta$ fixed.
It follows that

$$\mathrm{var}(mr) = \bar{\rho}\left(E_{\Theta}\, sd(\Theta)\right)^2 \le \bar{\rho}\, E_{\Theta}\, \mathrm{var}(\Theta)$$

where $\bar{\rho}$ is the mean value of the correlation; that is,

$$\bar{\rho} = \frac{E_{\Theta,\Theta'}\left(\rho(\Theta, \Theta')\, sd(\Theta)\, sd(\Theta')\right)}{E_{\Theta,\Theta'}\left(sd(\Theta)\, sd(\Theta')\right)}$$

Write

$$E_{\Theta}\, \mathrm{var}(\Theta) \le E_{\Theta}\left(E_{X,Y}\, rmg(\Theta, X, Y)\right)^2 - s^2 \le 1 - s^2$$

where $s = E_{X,Y}\, mr(X, Y)$ is the strength of the set of classifiers. Combining these with Chebyshev's inequality yields the bound $PE^{*} \le \bar{\rho}(1 - s^2)/s^2$.
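As a quick numerical illustration with hypothetical values: for strength $s = 0.8$ and mean correlation $\bar{\rho} = 0.2$, the bound gives $PE^{*} \le 0.2 \times (1 - 0.8^2)/0.8^2 = 0.1125$, showing how weaker correlation between trees tightens the guaranteed generalization error.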
Figure 3.1: Data set
• Dealing with missing data

We implement some functions to test whether there are any missing data or related problems.

Figure 3.2: Counting the missing values

As can be seen from the figure above, there is no N/A value in our dataset, so it is suitable for the next step; however, if there were any N/A values, the command na.omit(concrete) could be used to filter them out.
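The counting commands themselves did not survive extraction; a minimal sketch of such a check in R, assuming the data frame is named concrete as elsewhere in the report:

colSums(is.na(concrete))        # missing-value count per column (cf. Figure 3.2)
anyNA(concrete)                 # TRUE if any value anywhere is missing
concrete <- na.omit(concrete)   # drop incomplete rows, as described above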
• Check incorrect values

These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate. The masses of these components must be non-negative, and the value of age must be between 1 and 365 days.
# Check for incorrect values and omit them
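# (Sketch: the original command was not captured. Assuming non-negative
# ingredient masses and age in [1, 365] define valid rows.)
concrete <- subset(concrete,
                   cement > 0 & slag >= 0 & ash >= 0 & water > 0 &
                   superplastic >= 0 & coarseagg > 0 & fineagg > 0 &
                   age >= 1 & age <= 365)
nrow(concrete)  # confirm how many observations remain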
4 Descriptive statistics:
• Data summary
After having done the cleaning process, we currently have a clear and clean data set in the data frame. Let's summarize concrete by using the function summary in R.
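The call is a one-liner (a sketch; the report names the function but the command itself was not captured):

summary(concrete)   # per-column minimum, median, mean, quartiles, and maximum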
cement          slag            ash              water           superplastic     coarseagg        fineagg
Min.   :102.0   Min.   :  0.0   Min.   :  0.00   Min.   :121.8   Min.   : 0.000   Min.   : 801.0   Min.   :594.0
Median :272.9   Median : 22.0   Median :  0.00   Median :185.0   Median : 6.400   Median : 968.0   Median :779.5
Mean   :281.2   Mean   : 73.9   Mean   : 54.19   Mean   :181.6   Mean   : 6.205   Mean   : 972.9   Mean   :773.6
3rd Qu.:350.0   3rd Qu.:142.9   3rd Qu.:118.30   3rd Qu.:192.0   3rd Qu.:10.200   3rd Qu.:1029.4   3rd Qu.:824.0
Max.   :540.0   Max.   :359.4   Max.   :200.10   Max.   :247.0   Max.   :32.200   Max.   :1145.0   Max.   :992.6
Figure 4.1: Summary data set
• Statistics and plots
Figure 4.2: Box plots for all concrete features

The use of box plots for concrete features can effectively illustrate the distribution, central tendency, and variability of the specific concrete features being analyzed. Box plots
can help in identifying outliers, comparing different feature distributions, and understanding the spread of the data.
According to the box plot chart, we can see that coarseagg accounts for a large proportion of the mass of concrete, while the substance with the smallest mass content in the mixture is superplasticizer. A notable point is the extremely large spread of the cement values, a range of about 438 kg/m³.
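A sketch of how a figure like 4.2 could be produced in base R (the report's plotting code was not captured):

# One box per column; las = 2 rotates axis labels for readability
boxplot(concrete, main = "Box Plots for Concrete Features", las = 2)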
Figure 4.3: Heatmap of relationships for concrete features

A heatmap of relationships for concrete features is a visual representation that displays the degree of correlation or association between different features or variables in a dataset. Each cell in the heatmap represents the correlation coefficient or some other measure of association between a pair of features, with the color intensity indicating the strength and direction of the relationship.
According to the heat map, we can clearly see that one feature has a high correlation with strength: cement.
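A hedged sketch of producing such a heatmap in R, here with the corrplot package; the report may have used a different plotting library:

library(corrplot)

corr_mat <- cor(concrete)   # Pearson correlations between all nine variables
corrplot(corr_mat, method = "color", addCoef.col = "black")   # cf. Figure 4.3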
Below is a pair plot to show the correlation between features more clearly.