project report concrete compressibility data analysis


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF CIVIL ENGINEERING


PROBABILITY & STATISTICS


Contents

I Dataset introduction
1 General information
2 Variables in dataset
II Import data
III Data cleaning
IV Data visualization
1 Data transformations
2 Descriptive statistic
3 Graph
3.1 Histogram
3.2 Boxplot
3.3 Pair plot
3.4 Conclusion
4 Remove outliers
5 Spearman’s correlation
5.1 Correlation analysis
5.2 Purpose of Spearman’s correlation
5.3 Assumptions of the test
5.4 Calculation for Spearman correlation coefficient
5.5 Ranking the variables
5.6 Apply and Result
5.7 Significance test
5.8 Conclusion
V Fitting linear regression model
1 Linear regression model
1.1 Simple linear regression model
1.2 Multiple linear regression model
2 Apply and result
2.1 Build model
2.2 Interpretation
3 Prediction and graph


Figures

Figure 1: Histograms
Figure 2: Boxplots
Figure 3: Pair Plot
Figure 4: Relationship between Cement and Concrete.Compressive.Strength
Figure 5: Boxplot after removing outliers
Figure 6: Correlation
Figure 7: Monotonic relationship
Figure 8: Correlation matrix
Figure 9: Prediction


we can find the most effective way to produce concrete

In this project, we analyse the dataset concrete.csv, which we collected from the Cement Manufacturing Dataset. The dataset has 1030 observations of 9 variables: 8 factors that affect the compressive strength of concrete, and the compressive strength itself.

Blast furnace slag: A main material used to form concrete. This material is obtained by quenching molten iron slag from a blast furnace in water or steam.

Coarse aggregate: Irregular and granular materials such as sand, gravel, or crushed stone that are used for making concrete.

Fine aggregate: Natural sand particles obtained from the land through the mining process and used for making concrete.

8 Age of testing day: The age of concrete is the time elapsed since it was poured in place and left to set.

Concrete compressive strength: This is the major variable of this dataset. It is the strength of hardened concrete measured by the compression test. Moreover, it depends on the 8 variables above.

II Import data

First of all, we need to convert the dataset file from Excel (*.xlsx) to Comma-Separated Values (*.csv). Subsequently, we save it in the folder named ‘Project’ and import the data with the read.csv() command.

Explain: We use the read.csv() command to import data from a CSV file into a data frame.
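The import command itself is not shown in the report, so the following is only a minimal sketch of this step. The temporary file, the two column names and the values are assumptions made for illustration; in the project the path would point to concrete.csv inside the ‘Project’ folder.

```r
# Hypothetical stand-in for the project's CSV file: a tiny file written to a
# temporary location so the example runs anywhere (values are made up).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("cement,strength",
             "300,35.5",
             "250,28.1"), csv_path)

# read.csv() treats the first row as the header and returns a data frame.
data <- read.csv(csv_path)
print(names(data))
print(nrow(data))
```

Printing the column names right after the import is a quick way to confirm that the header row was read as intended.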

III Data cleaning

In order to check whether the names of the variables are simple enough to work with, we use the print() command to print them as output.

By observing this output, we can see that the names of the variables are short and simple to work with, so we do not need to simplify them.


Explain 1: We use the str() command to print the data structure of each variable, to see whether it contains NA values or string values and to know whether we need to transform our data or not.

> str(data)

So, we can see the output:

So, we can see that the data are all numeric or integer, but we still need to check for NA values in the whole dataset by using the anyNA() function:

Now we can see the output:

Therefore, we can see that this dataset does not have any NA values, so we do not need to clean it any further and simply assign dataclean = data.

In case the output is ‘TRUE’, we need to remove all the NA values before assigning.
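The check and the fallback described here could be sketched as follows. The toy data frame and its values are assumptions for illustration, and na.omit() is one possible way to drop incomplete rows, not necessarily the one the authors would have used.

```r
# Toy data frame with one missing value (illustrative values only).
data <- data.frame(cement   = c(300, NA, 250),
                   strength = c(35.5, 28.1, 30.2))

if (anyNA(data)) {
  # na.omit() drops every row that contains at least one NA.
  dataclean <- na.omit(data)
} else {
  # Nothing to clean: keep the data as it is.
  dataclean <- data
}

print(nrow(dataclean))
```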


2 Descriptive statistic

Explain: We use the summary() function to see the descriptive statistics of each variable.

> summary(dataclean)

So, we can see the output:

Comment: With just these 6 values per variable we cannot conclude anything about each variable’s distribution, so we need a stronger tool: graphs.

3 Graph

3.1 Histogram

Histograms are commonly used in statistics to show how many values of a variable occur within a specific range.

Both histograms and bar charts provide a visual display using columns, and people often use the terms interchangeably. Technically, however, a histogram represents the frequency distribution of variables in a data set. A bar graph typically represents a graphical comparison of discrete or categorical variables.


In many cases, the data tend to be concentrated around a central point, not skewed to the left or to the right. A distribution whose shape closely resembles a "bell" is called a normal distribution.

With these definitions in mind, we can use R to get a better visualization of our data:

- Firstly, we need to import the 2 libraries used for plotting the graphs:


As we can see from the histograms, most of our variables are not normally distributed: their distributions are not concentrated around a central point but are skewed to the left or to the right.
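The histogram commands are not reproduced in the report, so here is a rough sketch of how one histogram per variable could be drawn. It uses base R graphics rather than the two plotting libraries mentioned above, and the two simulated columns are assumptions standing in for the real data.

```r
set.seed(1)
# Simulated stand-ins for two columns (names and values are illustrative).
dataclean <- data.frame(cement = rnorm(200, mean = 280, sd = 100),
                        age    = rexp(200, rate = 1 / 45))

# One histogram per variable, drawn side by side.
old_par <- par(mfrow = c(1, ncol(dataclean)))
for (v in names(dataclean)) {
  hist(dataclean[[v]], main = v, xlab = v)
}
par(old_par)

# A mean clearly above the median is a simple numeric hint of right skew.
skew_hint <- sapply(dataclean, function(x) mean(x) - median(x))
print(skew_hint)
```

The numeric hint is only a companion to the picture: the histogram remains the main tool for judging the shape of each distribution.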


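3.2 Boxplot

- Minimum Score: The lowest score, excluding outliers (shown at the end of the left whisker).

- Lower Quartile: Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile).

- Median: The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less.

- Upper Quartile: Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of data are above this value.

- Maximum Score: The highest score, excluding outliers (shown at the end of the right whisker).

- Whiskers: The upper and lower whiskers represent scores outside the middle 50% (i.e. the lower 25% of scores and the upper 25% of scores).

- Interquartile Range: The box shows the middle 50% of scores (i.e., the range between the 25th and 75th percentile).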
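The quantities listed above can be computed directly. The sketch below uses made-up values and assumes the common 1.5 × IQR rule for the whiskers, which is the default of R's boxplot functions; the report does not state which rule it relies on.

```r
# Illustrative values; 40 lies far from the rest.
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 12, 40)

# First quartile, median and third quartile.
q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- IQR(x)

# Points beyond 1.5 * IQR from the box are treated as outliers.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)

# boxplot.stats() applies the same idea, using hinges instead of quantiles.
print(boxplot.stats(x)$out)
```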

Because we already imported the 2 plotting libraries above, we do not need to import them again and can run the command directly:

> ggplot(gather(dataclean), aes(value)) + geom_boxplot() + facet_wrap(~key, scales = "free", ncol = 4)

Now we can see the output:

Figure 2: Boxplots

From the boxplots we can read the 1st quartile, 2nd quartile and 3rd quartile of each variable in the dataset. Therefore, it is possible to conclude again that most of the variables in the dataset are not normally distributed.

3.3 Pair plot

A pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. It is one of the simplest ways to visualize the relations between all features: the pair plot method plots all the pairwise relationships in the dataset at once. The method takes all the features in the dataset and plots them against each other.

Using a pair plot to see the relationship between each pair of variables:

> pairs(dataclean, col = "#6F8FAF", main = "PAIRPLOT")

The result is illustrated by the following figure:

Figure 3: Pair Plot

From the pair plot above, we can see that Concrete compressive strength tends to increase with Cement. The plot below shows the relationship between these two variables more specifically:

> pairs(dataclean[, c("cement", "strength")])

Figure 4: Relationship between Cement and Concrete Compressive Strength

3.4 Conclusion

By observing both the boxplots and the pair plot, it can be seen that there are some outliers in our data, so we have to remove them from our dataset.

4 Remove outliers

An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the dataset. An outlier can cause serious problems in statistical analyses.

We therefore find the outliers in our data in order to avoid these problems in statistical analyses:

> slag <- which(dataclean$slag %in% boxplot(dataclean$slag, plot=FALSE)$out)

> ash <- which(dataclean$ash %in% boxplot(dataclean$ash, plot=FALSE)$out)

> water <- which(dataclean$water %in% boxplot(dataclean$water, plot=FALSE)$out)

> superplastic <- which(dataclean$superplastic %in% boxplot(dataclean$superplastic, plot=FALSE)$out)

> coarseagg <- which(dataclean$coarseagg %in% boxplot(dataclean$coarseagg, plot=FALSE)$out)

> fineagg <- which(dataclean$fineagg %in% boxplot(dataclean$fineagg, plot=FALSE)$out)

> age <- which(dataclean$age %in% boxplot(dataclean$age, plot=FALSE)$out)

After removing the outliers, we need to assign a new dataframe for our dataset:

> dataclean2 <- dataclean[-c(slag, ash, water, superplastic, coarseagg, fineagg, age), ]

> nrow(dataclean2)

The output:

[1] 945

From this output, we can see that 85 observations were deleted from the dataset, so we now consider only 945 observations. We draw another box plot:

> ggplot(gather(dataclean2), aes(value)) + geom_boxplot() + facet_wrap(~key, scales = "free", ncol = 4)

And the output:

Figure 5: Boxplot after removing outliers

In practice, this step removes only part of the outliers: for instance, once the flagged rows are dropped the quartiles change, so other points can fall outside the new whiskers. Moreover, the outliers that remain still have a significant effect on the dataset.
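The removal step can also be written more compactly. The sketch below is one possible variant and not the report's own code: the helper outlier_rows() and the toy data are hypothetical, and the guard covers the case where no row is flagged, because indexing with -integer(0) would return zero rows.

```r
# Hypothetical helper: row indices whose value in column `col` is flagged
# by the boxplot rule.
outlier_rows <- function(df, col) {
  which(df[[col]] %in% boxplot.stats(df[[col]])$out)
}

# Toy data (illustrative values): row 5 is extreme in `water`, row 2 in `age`.
dataclean <- data.frame(water = c(180, 185, 190, 195, 400, 188, 192),
                        age   = c(28, 365, 28, 7, 14, 28, 3))

# Collect the flagged rows of every column, keeping each index once.
drop <- unique(unlist(lapply(names(dataclean), outlier_rows, df = dataclean)))

# Guard: only subset when at least one row was flagged.
dataclean2 <- if (length(drop) > 0) dataclean[-drop, ] else dataclean
print(nrow(dataclean2))
```

Collecting the indices with unique() avoids listing a row twice when it is an outlier in more than one variable.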

5 Spearman’s correlation

5.1 Correlation analysis

Correlation analysis in research is a statistical method used to measure the strength of the linear relationship between two variables and compute their association. Simply put, correlation analysis calculates the level of change in one variable that accompanies a change in the other. A high correlation points to a strong relationship between the two variables, while a low correlation means that the variables are weakly related.

We measure the correlation between variables by using the correlation coefficient (r), which ranges from -1 to 1:

- r < 0 indicates a negative relationship.

- r = 0 indicates that there is no linear relationship between the two variables.

- r > 0 indicates a positive relationship.

- r = −1 or r = 1 indicates that the two variables are perfectly correlated with each other.

Figure 6: Correlation

There are two types of correlation coefficient, the Pearson correlation coefficient and the Spearman correlation coefficient:

- Pearson correlation coefficient: We often use it to evaluate the linear relationship between two variables, and it can be used when the two variables are normally distributed.

- Spearman correlation coefficient: We often use it to evaluate a monotonic relationship that may be non-linear, and it can also be used for two variables that are not normally distributed.
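The difference between the two coefficients can be seen on a small constructed example. The data below are an assumption chosen for illustration: y increases strictly with x, but not along a straight line.

```r
x <- 1:20
y <- exp(x / 4)   # strictly increasing but clearly non-linear

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman sees a perfectly monotonic relationship, Pearson an imperfect
# linear one.
print(c(pearson = pearson, spearman = spearman))
```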


5.2 Purpose of Spearman’s correlation

The Spearman’s rank-order correlation is the nonparametric version of the Pearson product-moment correlation. Spearman’s correlation coefficient (ρ, also signified by r_s) measures the strength and direction of association between two ranked variables.

5.3 Assumptions of the test

You need two variables that are either ordinal, interval or ratio. Although you would normally hope to use a Pearson product-moment correlation on interval or ratio data, the Spearman correlation can be used when the assumptions of the Pearson correlation are markedly violated. However, Spearman's correlation determines the strength and direction of the monotonic relationship between your two variables rather than the strength and direction of the linear relationship between your two variables, which is what Pearson's correlation determines.

- What is a monotonic relationship?

A monotonic relationship is a relationship that does one of the following:

- As the value of one variable increases, so does the value of the other variable.

- As the value of one variable increases, the other variable value decreases.

Examples of monotonic and non-monotonic relationships are presented in the diagram below:

Figure 7: Monotonic relationship

Note: A monotonic relationship is not strictly an assumption of Spearman’s correlation. That is, you can run a Spearman’s correlation on a non-monotonic relationship to determine if there is a monotonic component to the association.

⇒ So when we list the noticeable coefficients between pairs of variables, we need to perform the significance test to check whether there is a monotonic relationship between each pair of variables or not.

5.4 Calculation for Spearman correlation coefficient

The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables.

For a sample of size n, the n raw scores $X_i$, $Y_i$ are converted to ranks $R(X_i)$, $R(Y_i)$, and $r_s$ is computed as:

$$r_s = \frac{\operatorname{cov}(R(X), R(Y))}{\sigma_{R(X)}\,\sigma_{R(Y)}}$$

- $\operatorname{cov}(R(X), R(Y))$: the covariance of the rank variables

- $\sigma_{R(X)}$ and $\sigma_{R(Y)}$: the standard deviations of the rank variables

5.5 Ranking the variables

- Sort all values of the variable from the highest to the lowest.

- The highest value has rank 1, the second highest has rank 2, and so on down to the lowest value.
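The ranking rule and the definition of the coefficient can be checked on a small example. The values below are made up and contain no ties; rank(-x) is used so that the highest value receives rank 1, as described above.

```r
# Illustrative values with no ties.
x <- c(10, 50, 30, 20, 40)
y <- c(1.2, 4.8, 2.9, 3.5, 4.1)

# rank(-x) gives rank 1 to the highest value.
rx <- rank(-x)
ry <- rank(-y)

# Spearman's coefficient is Pearson's coefficient of the ranks.
rs_manual  <- cov(rx, ry) / (sd(rx) * sd(ry))
rs_builtin <- cor(x, y, method = "spearman")
print(c(manual = rs_manual, builtin = rs_builtin))
```

Ranking from the lowest to the highest instead would give the same coefficient, as long as both variables are ranked in the same direction.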

5.6 Apply and Result

> library(corrplot)

(Whether library(corrplot) can be used directly depends on the RStudio installation.)

> M <- cor(dataclean2, method = 'spearman')

> corrplot(M, method = 'number', tl.col = "black", col = colorRampPalette(c("#F6F2D4", "#95D1CC", "#22577E"))(100))

Explain: We assign to M the Spearman’s correlation coefficients for each pair of variables in the dataset, and we can see the output below:


Figure 8:Correlation matrix


From this figure, we have:

- Concrete compressive strength and Age of testing day have a relatively strong positive correlation (ρ = 0.49)

- Concrete compressive strength and Cement have a relatively strong positive correlation (ρ = 0.46)

- Fly ash and Super plasticizer have a relatively strong positive correlation (ρ = 0.45)

- Super plasticizer and Water have a relatively strong negative correlation (ρ = −0.52)
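Instead of reading the noticeable pairs off the figure, they could also be extracted from the correlation matrix. The sketch below runs on simulated data, and the cut-off of 0.4 is an assumption chosen for illustration, not a threshold stated in the report.

```r
set.seed(42)
n <- 100
# Simulated stand-ins: strength is built to rise with cement, while water
# is independent of both.
cement   <- runif(n, 100, 500)
water    <- runif(n, 120, 250)
strength <- 0.1 * cement + rnorm(n, sd = 5)
df <- data.frame(cement, water, strength)

M <- cor(df, method = "spearman")

# Keep each pair once (upper triangle) and list those with |rho| >= 0.4.
idx <- which(upper.tri(M) & abs(M) >= 0.4, arr.ind = TRUE)
pairs_tbl <- data.frame(var1 = rownames(M)[idx[, "row"]],
                        var2 = colnames(M)[idx[, "col"]],
                        rho  = round(M[idx], 2))
print(pairs_tbl)
```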

5.7 Significance test

In Spearman’s correlation test there is one more step, the significance test, to check whether the correlation between two variables is monotonic or not.

So we set up the hypothesis test where:

- H0: there is no monotonic correlation (ρ = 0)

- H1: there is a monotonic correlation (ρ ≠ 0)

We can test for significance using the equation:

$$t = r_s \sqrt{\frac{n - 2}{1 - r_s^2}}$$

where $r_s$ is the sample Spearman correlation coefficient and n is the number of observations. This statistic is distributed approximately as Student’s t-distribution with n − 2 degrees of freedom under the null hypothesis. We test at the significance level α = 1%, so if p-value < 1%, we can reject H0.
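The test can be carried out by hand and compared with cor.test(). The sketch below uses simulated data (an assumption for illustration) and relies on the fact that, in current R versions, cor.test() with exact = FALSE uses the same t approximation for Spearman’s coefficient.

```r
set.seed(7)
n <- 50
x <- rnorm(n)
y <- x + rnorm(n)   # simulated, positively associated data

# Test statistic built from the sample Spearman coefficient.
rs <- cor(x, y, method = "spearman")
t_stat <- rs * sqrt((n - 2) / (1 - rs^2))

# Two-sided p-value from Student's t with n - 2 degrees of freedom.
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

# Built-in test using the asymptotic t approximation.
p_builtin <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value

alpha <- 0.01
print(c(t = t_stat, p_manual = p_manual, p_builtin = p_builtin))
print(p_manual < alpha)   # TRUE means H0 is rejected at the 1% level
```

Taking n from the data rather than typing it in keeps the degrees of freedom consistent with the data frame that is actually tested.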

a. Concrete compressive strength and Cement

> cement_concrete_strength <- cor.test(dataclean2$cement, dataclean2$strength, method = "spearman", exact = FALSE)

> cement_concrete_strength$p.value

So, we can see the output below:

[1] 1.325723e-50

> cement_concrete_strength$estimate * sqrt((nrow(dataclean2) - 2) / (1 - cement_concrete_strength$estimate^2))

(Whether the label rho appears in this output depends on the version of RStudio.)

Title	Concrete Compressibility Data Analysis
Supervisor	Dr. Nguyen Tien Dung
University	Ho Chi Minh City University of Technology
Major	Probability and Statistics
Type	Project Report
Year of publication	2023
City	Ho Chi Minh City
