VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF CIVIL ENGINEERING
PROBABILITY & STATISTICS
Contents
I Dataset introduction
1 General information
2 Variables in dataset
II Import data
III Data cleaning
IV Data visualization
1 Data transformations
2 Descriptive statistics
3 Graph
3.1 Histogram
3.2 Boxplot
3.3 Pair plot
3.4 Conclusion
4 Remove outliers
5 Spearman’s correlation
5.1 Correlation analysis
5.2 Purpose of Spearman’s correlation
5.3 Assumptions of the test
5.4 Calculation for Spearman correlation coefficient
5.5 Ranking the variables
5.6 Apply and Result
5.7 Significance test
5.8 Conclusion
V Fitting linear regression model
1 Linear regression model
1.1 Simple linear regression model
1.2 Multiple linear regression model
2 Apply and result
2.1 Build model
2.2 Interpretation
3 Prediction and graph
Figures
Figure 1: Histograms
Figure 2: Boxplots
Figure 3: Pair Plot
Figure 4: Relationship between Cement and Concrete.Compressive.Strength
Figure 5: Boxplot after removing outliers
Figure 6: Correlation
Figure 7: Monotonic relationship
Figure 8: Correlation matrix
Figure 9: Prediction
we can find the most effective way to produce concrete
In this project, we will analyse the dataset concrete.csv that we collected from the Cement Manufacturing Dataset. The dataset comprises 9 variables, namely 8 factors and the compressive strength of concrete that they affect, with 1030 observations.
mixture
A main material to form concrete, and this material is obtained by quenching molten iron slag from a blast furnace in water or steam.
mixture
Coarse Aggregate refers to irregular and granular materials such as sand, gravel, or crushed stone that are used for making concrete.
mixture
Fine Aggregate is basically natural sand particles obtained from the land through the mining process and used for making concrete.
8 Age of testing day: The age of concrete is the time elapsed since it was poured in place and left to set.
This is the major variable of this dataset. It is the strength of hardened concrete measured by the compression test. Moreover, it depends on the 8 variables above.
II Import data
First of all, we need to convert the dataset file from Excel (*.xlsx) into Comma Separated Values (*.csv). Subsequently, we save it in the folder named ‘Project’ and use the command below to import the data:
Explain: We use the read.csv() command to import data from a csv file.
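As a minimal sketch of this import step, we can read a small stand-in file with read.csv(). The temporary file, its four columns and its values are assumptions made only so that the example runs on its own; with the real data, the file name concrete.csv and the working directory would be used as in the commented line.

```r
# Illustrative stand-in for the exported concrete.csv (values are made up for the demo).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "cement,water,age,strength",
  "540.0,162.0,28,79.99",
  "332.5,228.0,270,40.27",
  "198.6,192.0,360,44.30"
), csv_path)

# read.csv() treats the first line as the header and returns a data frame.
data <- read.csv(csv_path)

print(names(data))  # column names taken from the header line
print(dim(data))    # number of rows and columns

# With the real file, assuming the working directory is the Project folder:
# data <- read.csv("concrete.csv")
```

Printing names(data) right after the import is also how we check the variable names in the next section.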
III Data cleaning
In order to check whether the names of the variables are simple enough to work with, we use the print() command to print them as outputs:
So, we receive the output:
By observing this output, we can see that the names of the variables are short and simple to work with, so we do not need to simplify them.
Explain 1: We use the str() command to print the data structure of each variable, to see whether it contains NA values or string values and to know whether we need to transform our data.
> str(data)
So, we can see the output:
The data are all numeric or integer values, but we still need to check for NA values in the whole dataset by using the anyNA() function:
Now we can see the output:
The output shows that the data do not have any NA value, so no further cleaning is needed and we just assign dataclean = data.
If the output were ‘TRUE’, we would need to remove all the NA values before assigning.
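The two cases above can be sketched as follows. The toy data frame and its values are assumptions for illustration only, and na.omit() is one possible way to drop incomplete rows, not necessarily the one we would choose for the real dataset.

```r
# Toy data frame with two missing values (made-up numbers, for illustration only).
data <- data.frame(
  cement   = c(540.0, NA, 198.6, 332.5),
  strength = c(79.99, 40.27, NA, 41.05)
)

print(anyNA(data))           # TRUE: at least one NA somewhere in the data frame
print(colSums(is.na(data)))  # how many NA values each column holds

# Keep the data as they are when complete, otherwise drop rows containing any NA.
dataclean <- if (anyNA(data)) na.omit(data) else data
print(nrow(dataclean))
```

Counting the NA values per column first shows how many rows would be lost before we decide to drop them.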
2 Descriptive statistics
> summary(dataclean)
Explain: We use the summary() function to see descriptive statistics of each variable.
So, we can see the output:
Comment: Six summary values alone do not let us conclude anything about the distribution of each variable, so we need a stronger tool: graphs.
3 Graph
3.1 Histogram
Histograms are commonly used in statistics to demonstrate how many of a certain type of variable occur within a specific range.
Both histograms and bar charts provide a visual display using columns, and people often use the terms interchangeably. Technically, however, a histogram represents the frequency distribution of variables in a data set. A bar graph typically represents a graphical comparison of discrete or categorical variables.
In many cases, the data tend to be concentrated around a central point, not skewed left or right. A shape that closely resembles a "bell" is called a normal distribution.
With these definitions in mind, we can use R to get a better visualization of the distributions:
- Firstly, we need to import the 2 libraries used for plotting the graphs:
As we can see from the histograms, most of our variables are not normally distributed, because their distributions are not concentrated around a central point but are skewed left or right.
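3.2 Boxplot
- Minimum Score: The lowest score, excluding outliers (shown at the end of the left whisker).
- Lower Quartile: Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile).
- Median: The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less.
- Upper Quartile: Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of data are above this value.
- Maximum Score: The highest score, excluding outliers (shown at the end of the right whisker).
- Whiskers: The upper and lower whiskers represent scores outside the middle 50% (i.e., the lower 25% of scores and the upper 25% of scores).
- Interquartile Range: The box shows the middle 50% of scores (i.e., the range between the 25th and 75th percentiles).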
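The quantities in this list can be computed directly in R. The sketch below uses boxplot.stats() on a made-up sample; the numbers are assumptions for illustration and do not come from the concrete dataset.

```r
# Made-up sample with one unusually large value.
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 30)

# boxplot.stats() returns the five numbers drawn by a box plot:
# lower whisker end, lower quartile (hinge), median, upper quartile (hinge), upper whisker end.
bs <- boxplot.stats(x)
five <- setNames(bs$stats, c("min", "Q1", "median", "Q3", "max"))
print(five)

# Interquartile range: the width of the box.
iqr <- unname(five["Q3"] - five["Q1"])
print(iqr)

# Points more than 1.5 * IQR above the box are reported as outliers, not as whisker ends.
upper_fence <- unname(five["Q3"]) + 1.5 * iqr
print(upper_fence)
print(bs$out)
```

Here the value 30 lies above the upper fence, so it is listed in bs$out and the right whisker ends at 12, the largest score that is not an outlier.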
Because we already imported the 2 plotting libraries above, we do not need to import them again and can run the command directly:
> ggplot(gather(dataclean), aes(value)) + geom_boxplot() + facet_wrap(~key, scales = "free", ncol = 4)
Now we can see the output:
Figure 2: Boxplots
The boxplots show the quartiles of each variable in the dataset. Therefore, we can conclude again that most of the variables in the dataset are not normally distributed.
3.3 Pair plot
A pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. It is one of the simplest ways to visualize the relations between all features: the pair plot method plots all the pairwise relationships in the dataset at once. The method takes all the features in the dataset and plots them against each other.
Using a pair plot to see the relationship between each pair of variables:
> pairs(dataclean, col = "#6F8FAF", main = "PAIRPLOT")
The result is illustrated by the following figure:
Figure 3: Pair Plot
From the pair plot above, we can see that the Concrete compressive strength tends to increase with Cement. The plot below shows the relationship between these two variables in more detail:
> pairs(dataclean[, c("cement", "strength")])
Figure 4: Relationship between Cement and Concrete Compressive Strength
3.4 Conclusion
By observing both the boxplots and the pair plot, we can see that there are some outliers in our data, so we have to remove them from our dataset.
4 Remove outliers
An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the dataset. An outlier can cause serious problems in statistical analyses.
So we need to remove the outliers in our data in order to avoid these problems in statistical analyses:
> slag <- which(dataclean$slag %in% boxplot(dataclean$slag, plot=FALSE)$out)
> ash <- which(dataclean$ash %in% boxplot(dataclean$ash, plot=FALSE)$out)
> water <- which(dataclean$water %in% boxplot(dataclean$water, plot=FALSE)$out)
> superplastic <- which(dataclean$superplastic %in% boxplot(dataclean$superplastic, plot=FALSE)$out)
> coarseagg <- which(dataclean$coarseagg %in% boxplot(dataclean$coarseagg, plot=FALSE)$out)
> fineagg <- which(dataclean$fineagg %in% boxplot(dataclean$fineagg, plot=FALSE)$out)
> age <- which(dataclean$age %in% boxplot(dataclean$age, plot=FALSE)$out)
After finding the outliers, we assign a new data frame that excludes these rows:
> dataclean2 <- dataclean[-c(slag, ash, water, superplastic, coarseagg, fineagg, age), ]
> nrow(dataclean2)
The output:
[1] 945
Through this output, we can see that 85 observations were deleted from the dataset, so we now consider only 945 observations. We draw another box plot:
> ggplot(gather(dataclean2), aes(value)) + geom_boxplot() + facet_wrap(~key, scales = "free", ncol = 4)
And the output:
Figure 5: Boxplot after removing outliers
In practice, we cannot remove all the outliers; we can only remove a part of them. Moreover, the outliers that cannot be removed still have some significant effects on the dataset.
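One reason why some outliers remain can be sketched with a small made-up vector (the numbers are assumptions for illustration): after the first removal the quartiles are recomputed, so a point that was inside the whiskers before may be flagged on the next pass.

```r
# Made-up sample: 40 is an obvious outlier, 17 is a borderline value.
x <- c(1:10, 17, 40)

# First pass: the same rule as boxplot(..., plot = FALSE)$out used in the report.
out1 <- boxplot(x, plot = FALSE)$out
print(out1)

# Remove the flagged values and check again on the cleaned vector.
x1 <- x[!(x %in% out1)]
out2 <- boxplot(x1, plot = FALSE)$out
print(out2)

# A third pass on the remaining values flags nothing more.
x2 <- x1[!(x1 %in% out2)]
out3 <- boxplot(x2, plot = FALSE)$out
print(length(out3))
```

In this toy example the first pass flags only 40; once it is removed, the box becomes narrower and 17 is flagged as well.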
5 Spearman’s correlation
5.1 Correlation analysis
Correlation analysis in research is a statistical method used to measure the strength of the linear relationship between two variables and compute their association. Simply put, correlation analysis calculates the level of change in one variable associated with the change in the other. A high correlation points to a strong relationship between the two variables, while a low correlation means that the variables are weakly related.
We measure the correlation between variables by using the correlation coefficient (r), which ranges from -1 to 1:
- r < 0 indicates a negative relationship
- r = 0 indicates that there is no correlation between the two variables
- r > 0 indicates a positive relationship
- r = −1 or r = 1 indicates that the two variables are perfectly correlated with each other
Figure 6: Correlation
There are two types of correlation coefficient, the Pearson correlation coefficient and the Spearman correlation coefficient:
- Pearson correlation coefficient: We often use it to evaluate the linear relationship between two variables, and it can be used when the two variables are normally distributed.
- Spearman correlation coefficient: We often use it to evaluate a monotonic relationship, which may be non-linear, and it can also be used for two variables that are not normally distributed.
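The difference between the two coefficients can be illustrated with a small made-up example in which y grows exponentially with x, so the relationship is perfectly monotonic but not linear. The data are an assumption for illustration only.

```r
# y increases with x at every step, but along a curve rather than a straight line.
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")   # measures how close the points are to a line
spearman <- cor(x, y, method = "spearman")  # measures how well the ordering is preserved

print(pearson)
print(spearman)
```

The Spearman coefficient equals 1 because the ranks of x and y agree exactly, while the Pearson coefficient stays clearly below 1 because the points do not lie on a straight line.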
5.2 Purpose of Spearman’s correlation
The Spearman’s rank-order correlation is the nonparametric version of the Pearson product-moment correlation. Spearman’s correlation coefficient ($\rho$, also signified by $r_s$) measures the strength and direction of association between two ranked variables.
5.3 Assumptions of the test
You need two variables that are either ordinal, interval or ratio. Although you would normally hope to use a Pearson product-moment correlation on interval or ratio data, the Spearman correlation can be used when the assumptions of the Pearson correlation are markedly violated. However, Spearman's correlation determines the strength and direction of the monotonic relationship between your two variables rather than the strength and direction of the linear relationship between your two variables, which is what Pearson's correlation determines.
- What is a monotonic relationship?
A monotonic relationship is a relationship that does one of the following:
As the value of one variable increases, so does the value of the other variable.
As the value of one variable increases, the other variable value decreases.
Examples of monotonic and non-monotonic relationships are presented in the diagram below:
Figure 7: Monotonic relationship
Note: A monotonic relationship is not strictly an assumption of Spearman’s correlation. That is, you can run a Spearman’s correlation on a non-monotonic relationship to determine if there is a monotonic component to the association.
⇒ So when we pick out a noticeable coefficient between two variables, we need to perform the significance test to check whether there is a monotonic relationship between this pair of variables or not.
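The note above can be illustrated with a made-up non-monotonic relationship (the data are an assumption for illustration): on a symmetric parabola the decreasing and increasing halves cancel out, while each half on its own is perfectly monotonic.

```r
# y = x^2 first decreases and then increases as x grows: not monotonic overall.
x <- -5:5
y <- x^2

rho_all   <- cor(x, y, method = "spearman")                    # whole range
rho_right <- cor(x[x >= 0], y[x >= 0], method = "spearman")    # increasing half only
rho_left  <- cor(x[x <= 0], y[x <= 0], method = "spearman")    # decreasing half only

print(c(all = rho_all, right = rho_right, left = rho_left))
```

A coefficient near 0 on the whole range therefore does not mean that the variables are unrelated, only that there is no overall monotonic component.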
5.4 Calculation for Spearman correlation coefficient
The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables.
For a sample of size $n$, the $n$ raw scores $X_i, Y_i$ are converted to ranks $R(X_i), R(Y_i)$, and $r_s$ is computed as:
$$r_s = \rho_{R(X),R(Y)} = \frac{\operatorname{cov}(R(X), R(Y))}{\sigma_{R(X)} \, \sigma_{R(Y)}}$$
- $\operatorname{cov}(R(X), R(Y))$: the covariance of the rank variables
- $\sigma_{R(X)}$ and $\sigma_{R(Y)}$: the standard deviations of the rank variables
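This definition can be checked numerically. The sketch below uses two short made-up vectors (an assumption for illustration) and compares the covariance of the ranks divided by their standard deviations with the value returned by cor() with method = "spearman".

```r
# Made-up scores without ties.
x <- c(10, 20, 30, 40, 50, 60)
y <- c(1, 3, 2, 5, 4, 6)

# Convert the raw scores to ranks.
rx <- rank(x)
ry <- rank(y)

# Covariance of the rank variables divided by the product of their standard deviations.
rs_manual <- cov(rx, ry) / (sd(rx) * sd(ry))

# Built-in Spearman coefficient for comparison.
rs_builtin <- cor(x, y, method = "spearman")

print(c(manual = rs_manual, builtin = rs_builtin))
```

Both values agree, which confirms that the Spearman coefficient is the Pearson coefficient applied to the ranks.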
5.5 Ranking the variables
- Sort all values of the variable from the highest to the lowest.
- The highest value has rank 1, the second highest has rank 2, and so on down to the lowest value.
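A minimal sketch of this ranking in R, on made-up values: rank() ranks in ascending order by default, so we rank the negated values to give rank 1 to the highest value. Tied values share the average of their ranks, which is the default behaviour of rank(). The vectors are assumptions for illustration.

```r
# Made-up values; 30 appears twice to show how ties are handled.
x <- c(10, 30, 20, 30)
y <- c(5, 9, 7, 8)

# Rank from the highest (rank 1) to the lowest; the two 30s share rank 1.5.
rank_desc <- rank(-x)
print(rank_desc)

# The direction of ranking does not change the coefficient,
# as long as both variables are ranked in the same direction.
rs_desc <- cor(rank(-x), rank(-y))
rs_asc  <- cor(rank(x), rank(y))
print(c(descending = rs_desc, ascending = rs_asc))
```

This is why cor() with method = "spearman", which ranks in ascending order internally, gives the same coefficient as the ranking described above.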
5.6 Apply and Result
> library(corrplot)
> M <- cor(dataclean2, method = 'spearman')
> corrplot(M, method = 'number', tl.col="black", col = colorRampPalette(c("#F6F2D4", "#95D1CC", "#22577E"))(100))
(The library(corrplot) call can be skipped if the corrplot package is already loaded in the current R session.)
Explain: We assign to M the matrix of Spearman’s correlation coefficients for each pair of variables in the dataset, so that we can see the output below:
Figure 8: Correlation matrix
From this figure, we have:
- Concrete compressive strength and Age of testing day have a relatively strong positive correlation ($\rho = 0.49$)
- Concrete compressive strength and Cement have a relatively strong positive correlation ($\rho = 0.46$)
- Fly ash and Super plasticizer have a relatively strong positive correlation ($\rho = 0.45$)
- Super plasticizer and Water have a relatively strong negative correlation ($\rho = -0.52$)
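Instead of reading the noticeable pairs off the figure, they can also be listed from the correlation matrix. The sketch below is only an illustration under assumptions: the data frame demo is simulated, its column names a, b, c are hypothetical, and the cut-off 0.4 is our own choice and not a rule from the analysis above.

```r
# Simulated data: b is built from a, c is independent noise.
set.seed(42)
n <- 200
a <- rnorm(n)
demo <- data.frame(a = a, b = a + rnorm(n, sd = 0.5), c = rnorm(n))

# Spearman correlation matrix, as in the report.
M <- cor(demo, method = "spearman")

# Keep each pair once (upper triangle) when |rho| reaches the chosen cut-off.
cutoff <- 0.4
idx <- which(abs(M) >= cutoff & upper.tri(M), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(M)[idx[, "row"]],
  var2 = colnames(M)[idx[, "col"]],
  rho  = M[idx]
)
print(strong)
```

Using upper.tri() avoids listing each pair twice and skips the diagonal, where every variable is perfectly correlated with itself.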
5.7 Significance test
In Spearman’s correlation test, we have one more step, the significance test, to check whether the monotonic correlation between two variables is statistically significant.
So we set up the hypothesis test where:
- $H_0$: there is no monotonic correlation ($\rho = 0$)
- $H_1$: there is a monotonic correlation ($\rho \neq 0$)
We can test for significance using the equation:
$$t = \rho \sqrt{\frac{n - 2}{1 - \rho^2}}$$
which is distributed approximately as Student’s t-distribution with $n - 2$ degrees of freedom under the null hypothesis. We test at the significance level $\alpha = 1\%$, so if the p-value $< 1\%$, we can reject $H_0$.
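The test can be sketched on simulated data as follows. The vectors x and y are assumptions for illustration; the sketch computes the t statistic from the formula above, derives the two-sided p-value from the t-distribution, and compares it with the p-value reported by cor.test() when exact = FALSE.

```r
# Simulated pair of variables with a positive association (illustration only).
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)
n <- length(x)

# Spearman test using the t-distribution approximation instead of the exact test.
res <- cor.test(x, y, method = "spearman", exact = FALSE)
rho <- unname(res$estimate)

# t statistic with n - 2 degrees of freedom.
t_stat <- rho * sqrt((n - 2) / (1 - rho^2))

# Two-sided p-value from Student's t-distribution.
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

# Decision at the 1% significance level.
alpha <- 0.01
reject_h0 <- p_manual < alpha

print(c(t = t_stat, p_manual = p_manual, p_cor_test = res$p.value))
print(reject_h0)
```

The manual p-value matches the one from cor.test(), so in practice we can read the p-value from cor.test() directly and compare it with the significance level.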
a Concrete compressive strength and Cement
> cement_concrete_strength <- cor.test(dataclean2$cement, dataclean2$strength, method = "spearman", exact = FALSE)
> cement_concrete_strength$p.value
So, we can see the output below:
[1] 1.325723e-50
> cement_concrete_strength$estimate * sqrt((nrow(dataclean2) - 2) / (1 - cement_concrete_strength$estimate^2))
(The label rho may or may not appear in this output, depending on the version of RStudio.)