VIET NAM NATIONAL UNIVERSITY
HCMC UNIVERSITY OF TECHNOLOGY
DEPARTMENT OF CHEMICAL ENGINEERING

Report Assignment
PROBABILITY AND STATISTICS

Lecturer: PhD Nguyễn Tiến Dũng
CC02 – Group 09

Team members

Name | Student ID
--- | ---
Nguyễn Gia Phát | 2152228
Nguyễn Trọng Nguyên | 2152197
Lê Nguyễn Phú Anh | 2152383
Nguyễn Quốc Hưng | 2153411
Đỗ Tấn Kiệt | 1852490

Ho Chi Minh City, Sunday, 4th December 2022

TABLE OF CONTENTS
I Topic
II Theoretical basis
2.1 One-way ANOVA
2.2 Two-way ANOVA
2.3 Prediction model - Multiple Linear Regression
III Data processing
1. Data import
2. Checking statistics values
3. Data visualization
4. Building a linear regression model
5. Make forecasts for the compressive strength of concrete
REFERENCES
Contribution of team members
Points

I Topic

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. The file “concrete.csv” contains information about the compressive strength of concrete as affected by the eight input variables listed below. The data set was taken from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

The data set contains 1030 instances of the compressive strength of concrete and nine attributes.

Main variables in the dataset:
Cement – quantitative – kg in a m3 mixture – Input Variable
Blast Furnace Slag – quantitative – kg in a m3 mixture – Input Variable
Fly Ash – quantitative – kg in a m3 mixture – Input Variable
Water – quantitative – kg in a m3 mixture – Input Variable
Superplasticizer – quantitative – kg in a m3 mixture – Input Variable
Coarse Aggregate – quantitative – kg in a m3 mixture – Input Variable
Fine Aggregate – quantitative – kg in a m3 mixture – Input Variable
Age – quantitative – days – Input Variable
Concrete compressive strength – quantitative – MPa – Output Variable

The purpose of our team is to test whether the linear regression model
between the concrete compressive strength and the input variables really exists and, if it does, to make a forecast based on the data in the file “concrete.csv” and use ANOVA to analyze the influence of each variable.

II Theoretical basis

2.1 One-way ANOVA

One-way ANOVA is a hypothesis test that uses variances to test the equality of three or more population means simultaneously.

For example: in one laboratory, a team studied whether changes in CO2 concentration affected the germination rate of soybean seeds by gradually increasing the CO2 concentration and recording the height of the bean sprouts after a set number of days.
• Statistical problem: comparing the mean heights between the CO2 concentration groups.

Assumptions for using one-way ANOVA:
• The populations are normally distributed. To test normality, we use the normal probability plot of the residuals (mentioned in the prediction model section).
• The samples are random and independent.
• The populations have equal variances.

An observed dataset can be generalized as in the table below:

Treatment | Observations | Totals | Averages
--- | --- | --- | ---
1 | $y_{11}, y_{12}, \dots, y_{1n}$ | $y_{1.}$ | $\bar{y}_{1.}$
2 | $y_{21}, y_{22}, \dots, y_{2n}$ | $y_{2.}$ | $\bar{y}_{2.}$
… | … | … | …
a | $y_{a1}, y_{a2}, \dots, y_{an}$ | $y_{a.}$ | $\bar{y}_{a.}$
 | | $y_{..}$ | $\bar{y}_{..}$

where $y_{i.} = \sum_{j=1}^{n} y_{ij}$, $\bar{y}_{i.} = y_{i.}/n$, and

$$y_{..} = \sum_{i=1}^{a} \sum_{j=1}^{n} y_{ij}, \qquad \bar{y}_{..} = \frac{y_{..}}{an}$$

Model considered:

$$Y_{ij} = \mu + \tau_i + \epsilon_{ij} \qquad (i = 1, 2, \dots, a;\; j = 1, 2, \dots, n)$$

• Where: $\mu$ is the overall mean, $\tau_i$ is the $i$th treatment effect, and $\epsilon_{ij}$ is the random error component.

Null and alternative hypotheses:

$$H_0: \tau_1 = \tau_2 = \dots = \tau_a = 0 \qquad H_1: \tau_i \neq 0 \text{ for at least one } i$$

Source | Sum of squares (SS) | Degrees of freedom (df) | Mean square (MS)
--- | --- | --- | ---
Treatment | $SS_{Treatment} = n \sum_{i=1}^{a} (\bar{y}_{i.} - \bar{y}_{..})^2$ | $a - 1$ | $MS_{Treatment} = \frac{SS_{Treatment}}{a - 1}$
Error | $SS_E = \sum_{i=1}^{a} \sum_{j=1}^{n} (y_{ij} - \bar{y}_{i.})^2$ | $a(n - 1)$ | $MS_E = \frac{SS_E}{a(n - 1)}$
Total | $SS_T = SS_{Treatment} + SS_E$ | $an - 1$ |

Test statistic:

$$F_0 = \frac{MS_{Treatment}}{MS_E} = \frac{SS_{Treatment}/(a - 1)}{SS_E/[a(n - 1)]}$$

• $F_0$ has a Fisher (F) distribution with $a - 1$ and $a(n - 1)$ degrees of freedom: $F_0 \sim F_{a-1,\, a(n-1)}$.
• Given $\alpha$, $H_0$ is rejected if $f_0 > f_{\alpha,\, a-1,\, a(n-1)}$.

2.2 Two-way ANOVA

Two-way ANOVA is a statistical technique used for
examining the effect of two factors on a continuous dependent variable. It also studies the interrelationship between the two independent variables that influences the values of the dependent one.

For example: several male and female students of different ages took part in an arithmetic test, and the exam results were recorded. In this case, two-way ANOVA could be used to determine whether gender and age affected the scores.
• Statistical problem: comparing the mean scores according to gender and age.

The assumptions for using two-way ANOVA are similar to those of one-way ANOVA (section 2.1).

The dataset for two-way ANOVA can be generalized as follows, with the K levels of Factor 1 in columns and the H levels of Factor 2 in rows:

Factor 2 \ Factor 1 | 1 | 2 | … | K
--- | --- | --- | --- | ---
1 | $X_{11}$ | $X_{21}$ | … | $X_{K1}$
2 | $X_{12}$ | $X_{22}$ | … | $X_{K2}$
… | … | … | … | …
H | $X_{1H}$ | $X_{2H}$ | … | $X_{KH}$

The mean values:

Mean of each column | Mean of each row | Total mean
--- | --- | ---
$\bar{X}_{i.} = \frac{1}{H} \sum_{j=1}^{H} X_{ij}$, $i = 1, 2, \dots, K$ | $\bar{X}_{.j} = \frac{1}{K} \sum_{i=1}^{K} X_{ij}$, $j = 1, 2, \dots, H$ | $\bar{X} = \frac{1}{n} \sum_{i=1}^{K} \sum_{j=1}^{H} X_{ij} = \frac{1}{K} \sum_{i=1}^{K} \bar{X}_{i.} = \frac{1}{H} \sum_{j=1}^{H} \bar{X}_{.j}$, with $n = KH$

Variance analysis factors:

Source | Sum of squares | Degrees of freedom | Mean square | F-ratio
--- | --- | --- | --- | ---
Group i | $SS_K = H \sum_{i=1}^{K} (\bar{X}_{i.} - \bar{X})^2$ | $K - 1$ | $MS_K = \frac{SS_K}{K - 1}$ | $F_1 = \frac{MS_K}{MS_E}$
Group j | $SS_H = K \sum_{j=1}^{H} (\bar{X}_{.j} - \bar{X})^2$ | $H - 1$ | $MS_H = \frac{SS_H}{H - 1}$ | $F_2 = \frac{MS_H}{MS_E}$
Error | $SS_E = SS_T - SS_K - SS_H$ | $(K - 1)(H - 1)$ | $MS_E = \frac{SS_E}{(K - 1)(H - 1)}$ |
Total | $SS_T = \sum_{i=1}^{K} \sum_{j=1}^{H} (X_{ij} - \bar{X})^2$ | $KH - 1$ | |

 | Factor 1 | Factor 2
--- | --- | ---
$H_0$ | No difference in the means of group i | No difference in the means of group j
$H_1$ | At least one difference in the means of group i | At least one difference in the means of group j
Given $\alpha$ | Reject $H_0$ if $f_1 > f_{\alpha,\, K-1,\, (K-1)(H-1)}$ | Reject $H_0$ if $f_2 > f_{\alpha,\, H-1,\, (K-1)(H-1)}$

2.3 Prediction model - Multiple Linear Regression

Regression analysis is the collection of statistical tools used to model and explore relationships between variables that are related in a non-deterministic manner. Multiple linear regression is a key technique for studying the linearity and dependency between a group of independent variables and a dependent one. The general formula for multiple linear regression can be
expressed as:

$$Y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon$$

• $\beta_0, \beta_1, \dots, \beta_k$ are the regression coefficients. Each parameter $\beta_j$ ($j = 1, \dots, k$) represents the change in the mean response, $E(y)$, per unit increase in the associated predictor variable when all the other predictors are held constant.
• $\epsilon$ is called the random error and follows $N(0, \sigma^2)$.

Assumptions of the multiple linear regression model:
• A linear relationship between the dependent and independent variables (this can be checked with a scatter diagram). Note that in some cases the independent variables are not in a compatible format or do not show a linear relationship; we can then use a data transformation to make them fit better.
• The independent variables are not highly correlated with each other.
• The variance of the residuals is constant.
• The observations are independent.
• Multivariate normality (which holds when the residuals are normally distributed).

Predicted values and residuals:
• A predicted value is calculated as $\hat{y}_i = b_0 + b_1 x_1 + \dots + b_k x_k$, where the b values come from statistical software and the x-values are specified by us.
• A residual (error) term is calculated as $e_i = y_i - \hat{y}_i$, the difference between an actual and a predicted value of y.

Analysis of variance for testing the significance of regression in multiple regression:

Source | Sum of squares | df | Mean square | $F_0$
--- | --- | --- | --- | ---
Regression | $SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ | $k$ | $MS_R = \frac{SS_R}{k}$ | $\frac{MS_R}{MS_E}$
Residual | $SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ | $n - p$ | $MS_E = \frac{SS_E}{n - p}$ |
Total | $SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$ | $n - 1$ | |

Here $p = k + 1$ is the number of regression parameters. The hypotheses for $F_0$ are:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \qquad H_1: \beta_j \neq 0 \text{ for at least one } j$$

$R^2$ and adjusted $R^2$

We may also use the coefficient of multiple determination $R^2$ or the adjusted $R^2$ as a global statistic to assess the fit of the model. Computationally,

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$$

$$R^2_{adj} = 1 - \frac{SS_E/(n - p)}{SS_T/(n - 1)}$$

III Data processing

1. Data import

Import data from “concrete.csv”.

Figure 1: R code and result of seeing the first six lines of data

To facilitate the calculation as well as detect unknown values in the Excel
file, we will convert all the variables to numeric format, so that any unknown values are converted to NA.

Figure 2: R code used to convert variables to numeric format

2. Checking statistics values

Check the statistical values for all variables in concrete.

Figure 3: R code and statistical values of all variables

As the summary above shows, the data record the compressive strength of concrete at ages of up to a year while the amounts of the components that make up the concrete are varied. The aim is to find the specific mass of each component that produces a block with both economic and long-term value for users and producers.

3. Data visualization

Create a new data frame named data (with the same variables as concrete) and convert the variables cement, slag, flyash, water, superplasticizer, coarseaggregate, fineaggregate, age, csMPa to log(cement + 1), log(slag + 1), log(flyash + 1), log(water + 1), log(superplasticizer + 1), log(coarseaggregate + 1), log(fineaggregate + 1), log(age + 1), log(csMPa + 1) respectively.

Figure 4: R code and results when converting variables to log(x+1)

Reasons for converting to log(x+1):
● Improve the fit of the model: a regression model assumes that the regression errors (residuals) are normally distributed. When they are not, taking the log of a variable rescales it and brings its distribution closer to normal. In addition, when the residuals have non-constant variance caused by the independent variables, we can also convert those variables to log form.
● Interpretation: the log form makes the relationship between two variables more convenient to interpret. If we take the log of both the dependent variable Y and the independent variable X, the regression coefficient β becomes the elasticity coefficient and is interpreted as follows: a 1% increase in X is expected to lead to a β% change in Y (in terms of Y’s mean), …
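The log(x+1) conversion described in this step can be sketched in R as follows. This is a minimal illustration and not the code from Figure 4: the small data frame is a toy stand-in for the contents of “concrete.csv” (its values are invented and only three of the nine columns appear), and the use of `log1p()` with `lapply()` is one possible way to apply the transformation to every column.

```r
# Toy stand-in for the data read from concrete.csv; the values are invented
# for illustration and only three of the nine columns are shown.
concrete <- data.frame(
  cement = c(300, 150, 450),
  slag   = c(0, 120, 40),
  csMPa  = c(35.5, 20.1, 60.2)
)

# log1p(x) computes log(x + 1). A zero entry (for example a mix without slag)
# maps to 0 instead of -Inf, so the shifted log is defined for every row.
data <- as.data.frame(lapply(concrete, log1p))

print(head(data))
```

With the real file, `concrete` would instead come from reading “concrete.csv”; the column names used here follow the variable names listed in the report.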
In this figure, we can see that the graphs of the variable csMPa before and after converting to log(x+1) form are both relatively similar to the graph of a normal distribution. We will continue by drawing the scatter plots of each variable to further test our linear regression model.

Draw a scatter plot to display how the csMPa variable is distributed in relation to the cement variable, both before and after the conversion to log(x+1) form.

Figure 11: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to cement before and after the conversion to log(x+1) form

In the original form, it is very difficult to see the linearity (specifically, the covariance) of the two variables cement and csMPa. After converting to log(x+1) form, the linear relationship between the two variables becomes easier to see, although it remains somewhat uncertain. We will check the next variables.

Draw a scatter plot to display how the csMPa variable is distributed in relation to each of the other variables, both before and after the conversion to log(x+1) form.

Figure 11: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to slag before and after the conversion to log(x+1) form

Figure 11: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to flyash before and after the conversion to log(x+1) form

Figure 11: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to water before and after the conversion to log(x+1) form

Figure 11: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to superplasticizer before and after the conversion to log(x+1) form
Figure 11: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to coarseaggregate before and after the conversion to log(x+1) form

Figure 11: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to fineaggregate before and after the conversion to log(x+1) form

Figure 11: R code and results when plotting the scatter plot and the boxplot showing the distribution of the csMPa variable according to age before and after the conversion to log(x+1) form

In summary, the graphs above suggest that a linear regression model is likely to exist, but the linearity does not seem to hold for all variables. We are going to build a linear regression model with all the variables and check whether the model really holds. In addition, the log(x+1) conversion has clearly given us a better view of the graphs as well as of the linearity of the variables.

4. Building a linear regression model

Consider a linear regression model (lrm1) including:
Dependent variable: csMPa
Independent variables: cement, slag, flyash, water, superplasticizer, coarseaggregate, fineaggregate, age

Figure 11: R code and results when building a linear regression model lrm1
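As a rough sketch of how a model such as lrm1 could be fitted in R, the block below uses `lm()` with csMPa as the response and the eight input variables as predictors. It is not the code from the figure: the data frame is randomly generated with the report's column names so that the example runs on its own, the value ranges and the coefficients used to simulate csMPa are arbitrary assumptions, and the text does not show whether lrm1 is fitted on the original or on the log(x+1) data.

```r
set.seed(42)  # make the synthetic data reproducible
n <- 100

# Synthetic stand-in with the same column names as the report's data frame;
# the numbers are random and carry no engineering meaning.
data <- data.frame(
  cement           = runif(n, 100, 540),
  slag             = runif(n, 0, 350),
  flyash           = runif(n, 0, 200),
  water            = runif(n, 120, 250),
  superplasticizer = runif(n, 0, 30),
  coarseaggregate  = runif(n, 800, 1150),
  fineaggregate    = runif(n, 590, 1000),
  age              = runif(n, 1, 365)
)

# Simulated response (arbitrary coefficients plus normal noise)
data$csMPa <- 5 + 0.08 * data$cement - 0.1 * data$water +
  0.05 * data$age + rnorm(n, sd = 3)

# Multiple linear regression of csMPa on all eight input variables
lrm1 <- lm(csMPa ~ cement + slag + flyash + water + superplasticizer +
             coarseaggregate + fineaggregate + age, data = data)

print(summary(lrm1))  # coefficients with t-tests, R^2, adjusted R^2, overall F
print(anova(lrm1))    # sequential sums of squares for each predictor
```

`summary(lrm1)` reports the quantities introduced in section 2.3 (the overall F statistic, R² and adjusted R²), while `anova(lrm1)` gives a sum of squares for each variable in the order the variables enter the formula.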