FACULTY OF MECHANICAL ENGINEERING
ASSIGNMENT REPORT
PROBABILITY AND STATISTICS - SEMESTER 232
CC02 - GROUP: 08 - FACULTY: CHEMICAL ENGINEERING
INSTRUCTOR: PhD Nguyen Tien Dung
TEAM LIST:
1 Nguyen Minh Thien 2252770
2 Nguyen Minh Chau 2252090
4 Le Nguyen Phuong 2252655
5 Tran Nguyen Anh Thu 2252801
Ho Chi Minh City, April 2024
ASSIGNMENT TASKS
1 Nguyen Minh Thien 20% LaTeX, R code, Data Introduction and Data Preprocessing
2 Nguyen Minh Chau 20% Descriptive Statistics, ROC and AUC theory, Conclusion
3 Tran Nguyen Anh Thu 20% Inferential Statistics, Linear and Logistic Regression, and Decision Tree theory
4 Le Nguyen Phuong 20% Descriptive Statistics, ROC and AUC theory & Extension
5 Le Van Duc Manh 25% Discussion and Extension
1 DATA INTRODUCTION
1.1 DATASET DESCRIPTION
1.2 VARIABLE DESCRIPTION
2 THEORY BACKGROUND
2.1 RECEIVER OPERATOR CHARACTERISTIC (ROC) AND AREA UNDER THE CURVE (AUC)
2.1.1 Receiver Operator Characteristic (ROC)
2.1.2 Area Under the Curve (AUC)
2.2 THEORY OF LOGISTIC REGRESSION
2.2.1 Definition
2.2.2 Odds and odds ratio
2.2.3 Decision boundary
2.2.4 Types of Logistic regression
2.3 LINEAR REGRESSION MODEL AND LOGISTIC REGRESSION MODEL
2.3.1 Linear Regression model
2.3.2 Logistic Regression model
2.4 DECISION TREE
2.4.1 Definition
2.4.2 Regression model and Decision Tree model
3 DATA PREPROCESSING
3.1 DATA READING
3.2 DEALING WITH MISSING DATA
3.3 SUBSTITUTING THE MISSING VALUES
4 DESCRIPTIVE STATISTICS
4.1 DATA SUMMARY
4.2 PLOT DATA
5 INFERENTIAL STATISTICS
5.1 DATA SPLITTING
5.2 BUILDING THE LOGISTIC REGRESSION MODEL
5.2.1 Performing Logistic Regression with the Training Data Set
5.3 ESTIMATING THE 95% CONFIDENCE INTERVAL OF βi AND THE ODDS RATIO (OR)
5.4 TESTING THE ACCURACY OF THE MODEL
5.5 APPLYING THE ROC-AUC METHOD TO CALCULATE THE ACCURACY OF THE MODEL
6 DISCUSSION & EXTENSION
6.1 DISCUSSION
6.2 EXTENSION
6.2.1 Implementation of Decision Tree
6.2.2 Using the ROC-AUC Method to Find Model Accuracy
List of Figures
2.1 ROC-AUC Classification Evaluation Metric
2.2 Logistic regression diagram
2.3 Graph of the relationship between probability value and odds value
2.4 Diagram of a decision tree
2.5 Decision tree of deciding to go surfing
3.6 The results when checking the data of file "water_potability.csv"
3.7 R code and results when checking missing data in print_data
3.8 R code and results when finding the ratio of missing values in print_data
3.9 R code and results when substituting the missing values of the ph variable
4.10 Code and results of descriptive statistics for quantitative variables
4.11 Boxplot of Potability
4.12 The histograms of pH, Hardness and Solids
4.13 The histograms of Chloramines, Sulfate and Conductivity
4.14 The histograms of Organic carbon, Trihalomethanes and Turbidity
4.15 The boxplots of pH, Hardness and Solids with Potability
4.16 The boxplots of Chloramines, Sulfate and Conductivity with Potability
4.17 The boxplots of Organic carbon, Trihalomethanes and Turbidity with Potability
4.18 Code and result for the correlation matrix
4.19 Code and result for the corrplot graph
5.20 Code and result for the AIC index
5.21 The summary of "research"
5.22 Estimate of the 95% confidence interval of βi and the odds ratio (OR)
5.23 Prediction results of the logistic regression model on the test set
5.24 Comparison table of observed data versus predicted data from the logistic regression model
5.25 The confusion matrix in R Studio
5.26 Accuracy value
5.27 The model's ROC curve
5.28 The AUC index value
6.29 The Decision Tree model
6.30 Prediction of Potability using the Decision Tree model on the testing set
6.31 Comparison between observed data and data predicted from the Decision Tree
6.32 The confusion matrix in R Studio
6.33 The accuracy of the Decision Tree
6.34 The ROC graph using the Decision Tree
6.35 The AUC index using the Decision Tree
1 DATA INTRODUCTION
1.1 DATASET DESCRIPTION
Securing access to potable water is indispensable for maintaining health, constitutes a fundamental human right, and forms an integral component of effective public health policy frameworks. Water serves as a medium for biochemical reactions, contributes to the structural components of biological entities, and constitutes over 70% of the human body mass, underpinning the very sustenance of human life.
Defined as water that is hygienic, devoid of color, odor, and taste, and free from pathogenic microorganisms and deleterious substances, clean water is imperative for ensuring health safety. The consumption of contaminated water sources has been associated with the onset of dermatological, hepatic, and gastrointestinal disorders, as well as toxicological effects, which substantially compromise health.

Consequently, the assessment of water quality through analytical processes, using observed values, is of paramount importance. Given its practical implications, our team opted to select this subject as the focal theme for our capstone project.
1.2 VARIABLE DESCRIPTION
The data set used in this report concerns the analysis of water samples to determine which water bodies are potable. Some basic information:
• Title: Water Quality
• Population: Water bodies around the world
• Sample: 3276 different water bodies
• Observed Values: One set of observed values for each water body
• Number of random variables: 10 (details in the following table)
Table 1.1 Setting Parameters

Variable        | Domain                          | Unit    | Description
Hardness        | x ∈ Q, cont                     | mg/L    | The ability of water to make soap precipitate
Sulfate         | x ∈ Q, cont                     | mg/L    | The amount of sulfate dissolved in water
Conductivity    | x ∈ Q, 181.5 ≤ x ≤ 753.3, cont  | µS/cm   | The conductivity of water
Organic carbon  | x ∈ N, 2.20 ≤ x ≤ 28.30, cont   | ppm     | The amount of organic compounds, especially carbon, in water
Trihalomethanes | x ∈ Q, 0.738 ≤ x ≤ 124, cont    |         | The amount of Trihalomethanes in water
Turbidity       | x ∈ Q, 1.450 ≤ x ≤ 6.739, cont  | NTU     | Measures the glowing characteristic of water
Potability      | x = 0 or x = 1, categorical     | No unit | The potability of water
2 THEORY BACKGROUND
2.1 RECEIVER OPERATOR CHARACTERISTIC (ROC) AND AREA UNDER THE CURVE (AUC)
2.1.1 Receiver Operator Characteristic (ROC)
The receiver operating characteristic (ROC) curve is a statistical relationship that is commonly employed in radiology, particularly for determining limits of detection and screening. The curves on the graph show the inherent trade-off between sensitivity and specificity. The y-axis represents sensitivity and the x-axis shows 1 − specificity (the false positive rate). A perfect test would be completely sensitive with no false positives (100%).

This curve would pass through the point in the very top left corner (Figure 2.2). A useless diagnostic test is no better than chance (a straight line through the origin). To create a ROC curve between these two extremes, choose the cut-off points for sensitivity and specificity for the modality, condition, and patient group in issue (Figure 2.3).
2.1.2 Area Under the Curve (AUC)
AUC stands for Area Under the Curve, and it reflects the area beneath the ROC curve. It evaluates the overall performance of a binary classification model. As both the true positive rate (TPR) and false positive rate (FPR) range from 0 to 1, the area will always be between 0 and 1, and a higher AUC value indicates better model performance. Our primary goal is to maximize this area in order to achieve the maximum TPR and the lowest FPR at the specified threshold. The AUC equals the likelihood that the model will assign a randomly selected positive instance a higher predicted probability than a randomly selected negative instance.

It shows the likelihood that our model can distinguish between the two classes found in our target.
Figure 2.1 ROC-AUC Classification Evaluation Metric
The AUC-ROC curve for a test can also be used to examine the test's discriminatory capability. It gives crucial information about how well the test performs in a certain clinical context. The closer the AUC-ROC curve is to the upper left corner of the graph, the more efficient the test.
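To make the construction concrete, the ROC points and the AUC can be computed directly in base R. This is an illustrative sketch on a small made-up set of scores and labels, not output from the report's model:

```r
# Hypothetical labels (1 = positive) and predicted scores, for illustration only
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)

# Sweep thresholds over the observed scores: at each cut-off, TPR is the share
# of positives scored at or above it, FPR the share of negatives
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# AUC by the trapezoidal rule over the ROC points, anchored at (0,0) and (1,1)
x <- c(0, fpr, 1); y <- c(0, tpr, 1)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc   # 0.75 for these made-up scores
```

An AUC of 0.5 would correspond to the chance-level diagonal described above; in practice, packages such as pROC automate this computation.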
2.2 THEORY OF LOGISTIC REGRESSION
2.2.1 Definition
Logistic regression is a data analysis technique that uses mathematics to estimate the probability of a particular outcome, based on a given data set of independent variables. Predictions usually yield a finite number of outcomes, such as yes or no.
To understand more easily, consider predicting the probability of a student passing an exam based on their study hours and average grade. Logistic regression helps determine whether a student is likely to pass the exam based on factors like study hours and average grade, providing a probability score ranging from 0 to 1 indicating the likelihood of the student passing.

This type of statistical model (also known as a logit model) is often used for classification and predictive analytics. Since the outcome is a probability, the dependent variable is bounded between 0 and 1. The logistic regression equation can be represented as:

P(X) = e^(Xβ) / (1 + e^(Xβ)) = 1 / (1 + e^(−Xβ))

In which:
• P(X): Success probability: the probability that the dependent variable equals success (i.e., equals 1) rather than failure
• Xβ = β0 + β1X1 + β2X2 + · · · + βkXk: the linear predictor
• Xk: Independent variables
• βk: coefficients associated with the independent variables
If we draw the logistic function, we can see an S-curve that increases monotonically, as the picture below shows:
Figure 2.2 Logistic regression diagram
Logistic regression also models equations between multiple independent variables and one dependent variable. It uses the sigmoid function to squash the output into the interval (0, 1). The sigmoid function has the formula: f(x) = 1 / (1 + e^(−x))

This method tests different values of beta through multiple iterations to optimize for the best fit of the log odds. Each time it tries a set of values, it calculates how likely the observed data are under those values; this is called the "log-likelihood function." The goal is to find the values that make this likelihood as high as possible. Once the optimal coefficient (or coefficients, if there is more than one independent variable) is found, the model computes the conditional probabilities for each observation, subsequently logarithmically transforming and summing them to obtain the overall log-likelihood.
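In R, this maximum-likelihood fitting is what glm() with a binomial family performs. A minimal sketch on a tiny made-up pass/fail dataset (the variable names and values are invented for illustration):

```r
# Made-up data: exam outcome versus hours studied (illustrative only)
study <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  pass  = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# glm() finds the beta coefficients that maximize the log-likelihood
fit <- glm(pass ~ hours, data = study, family = binomial)
coef(fit)   # beta_0 (intercept) and beta_1 (slope for hours)

# Predicted probability of passing for a new student
predict(fit, newdata = data.frame(hours = 4.5), type = "response")
```

The same glm() call, with Potability as the response, is the workhorse of the model built later in this report.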
2.2.2 Odds and odds ratio
Mathematically, the odds is the probability of success divided by the probability of failure (assume that p is the probability of success):

Odds = p / (1 − p)
• Odds > 1: the probability of success is higher than that of failure.
• Odds < 1: the probability of success is lower than that of failure.
• Odds = 1: the probability of success is equal to that of failure.
In logistic regression, a logit transformation is applied to the odds. This is also commonly known as the log odds, or the natural logarithm of the odds, and it is represented by the following formula:

logit(p) = ln(p / (1 − p)) = b0 + b1X1 + b2X2 + ... + bnXn

Log odds can be difficult to make sense of within a logistic regression data analysis. As a result, exponentiating the beta estimates is common to transform the results into an odds ratio (OR), easing the interpretation of results. The OR represents the odds that an outcome will occur given a particular event, compared to the odds of the outcome occurring in the absence of that event. In practice, OR is a measure of the association between a dependent (binary) variable and predictor variables. OR can be represented by this formula:

OR = odds1 / odds0 = odds(p | x = x0 + 1) / odds(p | x = x0)

Figure 2.3 Graph of the relationship between probability value and odds value
The higher the odds ratio of a prediction is, the higher the chance it will be the positive label. If:
• OR > 1: log(OR) > 0 - the event rate in the first group is higher than in the second group
• OR < 1: log(OR) < 0 - the event rate in the first group is lower than in the second group
• OR = 1: log(OR) = 0 - the event rates in both groups are equal
No matter how low a negative log(OR) value is, it can still be converted back to an odds value by taking the antilog. This value then becomes the dependent variable of the Logistic Regression model.
To use an example, suppose we were to estimate the odds of survival on the Titanic given that the person was male, and the odds ratio for males was 0.0810. We would interpret this as the odds of survival for males being lower by a factor of 0.0810 compared to females, holding all other variables constant.
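The odds and odds-ratio arithmetic above is easy to reproduce in R; the coefficient value below is hypothetical, chosen so that its exponential matches the 0.0810 odds ratio in the Titanic illustration:

```r
# A probability of 0.8 corresponds to odds of 4: success is four times
# as likely as failure
p <- 0.8
odds <- p / (1 - p)
odds   # 4

# Exponentiating a fitted coefficient gives the odds ratio for a one-unit
# change in that predictor (b1 here is a hypothetical "male" coefficient)
b1 <- -2.5135
exp(b1)   # about 0.081
```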
2.2.3 Decision boundary
The decision boundary is the border or dividing line between classes or groups in classification problems. For instance, in assessing whether water is clean or not, the decision boundary serves as the dividing line between two types of water: clean water and contaminated water. We utilize chemical indicators of water to construct a classification model. When there is a new water sample, the model uses the decision boundary to determine whether the water is clean or contaminated based on the measured chemical indicators.
Thresholding is the process of converting continuous probability values (values between 0 and 1) predicted by a model (e.g., a logistic regression model) into discrete classes (true/false or 1/0) in binary classification problems. To map a probability into discrete categories (true/false), we need to choose a threshold value: if the probability is greater than this value, we classify the observation into one category, and if it is lower, into the remaining category. For binary classification, a probability less than 0.5 is predicted as 0; otherwise, it is predicted as 1:

class = 1 if p ≥ 0.5
class = 0 if p < 0.5

For instance, if the threshold value is 0.5 and the prediction is 0.8, we can classify that data point as positive. If the prediction is 0.2, we can classify that data point as negative. For multi-class logistic regression, we can choose the category that has the highest predicted probability.
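This thresholding rule is a one-liner in R; the probabilities below are made up for illustration:

```r
# Predicted probabilities from some classifier (hypothetical values)
probs <- c(0.92, 0.20, 0.51, 0.49)

# Apply the 0.5 decision boundary: p >= 0.5 -> class 1, otherwise class 0
pred_class <- ifelse(probs >= 0.5, 1, 0)
pred_class   # 1 0 1 0
```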
2.2.4 Types of Logistic regression
There are three types of logistic regression models, which are defined based on the categorical response:

• Binary logistic regression: In this approach, the dependent variable has only two possible outcomes, typically represented as 0 or 1. This approach is frequently utilized in various fields, such as predicting email spam or determining tumor malignancy. It is the most widely used form of logistic regression and a common choice for binary classification tasks.

• Multinomial logistic regression: This type of logistic regression deals with dependent variables having three or more unordered outcomes. For instance, in the movie industry, predicting which genre a viewer prefers among several options helps studios market their films effectively. By considering factors like age, gender, and relationship status, a multinomial logistic regression model can gauge their influence on movie preferences, aiding in targeted advertising campaigns.

• Ordinal logistic regression: This type of logistic regression model is leveraged when the response variable has three or more possible outcomes which have a defined order. Examples of ordinal responses include grading scales from A to F or rating scales from 1 to 5.
2.3 LINEAR REGRESSION MODEL AND LOGISTIC REGRESSION MODEL
Both linear and logistic regression are among the most popular models within data science, andopen-source tools, like Python and R, make the computation for them quick and easy
2.3.1 Linear Regression model
Linear regression models are used to identify the relationship between a continuous dependent variable and one or more independent variables. When there is only one independent variable and one dependent variable, it is known as simple linear regression; as the number of independent variables increases, it is referred to as multiple linear regression. In both types, the objective is to establish a line of best fit through the data set, usually computed using the least squares method. (The least squares method calculates the distance of each point from the line, squares these distances to make them positive, and then adds them up. The line with the smallest sum of these squared distances is deemed the best-fitting line.)

A continuous variable can take a range of values, such as price or age. Therefore, linear regression can predict the actual value of the dependent variable. This technique can answer questions such as "What will the price of rice be in 10 years?" The linear regression equation is:

y = β0 + β1x1 + β2x2 + ... + βnxn + ε (the βi are regression coefficients and ε is the error term)
2.3.2 Logistic Regression model
Similar to linear regression, logistic regression is also used to estimate the relationship between a dependent variable and one or more independent variables, but it is used to make a prediction about a categorical variable rather than a continuous one. A categorical variable can be true or false, yes or no, 1 or 0, et cetera. The unit of measure also differs from linear regression, as it produces a probability, and the logit function transforms the S-curve into a straight line. This technique cannot predict the true value for continuous data. It can answer questions such as "Will the price of rice increase by 50% in the next 10 years?"
While both models are used in regression analysis to make predictions about future outcomes, linear regression is typically easier to understand. Linear regression also does not require as large a sample size, since logistic regression needs an adequate sample to represent values across all the response categories. Without a larger, representative sample, the model may not have sufficient statistical power to detect a significant effect.
Given its use in solving classification problems, logistic regression will be used in this project to determine water quality: we model the relationship between variables such as pH, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes, and turbidity of each area, and use it to predict whether the water in that place is drinkable or not.
2.4 DECISION TREE
2.4.1 Definition
Linear and logistic regression models will not work in cases where the relationship between characteristics and outcomes is nonlinear, and the Decision Tree model solves that. A decision tree is a non-parametric supervised learning algorithm, utilized for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation. It has a hierarchical tree structure, which consists of a root node, branches, internal nodes, and leaf nodes.
As the diagram below shows (Figure 2.4): a decision tree starts with a root node, which does not have any incoming branches. The outgoing branches from the root node then feed into the internal nodes, also known as decision nodes. Based on the available features, both node types conduct evaluations to form homogeneous subsets, which are denoted by leaf nodes, or terminal nodes. The leaf nodes represent all the possible outcomes within the dataset.
To predict results at each leaf branch, the average result of the data at that node is used. This type of flowchart structure also creates an easy-to-digest representation of decision-making, allowing different groups across an organization to better understand why a decision was made.
Figure 2.4 Diagram of a decision tree
Like logistic regression, the decision tree model can set a threshold to predict an outcome of "Yes" or "No". At each leaf branch, we have a group of observations, which can include both outcomes (Yes/No); in addition, we can calculate the percentage of the group for each result.

As an example, imagine that you were trying to decide whether or not you should go surf; you might use the following decision rules to make a choice:
Figure 2.5 Decision tree of deciding to go surfing
The complexity of a decision tree largely determines whether data points are grouped into homogeneous sets. Smaller trees are better at achieving pure leaf nodes, but as trees grow, maintaining purity becomes harder, leading to data fragmentation and potential overfitting. Decision trees prefer simplicity, adhering to Occam's Razor principle: "entities should not be multiplied beyond necessity." Pruning, the removal of branches on less important features, helps reduce complexity and prevent overfitting.

2.4.2 Regression model and Decision Tree model
The Logistic Regression model demonstrates the importance of variables, but does not provide an explanation of how predictions are made. The Decision Tree model is easier to interpret than the Logistic Regression model, and Decision Trees can capture nonlinear interactions among variables that Logistic Regression cannot express.
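A decision tree can be sketched in R with the widely used rpart package; this illustrative example uses R's built-in iris dataset rather than the water data, but the same workflow carries over:

```r
library(rpart)   # recursive partitioning; ships with standard R installations

# Fit a classification tree predicting species from the four measurements
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)   # the learned decision rules, one line per node

# Predict classes on the training data and measure accuracy
pred <- predict(tree, iris, type = "class")
mean(pred == iris$Species)
```

The pruning described above corresponds to rpart's complexity-parameter controls, e.g. refitting or calling prune() with a chosen cp value.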
3 DATA PREPROCESSING
3.1 DATA READING
Read the data from the file "water_potability.csv" and assign it the name print_data, then check the data to make sure that R Studio has read all the data successfully:
Result

Figure 3.6 The results when checking the data of file "water_potability.csv"
3.2 DEALING WITH MISSING DATA
To check whether our data has missing values, we use the command colSums(is.na(print_data)) to find the total number of missing values per variable. If the result for a variable is 0, there is no missing data in that variable.
Figure 3.7 R code and results when checking missing data in print_data
Comment: From the result, there are 491 missing values of the "ph" variable and 781 missing values of the "Sulfate" variable.
The ratio of missing values:
Figure 3.8 R code and results when finding the ratio of missing values in print_data
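The check and the ratio computation can be sketched as follows, using a small made-up data frame in place of print_data:

```r
# Illustrative stand-in for the report's print_data data frame
df <- data.frame(ph      = c(7.0, NA, 6.5, NA),
                 Sulfate = c(330, 310, NA, 350))

colSums(is.na(df))    # number of missing values per column: ph 2, Sulfate 1
colMeans(is.na(df))   # ratio of missing values per column: ph 0.5, Sulfate 0.25
```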
3.3 SUBSTITUTING THE MISSING VALUES
For the pH value, scientists have shown that the hardness of water has a great impact on the pH level. Therefore, the pH values are divided into 4 different groups according to the following criteria:
• Potability = 0, Hardness <= 150
• Potability = 0, Hardness > 150
• Potability = 1, Hardness <= 150
• Potability = 1, Hardness > 150
Figure 3.9 R code and results when substituting the missing values of the ph variable
Then, substitute the missing values in each group with the group's mean value:
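A minimal sketch of this group-wise mean imputation, with made-up values and ave() standing in for whatever grouping code the report uses:

```r
# Made-up sample with missing ph values
df <- data.frame(
  ph         = c(7.0, NA, 6.5, NA, 8.0, 7.5),
  Hardness   = c(120, 130, 200, 210, 140, 160),
  Potability = c(0,   0,   0,   0,   1,   1)
)

# The four groups: each combination of Potability and (Hardness <= 150)
grp <- interaction(df$Potability, df$Hardness <= 150)

# Within each group, replace NA by that group's mean ph
df$ph <- ave(df$ph, grp, FUN = function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
df$ph   # 7.0 7.0 6.5 6.5 8.0 7.5
```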
Missing values in the "Sulfate" variable
Seawater typically comprises approximately 2,700 milligrams per liter (mg/L) of sulfate. In contrast, the sulfate concentration in most freshwater sources varies from 3 to 30 mg/L. However, certain geographical regions exhibit considerably higher levels, reaching up to 1,000 mg/L. Regrettably, the available dataset only encompasses sulfate concentrations ranging from 129 to 481 mg/L. Consequently, this analysis will be limited to calculating the mean values of potable and non-potable samples, with the observed discrepancy being a mere 2 mg/L.

Missing values in the "Trihalomethanes" variable
Trihalomethanes (THMs) are chemical compounds present in water that has been treated with chlorine. The concentration of THMs in potable water is influenced by several factors, including the level of organic material in the water, the requisite amount of chlorine for water treatment, and the temperature of the water during treatment. A concentration of THMs up to 80 parts per million (ppm) is deemed safe for drinking water. For the purposes of this analysis, missing data points will be imputed using the mean value of the Trihalomethanes variable.
Rechecking the missing values
Input
Result
4 DESCRIPTIVE STATISTICS
4.1 DATA SUMMARY
After handling the missing data, we summarized the data with the summary() command to obtain descriptive statistics for each variable.
Figure 4.10 Code and Results of descriptive statistics for quantitative variables
The descriptive statistics include the minimum ("Min"), mean ("Mean"), maximum ("Max"), and quartiles ("1st Qu.", "Median", "3rd Qu."). The variable ph represents the pH of the water sample: the smallest value ("Min") is 0; the greatest value ("Max") is 14; the average pH ("Mean") is 7.104; the median ("Median") is 7.104, which is extremely close to the average, implying that about 50% of the observed water samples have pH values lower than the average pH value. The first quartile (Q1) is 6.278, indicating that at least 25% of the observations have pH values less than or equal to 6.278 and at least 75% of the data have pH greater than 6.278. The third quartile (Q3) is 7.870, indicating that at least 75% of the observed water samples have pH values less than or equal to 7.870 and at least 25% of the data have pH greater than 7.870.
Statistics for the categorical variable Potability: Potability is a categorical variable, so summary() alone does not give a useful overview of it. As a result, we perform descriptive statistics on this variable by generating a frequency table.

Comment: After summarizing the data, we obtain the categorical variable "Potability" with the number of observations for each value of the variable; precisely, 1998 observed water samples have the value 0, indicating "cannot drink," and 1278 observed water samples have the value 1, indicating "can drink".
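The frequency table itself comes from R's table() function; a sketch on made-up values (the real counts above are 1998 and 1278):

```r
# Hypothetical Potability values standing in for the full column
potability <- c(0, 0, 1, 0, 1)

table(potability)               # counts per category
prop.table(table(potability))   # the same table as proportions
```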
4.2 PLOT DATA
Because the variable "Potability" is a categorical variable, a barplot is the most suitable plot for this kind of variable.
Figure 4.11 Boxplot of Potability
Comment: The number of non-potable water samples exceeds the number of drinkable ones. For continuous data, we use boxplot and histogram graphs to demonstrate the distribution of the variables based on the Potability groups:
With the following histogram graphs (Figure 4.12):
Figure 4.12 The histogram of pH, Hardness and Solids