Probability and Statistics Course Project – ĐH BK


Probability and Statistics Project, Class CC04
Instructor: Prof. Nguyễn Tiến Dũng

Table of Contents

I. Topic
   - Context
   - Attribute Information
II. Theory
   - Logistic Regression Analysis
   - Method of Maximum Likelihood
   - Optimal Model Selection
   - Information about the Data
     a. Types of Angina (ChestPainType)
     b. Serum Cholesterol Concentration
     c. Resting ECG Results (RestingECG)
     d. Exercise Intensity (ST_Slope)
III. Code
   - Import the Data
   - Clean the Data
   - Data Structure Overview
   - Tables of Count Statistics (a. Sex, b. ChestPainType, c. FastingBS, d. RestingECG, e. ExerciseAngina, f. ST_Slope, g. HeartDisease)
   - Histograms and Barplots
     a. Age and RestingBP histograms
     b. Cholesterol and MaxHR histograms
     c. Oldpeak histogram
     d. Sex and ChestPainType barplots
     e. FastingBS and RestingECG barplots
     f. ExerciseAngina, ST_Slope and HeartDisease barplots
     g. Sex and ChestPainType barplots by HeartDisease
     h. FastingBS and RestingECG barplots by HeartDisease
     i. ExerciseAngina and ST_Slope barplots by HeartDisease
     j. Age vs HeartDisease histogram
     k. RestingBP vs HeartDisease histogram
     l. Cholesterol vs HeartDisease histogram
     m. MaxHR vs HeartDisease histogram
     n. Oldpeak vs HeartDisease histogram
   - Model
     a. Summary of the model
     b. Initial model with 11 variables
     c. Model after eliminating RestingECG
     d. Model after eliminating RestingBP
     e. Model after eliminating MaxHR
     f. Prediction for the training set
     g. Prediction for the testing set
IV. References

I. Topic

Context

Cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 32% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict a possible heart disease. People with cardiovascular disease, or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease), need early detection and management, and a machine learning model can be of great help here.

Attribute Information

- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: normal, ST: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression [numeric value]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: normal]

II. Theory

Logistic Regression Analysis

Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response variable and one or more explanatory variables. It is often the case that the outcome variable is discrete, taking on two or more possible values. Over the last decade, the logistic regression model has become, in many fields, the standard method of analysis in this situation.
In linear regression analysis and analysis of variance, we model the relationship between a continuous dependent variable and one or more independent variables, which may be continuous or discrete. In many cases, however, the dependent variable is not continuous but binary: yes/no, ill/healthy, deceased/alive, occurred/did not occur, and so on, while the independent variables can still be continuous or discrete. We want to describe the relationship between the independent variables and such a dependent variable. We will focus on the case of a binary response variable coded with 0 and 1; in practice, these 0s and 1s code for two classes such as yes/no, win/lose, ill/healthy, etc.:

$$Y = \begin{cases} 1, & \text{yes} \\ 0, & \text{no} \end{cases}$$

Given an event observed x times among n subjects, we can estimate the probability of that event as $p = x/n$.

First, we define some notation that we will use throughout:

$$p(x) = P[Y = 1 \mid X = x]$$

With a binary (Bernoulli) response we mostly focus on the case Y = 1, since with only two possibilities it is trivial to obtain the probability of Y = 0:

$$P[Y = 0 \mid X = x] + P[Y = 1 \mid X = x] = 1 \quad\Rightarrow\quad P[Y = 0 \mid X = x] = 1 - p(x)$$

Another way of expressing risk is the odds. The odds of an event are defined as the ratio of the probability of the event occurring to the probability of the event not occurring:

$$\text{odds} = \frac{p}{1 - p}$$

We now define the logit transformation:

$$\operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$

The relationship between p and logit(p) is continuous, and the logistic regression model has the form

$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1}$$

Immediately we notice some similarities to ordinary linear regression, in particular the right-hand side: this is our usual linear combination of the predictors. The left-hand side is the log(odds), the logarithm of the odds. The odds are the probability of a positive event (Y = 1) divided by the probability of a negative event (Y = 0). So when the odds equal 1, the two events have equal probability; odds higher than 1 favor a positive event, and the opposite is true when the odds are lower than 1.

With logistic regression, which uses the Bernoulli distribution, we only need to estimate the Bernoulli distribution's single parameter p(x), which happens to be its mean. So even though we introduced ordinary linear regression first, in some ways logistic regression is actually simpler. Note that applying the inverse logit transformation allows us to obtain an expression for p(x):

$$p(x) = P[Y = 1 \mid X = x] = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1}}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1}}}$$

With n observations, we write the model indexed by i to note that it is applied to each observation:

$$\log\left(\frac{p(x_i)}{1 - p(x_i)}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{p-1} x_{i(p-1)}$$

Applying the inverse logit transformation gives $P[Y_i = 1 \mid X_i = x_i]$ for each observation; since these are probabilities, it is good that we used a function that returns values between 0 and 1:

$$p(x_i) = P[Y_i = 1 \mid X_i = x_i] = \frac{e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_{p-1} x_{i(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_{p-1} x_{i(p-1)}}}$$

$$1 - p(x_i) = P[Y_i = 0 \mid X_i = x_i] = \frac{1}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_{p-1} x_{i(p-1)}}}$$

Given a single independent variable x (continuous or discrete), the logistic regression model states that logit(p) = α + βx. As in the linear regression model, α and β are two parameters that need to be estimated from the research data, but their meaning, especially that of β, is very different from the meaning we are used to in linear regression models:

$$\operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \alpha + \beta x$$

$$\operatorname{odds}(p) = \frac{p}{1 - p} = e^{\alpha + \beta x}$$
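As a quick illustration of these transformations (not part of the original report), the following R snippet converts a probability to odds and log-odds and back, using base R's qlogis()/plogis(); all numbers are arbitrary examples.

```r
# Relationship between probability, odds and log-odds (logit)
p <- 0.75                                   # an example probability
odds <- p / (1 - p)                         # odds = 3: the event is 3 times as likely as its complement
log_odds <- log(odds)                       # the logit of p
stopifnot(all.equal(log_odds, qlogis(p)))   # qlogis() is base R's logit function

# Inverse logit: recover p from a linear predictor beta0 + beta1 * x
beta0 <- -1.0; beta1 <- 0.5; x <- 3         # illustrative coefficients, not from the report
eta   <- beta0 + beta1 * x                  # linear predictor on the log-odds scale
p_hat <- exp(eta) / (1 + exp(eta))          # inverse logit, as in the formula above
stopifnot(all.equal(p_hat, plogis(eta)))    # plogis() is base R's inverse logit
p_hat                                       # a probability strictly between 0 and 1
```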
Method of Maximum Likelihood

The parameters of the model can be estimated by the method of maximum likelihood. This is a quite general technique, similar to the least-squares method in that it finds the set of parameters that optimizes a goodness-of-fit criterion (in fact, the least-squares method itself is a slightly modified maximum-likelihood procedure). The likelihood function L(β) is simply the probability of the entire observed data set for varying parameters.

To "fit" this model, that is, to estimate the parameters

$$\beta = [\beta_0, \beta_1, \dots, \beta_{p-1}]$$

we use maximum likelihood. We first write the likelihood given the observed data:

$$L(\beta) = \prod_{i=1}^{n} P[Y_i = y_i \mid X_i = x_i]$$

This is already technically a function of the β parameters, but we do some rearrangement to make it more explicit:

$$L(\beta) = \prod_{i:\, y_i = 1} p(x_i) \prod_{j:\, y_j = 0} \bigl(1 - p(x_j)\bigr)$$

$$L(\beta) = \prod_{i:\, y_i = 1} \frac{e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_{p-1} x_{i(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_{p-1} x_{i(p-1)}}} \; \prod_{j:\, y_j = 0} \frac{1}{1 + e^{\beta_0 + \beta_1 x_{j1} + \dots + \beta_{p-1} x_{j(p-1)}}}$$

Unfortunately, unlike ordinary linear regression, there is no analytical solution to this maximization problem; it must be solved numerically, which is where computer software is required. For our problem, R can take care of this for us using an iteratively reweighted least squares algorithm. We leave the details to a machine learning or optimization course, which would likely also discuss alternative optimization strategies.

Logistic regression analysis belongs to the class of generalized linear models. These models are characterized by their response distribution (here the binomial distribution) and a link function, which transfers the mean value to a scale on which its relation to the background variables is linear and additive. In a logistic regression analysis, the link function is $\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)$.
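To make the estimation procedure concrete, here is a small illustrative R sketch (not from the report): it simulates Bernoulli data, writes the log-likelihood derived above, maximizes it numerically with optim(), and checks that glm() with family = binomial, which uses iteratively reweighted least squares, gives essentially the same estimates. All variable names and values are illustrative.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
p <- plogis(-0.5 + 1.2 * x)            # true model: logit(p) = -0.5 + 1.2 x
y <- rbinom(n, size = 1, prob = p)     # Bernoulli responses

# Negative log-likelihood of the logistic model, following the derivation above:
# log L = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

# Numerical maximum likelihood
fit_optim <- optim(c(0, 0), negloglik, method = "BFGS")
fit_optim$par

# glm() solves the same maximization with iteratively reweighted least squares
fit_glm <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit_glm)                          # should agree closely with fit_optim$par
```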
Optimal Model Selection

One of the difficult problems in multivariable logistic regression analysis is choosing a model that can adequately describe the data. Suppose we are conducting a study with a dependent variable y and independent variables x1, x2 and x3; we can build the following models to predict y: y = f(x1), y = f(x2), y = f(x3), y = f(x1, x2), y = f(x1, x3), y = f(x2, x3) and y = f(x1, x2, x3), where f is a function. In general, with k independent variables x1, x2, ..., xk, we can have up to 2^k − 1 different models to choose from.

It is often the case that when a model has too many independent variables, some of them do not contribute to the outcome of the model, because there may be no correlation between those independent variables and the dependent variable. In this situation, it is recommended that we detect and remove these unnecessary variables so that our estimates are not skewed; a further advantage of removing them is that the interpretation of the model becomes much easier. When a model with fewer independent variables predicts the data as well as, or better than, a model with more independent variables, the smaller model is chosen for simplicity and accuracy. It must be noted, however, that the removed independent variables must be confirmed to have no statistically significant association with the dependent variable, that is, no observed correlation, as it is counterproductive to give up accuracy for clarity.

The adequacy criterion means that the model must describe the data satisfactorily, i.e. it must predict values close (or as close as possible) to the actual observed values of the dependent variable y. For example, if the observed value of y is 10, a model whose prediction lies closer to 10 must be considered more adequate than a model that predicts 6.

The criterion of "practical significance", as it is called, means that the model must be supported by theory or have biological significance (in biological research), clinical significance (in clinical studies), and so on. It is possible that phone numbers are somehow correlated with fracture rates, but of course such a model makes no sense, as correlation does not imply causation. This is an important criterion, because if a statistical analysis produces a model that makes a lot of mathematical sense but has no practical significance, then the model is just a numbers game without any real scientific value. This criterion of practical significance belongs to the realm of subject-matter theory, and we will not discuss it further here.

An important and useful metric for deciding on a simple yet adequate model is the Akaike Information Criterion (AIC):

$$AIC = -2\log(\text{Likelihood}) + 2k, \qquad k = \text{number of parameters}$$

A simple and adequate model should have as low an AIC value as possible, and its independent variables must be statistically significant. Thus, the problem of finding a simple and adequate model is in practice a search for the model (or models) with the lowest, or near-lowest, AIC value; a short sketch illustrating this appears at the end of this theory section.

Information about the Data

a. Types of angina (ChestPainType)

+ Typical angina (TA): Typical angina has the following features: pain that feels like strangulation, tightness or pressure in the left chest or behind the breastbone, which may radiate to the chin or the left arm. It appears in a regular pattern and increases after exertion, strong emotion or exposure to cold. An episode usually lasts only a few minutes.

+ Atypical angina (ATA): Atypical angina is different from classic angina; there are no signs of an ischaemic heart attack. Its symptoms can range from dull or sharp pain to tearing pain, or accompanying symptoms such as shortness of breath or back pain.

+ Non-anginal pain (NAP): Chest pain caused by blockage of a coronary artery without a heart attack occurring. Symptoms include chest discomfort with or without shortness of breath, nausea and sweating. Diagnosis is by electrocardiogram and the presence or absence of serologic findings.

+ Asymptomatic (ASY): no chest pain.

b. Serum cholesterol concentration

The higher the level of HDL cholesterol in the blood, the lower the risk of cardiovascular disease. Still, when HDL cholesterol falls below 40 mg/dL, the risk of cardiovascular disease increases.

c. Resting ECG results (RestingECG)

+ ST: The ST segment indicates that the depolarization of the ventricular myocardium has completed. Usually the ST segment lies level with the isoelectric line, like the PR (or TP) interval; sometimes it lies slightly above the isoelectric line.

d. Exercise intensity (ST_Slope)

+ Up: the intensity of the exercise increases from light to heavy.
+ Flat: the intensity of the exercise does not change.
+ Down: the intensity of the exercise decreases from heavy to light.

And some other related medical information.
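The sketch below (referred to above) illustrates how AIC is computed and used to compare candidate models in R; the data are simulated and purely illustrative, not the report's heart data. step() automates the backward elimination used in Section III.

```r
set.seed(2)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)             # x3 is pure noise
y  <- rbinom(n, 1, plogis(-0.3 + 0.8 * x1 - 0.6 * x2))

full  <- glm(y ~ x1 + x2 + x3, family = binomial)
small <- glm(y ~ x1 + x2,      family = binomial)

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
c(manual  = -2 * as.numeric(logLik(full)) + 2 * attr(logLik(full), "df"),
  builtin = AIC(full))

AIC(full, small)                 # the smaller AIC indicates the preferred model

# Backward elimination by AIC, the strategy used for the heart-disease model below
step(full, direction = "backward", trace = 0)
```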
III. Code

Import the data

Clean the data

Comments: There are no missing values to handle in the data frame df.

Data structure overview

f. ExerciseAngina, ST_Slope and HeartDisease barplots

g. Sex and ChestPainType barplots by HeartDisease

Comments: The percentage of patients with heart disease in the male group is higher than in the female group. The proportion of patients with heart disease in the group of patients with chest pain is higher than in patients with other symptoms.

h. FastingBS and RestingECG barplots by HeartDisease

Comments: The proportion of patients with heart disease in the group with fasting blood sugar above 120 mg/dl is higher than in the group with fasting blood sugar below 120 mg/dl. The percentage of patients with heart disease in the group with abnormal ECG results is higher than in patients with other ECG results. However, because all three resting-ECG groups show a higher proportion of people with the disease, this variable does not help much in predicting the probability of disease.

i. ExerciseAngina and ST_Slope barplots by HeartDisease

Comments: The percentage of patients with heart disease in the group with angina during exercise is higher than in the group without exercise angina. The proportion of patients with heart disease in the group with a flat ST slope during exercise is higher than in the other groups.

=> Based on the plots, the variables ST_Slope, ExerciseAngina, FastingBS, ChestPainType, Oldpeak, MaxHR, Cholesterol and Age appear to have an effect on predicting the probability of disease.

j. Age vs HeartDisease histogram

Comments: The average age of patients with the disease is higher than that of patients without the disease, so it can be said that older people have a higher risk of heart disease. The highest percentage of heart disease occurs in the group aged 56 and 57; in contrast, the smallest comes from the under-30 age group.

=> This is not a normal distribution, because there is extra mass to the right of the mean.

k. RestingBP vs HeartDisease histogram

Comments: The mean resting blood pressure of patients with the disease is higher than that of patients without the disease. In general, however, the frequency distributions of people with and without the disease are comparable, so measuring resting blood pressure alone does not predict the probability of developing cardiovascular disease.

=> This is not a normal distribution, because the values to the right of the mean are higher and fluctuate.

l. Cholesterol vs HeartDisease histogram

Comments: The mean cholesterol of patients with the disease is lower than that of patients without the disease. However, the number of patients with the disease whose cholesterol exceeds 300 is consistently higher than that of patients without the disease, so this variable is still reasonably useful for predicting the probability of heart disease.

=> This is not a normal distribution, because the values to the left of the mean are higher and change unpredictably.

m. MaxHR vs HeartDisease histogram

Comments: The mean maximum heart rate of patients with the disease is lower than that of patients without the disease; the maximum heart rate of people with the disease is distributed lower than that of people without the disease, which helps us predict the probability of heart disease.

=> This is not a normal distribution, because it does not rise to a single peak and fall off gradually on both sides.

n. Oldpeak vs HeartDisease histogram

Comments: The mean ST-depression (Oldpeak) of patients with the disease is higher than that of patients without the disease, and its distribution for people with the disease lies higher than for people without the disease. Therefore, this is a factor that helps us predict the probability of disease.

=> This is not a normal distribution, because it does not rise to a single peak and fall off gradually on both sides.
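The original report presents the R code and its output for this section as screenshots, so the commands themselves did not survive extraction. The following is a minimal sketch of the import, cleaning and exploration steps described above, assuming the Kaggle CSV has been saved as "heart.csv" and read into a data frame named df (the name used in the report's comments); the specific plotting calls are assumptions, not the report's code.

```r
# Import the data (assumes the Kaggle file is saved as "heart.csv")
df <- read.csv("heart.csv", stringsAsFactors = TRUE)

# Clean the data: check for missing values (the report finds none)
colSums(is.na(df))
sum(is.na(df))

# Data structure overview
str(df)
summary(df)

# Count tables for the categorical variables
table(df$Sex)
table(df$ChestPainType)
table(df$FastingBS)
table(df$RestingECG)
table(df$ExerciseAngina)
table(df$ST_Slope)
table(df$HeartDisease)

# Histograms of the numeric variables
hist(df$Age,         main = "Age",         xlab = "years")
hist(df$RestingBP,   main = "RestingBP",   xlab = "mm Hg")
hist(df$Cholesterol, main = "Cholesterol", xlab = "mg/dl")
hist(df$MaxHR,       main = "MaxHR",       xlab = "beats per minute")
hist(df$Oldpeak,     main = "Oldpeak",     xlab = "ST depression")

# Barplots of the categorical variables split by HeartDisease
barplot(table(df$HeartDisease, df$Sex),            beside = TRUE, legend.text = TRUE, main = "Sex")
barplot(table(df$HeartDisease, df$ChestPainType),  beside = TRUE, legend.text = TRUE, main = "ChestPainType")
barplot(table(df$HeartDisease, df$ExerciseAngina), beside = TRUE, legend.text = TRUE, main = "ExerciseAngina")
barplot(table(df$HeartDisease, df$ST_Slope),       beside = TRUE, legend.text = TRUE, main = "ST_Slope")
```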
Model

a. Summary of the model

b. The result of the initial model with 11 variables

c. The result of the model after eliminating RestingECG from the initial model

d. The result of the model after eliminating RestingBP from the previous model

e. The result of the model after eliminating MaxHR from the previous model

Comments: These results show every step towards the optimized model. First, we start with the model containing all 11 variables (AIC = 577.83). Next, we remove one variable (RestingECG) from the initial model and obtain AIC = 574.82, and so on. After three elimination steps we stop at the model with the lowest AIC value, which keeps the remaining eight variables.

Summary of the optimized model

The optimal logistic regression model therefore has the form

$$\ln(\text{odds}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\mathrm{Age} + \dots + \beta_{11}\,\mathrm{ST\_SlopeUp}$$

with, among others,

$$\beta_0 = -1.895054, \qquad \beta_1 = 0.024295, \qquad \beta_{11} = -1.029488$$

so the fitted model is

$$\ln\left(\frac{p}{1-p}\right) = -1.895054 + 0.024295\,\mathrm{Age} + \dots - 1.029488\,\mathrm{ST\_SlopeUp}$$

$$p = \frac{e^{-1.895054 + 0.024295\,\mathrm{Age} + \dots - 1.029488\,\mathrm{ST\_SlopeUp}}}{1 + e^{-1.895054 + 0.024295\,\mathrm{Age} + \dots - 1.029488\,\mathrm{ST\_SlopeUp}}}$$

f. The prediction for the training set

- There are 61 patients who do not have heart disease but are mislabelled as having heart disease (type I errors).
- There are 48 patients who have heart disease but are mislabelled as not having heart disease (type II errors).

=> As shown above, the number of false predictions is relatively small compared to the number of correct predictions, which shows our model to be quite capable and accurate. The training set accuracy is 86.80%, which means these variables can predict heart disease fairly accurately.
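The report shows these modelling steps only as screenshots, so the sketch below is a hedged reconstruction of the workflow it describes: fit the full 11-variable model, perform backward elimination on AIC (the report removes RestingECG, RestingBP and MaxHR), then evaluate the optimized model on the training and testing sets. The 80/20 split proportion, the random seed and the 0.5 classification threshold are assumptions not stated in the extracted text, so the exact AIC values and error counts need not match the report's numbers.

```r
set.seed(123)                                   # assumed seed; the report does not state one

# Split into training and testing sets (split ratio assumed)
idx   <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]

# Initial model with all 11 predictors
full_model <- glm(HeartDisease ~ ., data = train, family = binomial)
summary(full_model)
AIC(full_model)

# Backward elimination by AIC; with the report's data this drops
# RestingECG, RestingBP and MaxHR, but results depend on the split
opt_model <- step(full_model, direction = "backward", trace = 1)
summary(opt_model)

# Predictions on the training set: probability > 0.5 labelled as heart disease
train_prob <- predict(opt_model, newdata = train, type = "response")
train_pred <- ifelse(train_prob > 0.5, 1, 0)
table(Predicted = train_pred, Actual = train$HeartDisease)      # confusion matrix
mean(train_pred == train$HeartDisease)                          # training accuracy

# Predictions on the testing set
test_prob <- predict(opt_model, newdata = test, type = "response")
test_pred <- ifelse(test_prob > 0.5, 1, 0)
table(Predicted = test_pred, Actual = test$HeartDisease)
mean(test_pred == test$HeartDisease)                            # testing accuracy
```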
g. The prediction for the testing set

- A small number of patients who do not have heart disease are mislabelled as having heart disease (type I errors).
- A small number of patients who have heart disease are mislabelled as not having heart disease (type II errors).

=> As shown above, the number of false predictions is relatively small compared to the number of correct predictions, which shows our model to be quite capable and accurate.

- The accuracy of predicting the testing set with our model is 90.59%, a very high number. Since our model has not seen this data beforehand, it can be concluded that the model can reliably be used to predict a completely new patient's heart disease condition, provided the required variables are supplied.

IV. References

1. Nguyễn Tiến Dũng (chủ biên), Nguyễn Đình Huy (2019), Xác suất - Thống kê & Phân tích số liệu.
2. J. Chambers, D. Hand, W. Härdle (series eds.), Introductory Statistics with R, Springer.
3. https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
4. https://www.geeksforgeeks.org/logistic-regression-in-r-programming/
