VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY (TRƯỜNG ĐẠI HỌC BÁCH KHOA)
PROBABILITY AND STATISTICS PROJECT REPORT
INSTRUCTOR: Dr. Nguyễn Tiến Dũng
CLASS: CC05
GROUP: 10
Nguyễn Hoàng Khánh 1952773
Ho Chi Minh City, 2021
Table of Contents
1.1 Data import
1.2 Data cleaning
1.3 Data visualization
1.4 Linear Regression models fitting
REFERENCES
This dataset addresses student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires.
Attribute Information:
sex: student's sex (binary: 'F' for female or 'M' for male)
age: student's age (numeric: from 15 to 22)
studytime: weekly study time (numeric: 1 for < 2 hours, 2 for 2 to 5 hours, 3 for 5 to 10 hours, 4 for > 10 hours)
failures: number of past class failures (numeric: n if 1 <= n < 3, else 4)
higher: wants to take higher education (binary: yes or no)
absences: number of school absences (numeric: from 0 to 93)
These grades are related to the course subject, Math or Portuguese:
G1: first period grade (numeric: from 0 to 20)
G2: second period grade (numeric: from 0 to 20)
G3: final grade (numeric: from 0 to 20, output target)
Steps:
1. Import data: grade.csv
2. Data cleaning: NA (not available)
3. Data visualization
(a) Transformation
(b) Descriptive statistics for each of the variables
(c) Graphs: hist, boxplot, pairs
4. Fitting linear regression models: We want to explore what factors may affect the final grade
1.1 Data import
Code:
### 1. Import data
grade <- read.csv("C:\\Users\\admin\\Desktop\\grade.csv")
attach(grade)
Explanation:
- grade <- read.csv("C:\\Users\\admin\\Desktop\\grade.csv"): reads the given CSV file and saves the data under the name "grade".
- attach(grade): attaches the data frame just read to the R search path, so that its columns can be referred to by name and all calculations are performed on the attached data "grade".
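The import step can be checked with a minimal sketch like the one below. The file written here is a hypothetical stand-in for grade.csv (made-up values, temporary path), used only so that the example runs on any machine; the names csv_path, mean_G3 and mean_G3_alt are illustrative.

```r
# Hypothetical stand-in for grade.csv, written to a temporary file
# (the values below are made up for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("G1,G2,G3,sex,age",
             "5,6,6,F,18",
             "7,NA,10,M,17",
             "15,14,15,F,15"), csv_path)

grade <- read.csv(csv_path)  # read the CSV file into a data frame
str(grade)                   # column names and types
dim(grade)                   # number of rows and columns

# Columns can also be reached without attach(), using $ or with()
mean_G3 <- mean(grade$G3)
mean_G3_alt <- with(grade, mean(G3))
```

Referring to columns with $ or with() is a possible alternative to attach(), since it avoids confusion when several data frames share the same column names.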
After that, we detach the data "grade" from the R system and attach the data "new_DF". All calculations will then be performed on the subset "new_DF" instead of the data "grade".
$G2
[1]   2   6   9  80 100
$G3
integer(0)
$sex
integer(0)
$studytime
integer(0)
$failures
integer(0)
$absences
integer(0)
$higher
integer(0)
$age
integer(0)
* Note: if we use the command is.na(new_DF) alone, the output is a long matrix of TRUE/FALSE values in which the missing values (NA) are hard to find, so we need to combine the is.na command with the apply command.
[Excerpt of the is.na(new_DF) output: a long matrix of logical values, almost all FALSE]
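As a rough illustration of combining is.na with apply, the sketch below locates missing values column by column. The data frame toy and its values are hypothetical, not the project data.

```r
# Toy data frame (hypothetical values) with two missing G2 entries
toy <- data.frame(G1 = c(5, 7, 15, 6),
                  G2 = c(6, NA, 14, NA),
                  G3 = c(6, 10, 15, 10))

# For every column (MARGIN = 2), return the row indices where is.na() is TRUE
na_rows <- apply(is.na(toy), 2, which)
na_rows

# A compact alternative: the number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts
```

Columns without missing values show up as integer(0), which matches the shape of the output listed above.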
The output shows that only the variable G2 contains missing values (NA), so we need a method to handle these data.
- Method 01: Mean/Mode/Median imputation: a method that fills in missing values (NA) with estimated values. The goal is to use known relationships that can be identified in the valid values of the dataset to support the estimates for the missing values. Mean/Mode/Median imputation is one of the most frequently used methods. There are two ways to use it:
• Generalized imputation: we calculate the mean or median of all non-missing values of the variable and then replace the missing values (NA) with that mean or median.
• Similar case imputation: we also calculate the mean or median, but separately for each group (subject) of non-missing values, and then replace each missing value (NA) with the mean or median of its own group.
- Method 02: Prediction model: create a predictive model to estimate the values that will replace the missing data. In this case, we split the dataset into two sets: one with no missing values for the variable and another with missing values. The first set becomes the training dataset of the model, while the second set is the test dataset, and the variable with the missing values is treated as the target variable. Next, we build a model that predicts the target variable from the other attributes of the training dataset and use it to fill in the missing values of the test dataset.
- Method 03: Deletion: this method is used when the probability of a value being missing is the same for all observations. It can be done in 2 ways: list-wise deletion and pair-wise deletion.
• List-wise deletion: delete every observation in which any variable is missing. The flaw of this method is that it removes the entire row whenever one value is missing, which reduces the sample size and therefore the power of the model.
• Pair-wise deletion: perform each analysis with all cases in which the variables of interest are present. The advantage of this approach is that it keeps many cases available for analysis. One of its disadvantages is that it uses different sample sizes for different variables.
- Method 04: Eliminate missing values (in case those missing values are not important to our data or the number of missing values is too small, less than about 3% of the total number of observations of a given variable).
- Method 05: Replace missing values with another value. The replacement value depends on the nature of the variable:
• If the variable with missing values is numeric: the missing values can be replaced with values such as 0, the median or the mean, depending on the case.
• If the variable with missing values is categorical: the missing values can be gathered into one group, named Missing.
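Some of the methods above (Method 01, Method 02 and the categorical case of Method 05) can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame toy, its values, the grouping by sex and the predictor G1 are hypothetical choices, not the treatment applied in this report.

```r
# Toy data (hypothetical values, not the project's grade.csv)
toy <- data.frame(sex    = c("F", "F", "M", "M", "F", "M"),
                  G1     = c(9, 11, 8, 12, 14, 10),
                  G2     = c(10, NA, 8, 12, 14, NA),
                  higher = c("yes", NA, "no", "yes", "yes", "no"),
                  stringsAsFactors = FALSE)
miss <- is.na(toy$G2)

# Method 01, generalized imputation: one median over all non-missing values
generalized <- toy$G2
generalized[miss] <- median(toy$G2, na.rm = TRUE)

# Method 01, similar case imputation: the mean is computed within each group
# (grouping by sex is an assumption made only for this example)
group_mean <- ave(toy$G2, toy$sex, FUN = function(x) mean(x, na.rm = TRUE))
similar_case <- ifelse(miss, group_mean, toy$G2)

# Method 02, prediction model: train on the complete rows, predict the rest
# (using G1 as the only predictor is an assumption)
fit <- lm(G2 ~ G1, data = toy[!miss, ])
predicted <- toy$G2
predicted[miss] <- predict(fit, newdata = toy[miss, ])

# Method 05, categorical variable: gather the missing cases in one group
higher_filled <- toy$higher
higher_filled[is.na(higher_filled)] <- "Missing"
higher_filled <- factor(higher_filled)
table(higher_filled)
```

In each case the original column is left untouched and the filled-in values are stored in a new vector, so the different methods can be compared side by side.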
=> In summary, R offers several ways of handling missing data, each with its own advantages and disadvantages. In this case, only 5 rows have missing values in the variable G2, while the data has 395 rows (about 1.3%), so the best solution is to delete those rows.
new_DF = na.omit(new_DF): deletes the rows containing missing values from the subset "new_DF".
attach(new_DF): tells R that from this command onward, all calculations are performed on the processed data "new_DF".
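The deletion step can be illustrated with the short sketch below. The data frame toy is hypothetical; only the figures 5 and 395 are taken from the text above.

```r
# Toy data frame (hypothetical values) with one incomplete row
toy <- data.frame(G1 = c(5, 7, 15, 6),
                  G2 = c(6, NA, 14, 10),
                  G3 = c(6, 10, 15, 10))

# Share of rows that contain at least one missing value
missing_share <- mean(!complete.cases(toy))

# Drop every row that contains a missing value
cleaned <- na.omit(toy)
nrow(toy)
nrow(cleaned)

# Percentage quoted in the text: 5 incomplete rows out of 395
pct <- round(100 * 5 / 395, 1)
pct
```

Checking the share of incomplete rows before calling na.omit makes it easy to confirm that deletion removes only a small part of the sample.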
1.3 Data visualization
a/ Theoretical section
A histogram is an approximate representation of the frequency distribution of numerical data. The range of values is divided (bucketed) into intervals, which are usually called bins or cells. It is recommended that the bins be of equal width to make the plot easier to read.
A scatterplot is a plot using Cartesian coordinates to show the relation between 2 variables x and y for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the x-axis and the value of the other determining the position on the y-axis. Usually, to make a prediction using a scatterplot, we look for correlations in the data. As we move from left to right on the graph, if the data points move up, it shows a positive correlation; if the data points move down, it shows a negative correlation; if they show no pattern, there is no correlation.
A boxplot is a method for graphically depicting groups of numerical data through their quartiles. Lines extending from the boxes (known as whiskers) indicate variability outside the upper and lower quartiles. Additionally, outliers, which are data points lying far from the box, may be plotted as individual points.
b/ Code and explanation
Step 1: We need to classify the variables into 2 types: continuous and categorical.
- Continuous variables are: age, G1, G2, G3. By using the command "as.numeric", we make sure that the imported data are converted from characters to numbers.
- Categorical variables are: sex, studytime, failures, higher. By using the command "factor", we categorize the data and store them as levels.
### 3. Data visualization
## Step 1: Classify the variables into 2 types
### Continuous variables
age = as.numeric(age)
G1 = as.numeric(G1)
G2 = as.numeric(G2)
G3 = as.numeric(G3)
### Categorical variables
studytime = as.factor(studytime)
failures = as.factor(failures)
higher = as.factor(higher)
absences = as.factor(absences)
sex = as.factor(sex)
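The two conversions can be tried on a small example. The sketch below uses a hypothetical data frame toy; it also shows one point worth keeping in mind when a factor is later turned back into numbers: as.numeric applied directly to a factor returns the internal level codes, not the original values.

```r
# Toy data (hypothetical values): age stored as text, studytime as numbers
toy <- data.frame(age = c("18", "17", "15"),
                  studytime = c(2, 2, 3),
                  stringsAsFactors = FALSE)

toy$age <- as.numeric(toy$age)             # characters -> numbers
toy$studytime <- as.factor(toy$studytime)  # numbers -> categories (levels)
levels(toy$studytime)

# Converting a factor back to numbers:
codes  <- as.numeric(toy$studytime)                # internal level codes
values <- as.numeric(as.character(toy$studytime))  # original values
codes
values
```

When the levels happen to be 1, 2, 3, ... the codes and the values coincide, but this is not guaranteed in general, so going through as.character is the safer route.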
Step 2: Now, the data is ready to be analyzed.
For continuous variables, we calculate 5 values for each column: mean, median, standard deviation, min and max. The command "apply" lets us perform the same action repeatedly on multiple chunks of data. This function takes the form apply(X, MARGIN, FUN), where X is the array or matrix of data to perform the function on; MARGIN = 1 means calculations on rows and MARGIN = 2 means calculations on columns; FUN is the function we want to use.
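## Step 2: Analyze data
### Continuous variables
mean = apply(new_DF[, c(1, 2, 3, 9)], 2, mean)
median = apply(new_DF[, c(1, 2, 3, 9)], 2, median)
sd = apply(new_DF[, c(1, 2, 3, 9)], 2, sd)
min = apply(new_DF[, c(1, 2, 3, 9)], 2, min)
max = apply(new_DF[, c(1, 2, 3, 9)], 2, max)
# combine the results into one table (cbind is assumed here)
descriptive_statistics = cbind(mean, median, sd, max, min)
descriptive_statistics
### Categorical variables
table(new_DF$studytime)
table(failures)
table(higher)
table(absences)
table(sex)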
Let's take an example:
mean = apply(new_DF[, c(1, 2, 3, 9)], 2, mean)
+ new_DF: the data frame containing the data on which we want to perform the function
+ [, c(1, 2, 3, 9)]: take all the rows but only columns 1, 2, 3 and 9, which hold the continuous variables (G1, G2, G3 and age); the remaining columns are categorical variables
+ 2 means MARGIN = 2: we do the calculations on the columns
+ mean: the function we want to use
The results are stored in the variable called "mean". The other statistics are computed in the same way, and all of them are gathered in "descriptive_statistics", which we then print to obtain the following table.
> descriptive_statistics
        mean median       sd max min
G1  10.92564     11 3.290886  19   3
G2  10.71795     11 3.737868  19   0
G3  10.41282     11 4.568962  20   0
age 16.70513     17 1.279751  22  15
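A table of this shape can also be built in a single call. The sketch below is one possible alternative, not the code used in this report: it applies sapply to a hypothetical data frame toy and avoids reusing function names such as mean or sd as variable names.

```r
# Toy data (hypothetical values) holding only continuous variables
toy <- data.frame(G1  = c(5, 7, 15, 6),
                  G3  = c(6, 10, 15, 10),
                  age = c(18, 17, 15, 16))

# Compute the five statistics for every column, then transpose so that
# each variable is a row and each statistic is a column
stats <- t(sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), max = max(x), min = min(x))
}))
stats
```

Because sapply works column by column on a data frame, no MARGIN argument is needed here.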
For categorical variables, we only need to observe the frequency of each category. The command "table" already gives us the results we want.
Firstly, we draw the histogram of G3 by using the command "hist".
## Step 3: Graphs
hist(new_DF$G3, xlab = "Final grade", main = "Histogram of G3", breaks = 0:20)
And we may get the following output:
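boxplot(G3~studytime,
main = "Boxplot of studytime considering G3", xlab = "study hours", col = "red")
boxplot(G3~failures,
main = "Boxplot of failures considering G3", xlab = "Number of failures",
col = "yellow")
boxplot(G3~higher,
main = "Boxplot of higher considering G3", xlab = "Students want to get higher education", col = "pink")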
Let's take the first boxplot to analyze how we use this function:
boxplot(G3 ~ studytime, main = "Boxplot of studytime considering G3",
xlab = "study hours", col = "red")
The variable before "~" is the dependent variable, whereas the variable after it is the independent (grouping) variable.
"xlab" is used to name the x-axis, while "main" is used to name the boxplot.
A boxplot shows the minimum value, first quartile, median, third quartile, maximum value and sometimes the outliers. Therefore, we can draw several conclusions from the following figures.
- Boxplot of G3 based on studying time
[Figure: Boxplot of studytime considering G3]
- Boxplot of G3 based on the number of failures
[Figure: Boxplot of failures considering G3; x-axis: Number of failures]
- Boxplot of G3 based on the fact that the students want to take higher education
[Figure: Boxplot of higher considering G3]
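The numbers that a boxplot draws can also be inspected directly. The following sketch uses a hypothetical vector of final grades (named G3 here only for illustration) and the base R functions fivenum and boxplot.stats; with the default coefficient of 1.5, points lying more than 1.5 times the box length beyond the hinges are reported as outliers.

```r
# Hypothetical final grades, made up for illustration
G3 <- c(0, 8, 9, 10, 10, 11, 11, 12, 13, 14, 20)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(G3)
five

# Statistics used to draw the box and whiskers, plus the outliers
bp <- boxplot.stats(G3)
bp$stats  # whisker ends, hinges and median (outliers excluded)
bp$out    # points that would be drawn individually
```

Comparing five with bp$stats shows how the whiskers stop at the most extreme points that are not outliers.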
Finally, we draw the scatterplots to determine the relationships between G3 and G1, G2, age and absences by using the command "pairs".
pairs(G3~G2,
col = "blue", pch = 20, main = "Scatterplot of G3 and G2")
pairs(G3~G1,
col = "purple", pch = 21, main = "Scatterplot of G3 and G1")
pairs(G3~age,
col = "orange", pch = 19, main = "Scatterplot of G3 and age")
pairs(G3~absences,
col = "green", pch = 20, main = "Scatterplot of G3 and absences")
The following are the figures that we need.
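The visual impression given by a scatterplot can be backed by a number. As a minimal sketch on hypothetical data (the data frame toy and its values are made up), the correlation coefficient returned by cor is positive when the points rise from left to right and negative when they fall.

```r
# Toy data (hypothetical values)
toy <- data.frame(G2  = c(6, 7, 14, 10, 12, 9),
                  G3  = c(6, 10, 15, 10, 13, 9),
                  age = c(18, 17, 15, 16, 16, 19))

# Correlation between the final grade and the second period grade
r_G2 <- cor(toy$G3, toy$G2)
r_G2

# Correlation matrix for all pairs of variables at once
cor_matrix <- cor(toy)
cor_matrix
```

On the same data, pairs(~ G3 + G2 + age, data = toy) would draw all the corresponding scatterplots in one figure.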
1.4 Linear Regression models fitting
a/ Theoretical section
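Given the dataset consisting of $y, x_1, x_2, \ldots, x_k$, the relation between the $x$'s and $y$ in terms of the coefficients $\beta$ should be determined. However, since the effect of the $x$'s on the output $y$ is not known in advance, we need to perform the hypothesis test known as the ANOVA test.
Hypotheses for ANOVA Test:
Null hypothesis $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$
Alternative hypothesis $H_1$: at least one $\beta_j \neq 0$
Rejection of $H_0$ implies that at least one of the regressor variables $x_1, x_2, \ldots, x_k$ contributes significantly to the model.
Test Statistics for ANOVA:
$F_0 = \dfrac{SS_R / k}{SS_E / (n - p)} = \dfrac{MS_R}{MS_E}$
where $n$ is the number of observations and $p = k + 1$ is the number of model parameters.
Hypotheses for the test on an individual regression coefficient:
Null hypothesis $H_0$: $\beta_j = \beta_{j0}$
Alternative hypothesis $H_1$: $\beta_j \neq \beta_{j0}$
The test statistic for this test is:
$T_0 = \dfrac{\hat{\beta}_j - \beta_{j0}}{\sqrt{\hat{\sigma}^2 C_{jj}}}$
where $C_{jj}$ is the diagonal element of $(X^\top X)^{-1}$ corresponding to $\hat{\beta}_j$.
The null hypothesis $H_0$ will be rejected if $|t_0| > t_{\alpha/2,\, n-p}$.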
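Both test statistics can be computed by hand and compared with what R reports. The sketch below does this on simulated data; the variables x1, x2 and y, the sample size and the 5% level are assumptions made only for this illustration.

```r
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)  # simulated response; x2 has no true effect

fit <- lm(y ~ x1 + x2)
k <- 2      # number of regressors
p <- k + 1  # number of parameters (intercept included)

# ANOVA F statistic: F0 = (SSR / k) / (SSE / (n - p))
SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
SSR <- SST - SSE
F0  <- (SSR / k) / (SSE / (n - p))
f_crit <- qf(0.95, k, n - p)  # critical value at the assumed 5% level

# t statistics for H0: beta_j = 0; vcov(fit) equals sigma_hat^2 * (X'X)^(-1)
t0 <- coef(fit) / sqrt(diag(vcov(fit)))
t_crit <- qt(0.975, n - p)    # two-sided critical value at the 5% level

F0 > f_crit       # TRUE means the overall H0 is rejected
abs(t0) > t_crit  # TRUE means the coefficient differs significantly from 0
```

The hand-computed values should agree with summary(fit)$fstatistic and with the "t value" column of summary(fit)$coefficients.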
b/ Code and explanation
Before fitting the linear regression models, we convert the categorical variables into numerical values: the binary variables by using "ifelse", and "studytime", "failures" and "absences" by using "as.numeric".
###4 Fitting linear regression models
##Convert binary data
higher <- ifelse(higher == "yes", 1, 0)
##Build linear models
##M1: G3 depends on all variables
M1 = lm(G3~G1+G2+studytime+failures+absences+higher+age+sex)
summary(M1)
print(summary(M1))
By using "print" and "summary", M1 is summarized as below.
> M1=lm(G3~G1+G2+studytime+failures+absences+higher+age+sex)
> print(summary(M1))
Call:
lm(formula = G3 ~ G1 + G2 + studytime + failures + absences +
    higher + age + sex)
Residuals:
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.896 on 381 degrees of freedom
Multiple R-squared: 0.8313, Adjusted R-squared: 0.8278
F-statistic: 234.7 on 8 and 381 DF, p-value: < 2.2e-16
We compare the p-value of each variable with a significance level α. If the p-value of a variable is sufficiently low (below α), we should keep that variable, since it contributes significantly to explaining the dependent variable. Otherwise, we should omit that variable.
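The comparison of p-values with α can be done programmatically. The sketch below is a rough illustration on simulated data: the data frame toy, the level alpha = 0.05 and the choice of age as the variable to drop are assumptions, not results from this report.

```r
set.seed(2)
n <- 40
toy <- data.frame(G2  = round(runif(n, 0, 20)),
                  age = sample(15:22, n, replace = TRUE))
toy$G3 <- toy$G2 + rnorm(n)  # simulated: G3 depends on G2 only

M <- lm(G3 ~ G2 + age, data = toy)

# p-value of each coefficient, taken from the summary table
p_values <- summary(M)$coefficients[, "Pr(>|t|)"]
p_values

# Variables whose p-value is below the assumed significance level
alpha <- 0.05
keep <- setdiff(names(p_values)[p_values < alpha], "(Intercept)")
keep

# Refit without one variable and compare the two nested models
reduced <- update(M, . ~ . - age)
anova(reduced, M)
```

The anova call tests whether the dropped variable improves the fit significantly, which gives a second check alongside the individual p-values.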