Probability And Statistics Report On Project.pdf



TRƯỜNG ĐẠI HỌC BÁCH KHOA, ĐẠI HỌC QUỐC GIA TP HỒ CHÍ MINH


PROBABILITY AND STATISTICS REPORT ON PROJECT

INSTRUCTOR: Dr. Nguyễn Tiến Dũng

CLASS: CC05

GROUP: 10

Nguyễn Hoàng Khánh 1952773

Ho Chi Minh City, 2021


Table of Contents

1.1 Data import

1.2 Data cleaning

1.3 Data visualization

1.4 Linear Regression models fitting

REFERENCES

This data set concerns student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Attribute information:

sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

age - student’s age (numeric: from 15 to 22)

studytime - weekly study time (numeric: 1 - < 2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, 4 - > 10 hours)

failures - number of past class failures (numeric: n if 1 <= n < 3, else 4)

higher - wants to take higher education (binary: yes or no)

absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Math or Portuguese:

G1 - first period grade (numeric: from 0 to 20)

G2 - second period grade (numeric: from 0 to 20)

G3 - final grade (numeric: from 0 to 20, output target)

Steps:

1. Import data: grade.csv

2. Data cleaning: NA (not available)

3. Data visualization

(a) Transformation

(b) Descriptive statistics for each of the variables

(c) Graphs: hist, boxplot, pairs


4. Fitting linear regression models: we want to explore which factors may affect the final grade.

1.1 Data import

Code:

### 1. Import data
grade <- read.csv("C:\\Users\\admin\\Desktop\\grade.csv")
attach(grade)

Explanation:

- grade <- read.csv("C:\\Users\\admin\\Desktop\\grade.csv"): reads the given file, which is in CSV format, and saves the data under the name “grade”.

- attach(grade): attaches the data just read to the R search path, so that all subsequent calculations are performed on the attached data “grade”.
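The import step can be tried without the original file. The following is a minimal sketch, assuming a hypothetical two-row stand-in for grade.csv written to a temporary file (the name grade_demo and the values are illustrative, not the report's data). It reaches columns through `$` rather than attach(), which is only a stylistic alternative.

```r
# Hypothetical stand-in for grade.csv, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("G1,G2,G3,sex", "5,6,6,F", "7,NA,10,M"), csv_path)

# read.csv returns a data frame; the literal NA is read as a missing value
grade_demo <- read.csv(csv_path)
str(grade_demo)

# Columns can be reached without attach() through the $ operator
print(grade_demo$G3)
```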


After that, we detach the data “grade” from the R system and attach the data “new_DF”. All subsequent calculations are performed on the sub-data “new_DF” instead of the data “grade”.

$G2
[1]   2   6   9  80 100

$G3
integer(0)

$sex
integer(0)

$studytime
integer(0)

$failures
integer(0)

$absences
integer(0)

$higher
integer(0)

$age
integer(0)

* Note: the raw output of is.na(new_DF) is a large logical matrix in which the not available (NA) positions are hard to spot, so we combine the is.na command with the apply command.

[Output of is.na(new_DF): a long logical matrix in which almost every entry is FALSE]
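The difference between the raw is.na output and the apply-based summary can be sketched as follows. The data frame toy_DF and its values are hypothetical; only the combination of is.na, apply and which comes from the report.

```r
# Toy data frame (hypothetical values) with missing entries only in G2
toy_DF <- data.frame(G1 = c(10, 12, 8, 15),
                     G2 = c(11, NA, 9, NA),
                     G3 = c(11, 13, 8, 16))

# is.na() alone gives a TRUE/FALSE matrix of the same size as the data
na_matrix <- is.na(toy_DF)

# apply(..., 2, which) lists, per column, the row numbers that hold NA
na_rows <- apply(na_matrix, 2, which)
print(na_rows)

# colSums() on the same matrix counts the NA entries per column
na_counts <- colSums(na_matrix)
print(na_counts)
```

For a real data set the list form is much shorter than the full logical matrix, because columns without missing values collapse to integer(0).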


The output shows that only the variable G2 contains not available (NA) values, so we need a method to handle these entries.

- Method 01: Mean/Mode/Median imputation: a method that fills in not available (NA) values with estimated values. The goal is to use known relationships that can be identified in the valid values of the dataset to support the estimates for the not available (NA) values. Mean/Mode/Median imputation is one of the most frequently used methods. There are two ways to use it:

• Generalized imputation: we calculate the mean or median of all non-missing values of the variable and then replace the not available (NA) values with that mean or median.

• Similar case imputation: we also calculate the mean or median, but separately for each group of non-missing values (for example, for each sex), and then replace each not available (NA) value with the mean or median of its group.
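The two imputation variants above can be sketched on a small example. This is only an illustration under assumptions: the data frame toy_DF is hypothetical, the mean is used as the fill value, and sex is chosen as the grouping variable for the similar case variant.

```r
# Toy data (hypothetical values): G2 has two missing entries
toy_DF <- data.frame(sex = c("F", "F", "M", "M", "F", "M"),
                     G2  = c(10, NA, 14, 12, 8, NA))

# Generalized imputation: one mean over all non-missing G2 values
overall_mean <- mean(toy_DF$G2, na.rm = TRUE)
G2_general <- ifelse(is.na(toy_DF$G2), overall_mean, toy_DF$G2)

# Similar case imputation: a separate mean for each group (here: sex)
group_mean <- ave(toy_DF$G2, toy_DF$sex,
                  FUN = function(v) mean(v, na.rm = TRUE))
G2_similar <- ifelse(is.na(toy_DF$G2), group_mean, toy_DF$G2)

print(G2_general)
print(G2_similar)
```

The two results differ only in the rows that were missing: the generalized variant fills both with the same value, while the similar case variant fills each with the mean of its own group.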

- Method 02: Prediction model: we create a predictive model to estimate the values that will replace the not available data. In this case, we split the dataset into two sets: one with no missing values for the variable and another with missing values. The first set becomes the training dataset of the model, the second set is the test dataset, and the variable with the missing values is treated as the target variable. Next, we build a model that predicts the target variable from the other attributes of the training dataset and use it to fill in the missing values of the test dataset.
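A minimal sketch of the prediction-model approach, assuming hypothetical data and a simple linear model of G2 on G1 as the predictor (the report does not say which model would be used):

```r
# Toy data (hypothetical values): G2 is missing in the last two rows
toy_DF <- data.frame(G1 = c(8, 10, 12, 14, 16, 9, 13),
                     G2 = c(9, 11, 13, 15, 17, NA, NA))

# Training set: rows where the target G2 is known
train_set <- toy_DF[!is.na(toy_DF$G2), ]
# Test set: rows where G2 has to be estimated
test_set <- toy_DF[is.na(toy_DF$G2), ]

# Fit a simple linear model of G2 on G1 using the complete rows
fit <- lm(G2 ~ G1, data = train_set)

# Predict the missing G2 values and write them back into the data
toy_DF$G2[is.na(toy_DF$G2)] <- predict(fit, newdata = test_set)
print(toy_DF)
```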

- Method 03: Deletion: this method is used when the probability of a value being missing is the same for all observations. It can be done in 2 ways: list-wise deletion and pair-wise deletion.

• List-wise deletion: we delete every observation in which any variable is missing. The flaw of this method is that removing entire rows reduces the sample size and therefore the power of the model.


• Pair-wise deletion: we perform each analysis with all cases in which the variables of interest are present. The advantage of this approach is that it keeps as many cases as possible available for analysis. One disadvantage is that it uses different sample sizes for different variables.

- Method 04: Eliminate missing values (when those missing values are not important to our data or their number is very small, less than about 3% of the total number of observations of a given variable).

- Method 05: Replace missing values with another value. The replacement value depends on the nature of the variable:

• If the variable with missing values is numeric: the missing values can be replaced with values such as 0, the median or the mean, depending on the case.

• If the variable with missing values is categorical: the missing values can be grouped into one separate category named “Missing”.

=> In summary, R offers several ways of handling missing data, each with its own advantages and disadvantages. In this case, only 5 rows have a not available value in the variable G2, while the data has 395 rows (about 1.3%), so the best treatment is to delete those rows.

new_DF = na.omit(new_DF): deletes the rows containing not available data from the sub-data “new_DF”.


attach(new_DF): tells R that from this command onwards all calculations are performed on the processed data “new_DF”.
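The deletion step can be sketched on a small example. The data frame toy_DF is hypothetical; the sketch first measures the share of incomplete rows (the quantity that is about 1.3% in the report) and then removes them with na.omit.

```r
# Toy data (hypothetical values) with one incomplete row
toy_DF <- data.frame(G1 = c(10, 12, 8, 15),
                     G2 = c(11, NA, 9, 14),
                     G3 = c(11, 13, 8, 16))

# Share of rows with at least one NA, in percent
missing_pct <- 100 * mean(!complete.cases(toy_DF))

# na.omit() drops every row that contains an NA
clean_DF <- na.omit(toy_DF)

print(missing_pct)
print(nrow(clean_DF))
```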

1.3 Data visualization

a/ Theoretical section

A histogram is an approximate representation of the frequency distribution of numerical data. The range of values is divided into intervals, which are usually called bins or cells. It is recommended that the bins be of equal width to improve the data visualization.

A scatterplot is a plot using Cartesian coordinates to show the relation between 2 variables x and y for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the x-axis and the value of the other determining the position on the y-axis. Usually, to make a prediction using a scatterplot, we look for correlations in the data. As we move from left to right on the graph, if the data points move up, there is a positive correlation; if the data points move down, there is a negative correlation; if they show no pattern, there is no correlation at all.

A boxplot is a method for graphically depicting groups of numerical data through their quartiles. Lines extending from the boxes (known as whiskers) indicate variability outside the upper and lower quartiles. Additionally, outliers, which are data points far from the box, may be plotted as individual points.
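The binning behind a histogram can be inspected without drawing it. The following sketch assumes a hypothetical vector of grades (g3_demo) and equal-width bins of width 1 from 0 to 20; hist with plot = FALSE returns the bin boundaries and the counts.

```r
# Hypothetical final grades on the 0 to 20 scale
g3_demo <- c(0, 5, 8, 10, 10, 11, 12, 15, 20)

# Equal-width bins of width 1; plot = FALSE returns the binning only
h <- hist(g3_demo, breaks = 0:20, plot = FALSE)

# One count per bin; the counts add up to the number of observations
print(h$breaks)
print(h$counts)
```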

b/ Code and explanation


Step 1: We need to classify the variables into 2 types: continuous and categorical.

- Continuous variables are: age, G1, G2, G3. By using the command “as.numeric”, we make sure that the imported data are converted from characters to numbers.

- Categorical variables are: sex, studytime, failures, higher and absences. By using the command “as.factor”, we categorize the data and store them as levels.

## Data visualization
## Step 1: Classify the variables into 2 types
### Continuous variables
age = as.numeric(age)
G1 = as.numeric(G1)
G2 = as.numeric(G2)
G3 = as.numeric(G3)
### Categorical variables
studytime = as.factor(studytime)
failures = as.factor(failures)
higher = as.factor(higher)
absences = as.factor(absences)
sex = as.factor(sex)

Step 2: Now, the data is ready to be analyzed.

For continuous variables, we calculate 5 values for each column: mean, median, standard deviation, min and max. The command “apply” lets us perform the same action repeatedly on multiple chunks of data. This function takes the form apply(X, MARGIN, FUN), where X is the array or matrix of data to perform the function on; MARGIN = 1 means calculations on rows and MARGIN = 2 means calculations on columns; FUN is the function we want to use.


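## Step 2: Analyze data
### Continuous variables
mean = apply(new_DF[, c(1, 2, 3, 9)], 2, mean)
median = apply(new_DF[, c(1, 2, 3, 9)], 2, median)
sd = apply(new_DF[, c(1, 2, 3, 9)], 2, sd)
min = apply(new_DF[, c(1, 2, 3, 9)], 2, min)
max = apply(new_DF[, c(1, 2, 3, 9)], 2, max)
descriptive_statistics = cbind(mean, median, sd, max, min)
descriptive_statistics
### Categorical variables
table(new_DF$studytime)
table(failures)
table(higher)
table(absences)
table(sex)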

Let’s take an example:

mean = apply(new_DF[, c(1, 2, 3, 9)], 2, mean)

+ new_DF: the data frame containing the data on which we want to perform the function

+ [, c(1, 2, 3, 9)]: take all the rows of columns 1, 2, 3 and 9 only, which hold the continuous variables; the remaining columns are categorical

+ 2 means MARGIN = 2: we do the calculations for the columns

+ mean: the function we want to use

The results are stored in the variable called “mean”. The five results are then combined into “descriptive_statistics”, which we print to obtain the following table:

> descriptive_statistics
        mean median       sd max min
G1  10.92564     11 3.290886  19   3
G2  10.71795     11 3.737868  19   0
G3  10.41282     11 4.568962  20   0
age 16.70513     17 1.279751  22  15

For categorical variables, we only need to observe the frequency of each category. The command “table” already gives us these results.
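The column-wise summaries and the frequency table can be reproduced on a small example. This sketch uses a hypothetical data frame (toy_DF) and selects the numeric columns by name rather than by position; the results are stored under names that do not shadow the base functions mean, median, sd, max and min.

```r
# Toy data (hypothetical values): two numeric columns, one categorical
toy_DF <- data.frame(G1 = c(10, 12, 8, 14),
                     G3 = c(11, 13, 7, 15),
                     higher = c("yes", "yes", "no", "yes"))

num_cols <- c("G1", "G3")

# MARGIN = 2 applies each function column by column
stats_table <- cbind(mean   = apply(toy_DF[, num_cols], 2, mean),
                     median = apply(toy_DF[, num_cols], 2, median),
                     sd     = apply(toy_DF[, num_cols], 2, sd),
                     max    = apply(toy_DF[, num_cols], 2, max),
                     min    = apply(toy_DF[, num_cols], 2, min))
print(stats_table)

# Frequency of each category of a categorical variable
higher_counts <- table(toy_DF$higher)
print(higher_counts)
```

Selecting columns by name is a design choice of this sketch: it keeps working if the column order of the data frame changes, whereas positions such as c(1, 2, 3, 9) do not.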


First, we draw the histogram of G3 by using the command “hist”.

## Step 3: Graphs
hist(new_DF$G3, xlab = "Final grade", main = "Histogram of G3", breaks = 0:20)

This produces the following output:


boxplot(G3~failures,
main = "Boxplot of failures considering G3", xlab = "Number of failures",
col = "yellow")
boxplot(G3~higher,
main = "Boxplot of higher considering G3", xlab = "Students want to get higher education", col = "pink")

Let’s take the first boxplot to analyze how we use this function:

boxplot(G3 ~ studytime, main = "Boxplot of studytime considering G3",
xlab = "study hours", col = "red")

The variable before “~” (here G3) is the dependent variable, whereas the variable after it (here studytime) is the independent variable.

“xlab” is used to name the x-axis, while “main” is used to set the title of the boxplot.

A boxplot shows the minimum value, first quartile, median, third quartile, maximum value and sometimes the outliers. Therefore, we can draw several conclusions from the following figures.

- Boxplot of G3 based on study time

[Figure: Boxplot of studytime considering G3]

- Boxplot of G3 based on the number of failures

[Figure: Boxplot of failures considering G3; x-axis: Number of failures]

- Boxplot of G3 based on whether the students want to take higher education

[Figure: Boxplot of higher considering G3]
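The numbers that a boxplot is drawn from can also be obtained directly. The sketch below assumes a hypothetical vector of grades (g3_demo) and uses boxplot.stats with its default rule, under which points more than 1.5 times the box length away from the box are treated as outliers.

```r
# Hypothetical final grades for one group of students
g3_demo <- c(0, 8, 9, 10, 10, 11, 12, 13, 14)

# boxplot.stats() returns the values a boxplot is drawn from
bp <- boxplot.stats(g3_demo)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(bp$stats)

# Points beyond the whiskers, which a boxplot draws individually
print(bp$out)
```

When outliers are present, the whisker ends are the most extreme non-outlying values rather than the overall minimum and maximum.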


Finally, we draw scatterplots to examine the relationships between G3 and each of G1, G2, age and absences by using the command “pairs”.

pairs(G3~G2, col = "blue", pch = 20, main = "Scatterplot of G3 and G2")
pairs(G3~G1, col = "purple", pch = 21, main = "Scatterplot of G3 and G1")
pairs(G3~age, col = "orange", pch = 19, main = "Scatterplot of G3 and age")
pairs(G3~absences, col = "green", pch = 20, main = "Scatterplot of G3 and absences")

The following are the figures we need.
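The direction of a relationship seen in a scatterplot can be checked numerically with the correlation coefficient. This is a small sketch on hypothetical vectors (G2_demo, G3_demo, absences_demo), not on the report's data: a positive value corresponds to points rising from left to right, a negative value to points falling.

```r
# Hypothetical data: G3 rises with G2 and falls with absences
G2_demo       <- c(6, 8, 10, 12, 14, 16)
G3_demo       <- c(5, 9, 10, 13, 14, 17)
absences_demo <- c(20, 14, 10, 8, 3, 0)

# Pearson correlation: the sign gives the direction of the relationship
r_pos <- cor(G2_demo, G3_demo)
r_neg <- cor(absences_demo, G3_demo)
print(c(r_pos, r_neg))
```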


1.4 Linear Regression models fitting

Given a dataset consisting of $y, x_1, x_2, \dots, x_k$, the relation between the $x$'s and $y$ in terms of the coefficients $\beta$ should be determined. However, since the effect of the $x$'s on the output $y$ is not known in advance, we need to perform the hypothesis test known as the ANOVA test.


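Null hypothesis $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$

Alternative hypothesis $H_1$: at least one $\beta_j \neq 0$. Rejection of $H_0$ implies that at least one of the regressor variables $x_1, x_2, \dots, x_k$ contributes significantly to the model.

Test statistic for ANOVA:

$$F_0 = \frac{SS_R/k}{SS_E/(n-p)} = \frac{MS_R}{MS_E}$$

where $n$ is the number of observations and $p = k + 1$ is the number of estimated coefficients.

For a test on an individual regression coefficient:

Null hypothesis $H_0: \beta_j = \beta_{j0}$

Alternative hypothesis $H_1: \beta_j \neq \beta_{j0}$. The test statistic for this test is:

$$T_0 = \frac{\hat{\beta}_j - \beta_{j0}}{\sqrt{\hat{\sigma}^2 C_{jj}}} = \frac{\hat{\beta}_j - \beta_{j0}}{se(\hat{\beta}_j)}$$

where $C_{jj}$ is the diagonal element of $(X'X)^{-1}$ corresponding to $\hat{\beta}_j$.

The null hypothesis $H_0$ will be rejected if $|t_0| > t_{\alpha/2,\, n-p}$.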
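Both test statistics can be computed by hand and compared with what summary reports for a fitted model. The following sketch assumes a small hypothetical data set (toy) with two regressors, takes $p = k + 1$, and tests $H_0: \beta_j = 0$ at an assumed level $\alpha = 0.05$.

```r
# Hypothetical data: y is close to 2 * x1, x2 carries no information
toy <- data.frame(x1 = 1:8,
                  x2 = c(1, -1, -1, 1, 1, -1, -1, 1),
                  y  = c(2.1, 4.1, 5.9, 7.9, 9.9, 11.9, 14.1, 16.1))
fit <- lm(y ~ x1 + x2, data = toy)

n <- nrow(toy)   # number of observations
k <- 2           # number of regressors
p <- k + 1       # number of estimated coefficients

# Sums of squares for error and for regression
SS_E <- sum(residuals(fit)^2)
SS_R <- sum((fitted(fit) - mean(toy$y))^2)

# F statistic for significance of regression
F0 <- (SS_R / k) / (SS_E / (n - p))

# t statistics for H0: beta_j = 0, i.e. estimate / standard error
coefs <- summary(fit)$coefficients
t0 <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided critical value at alpha = 0.05
t_crit <- qt(1 - 0.05 / 2, df = n - p)
print(F0)
print(t0)
print(t_crit)
```

The hand-computed values agree with the "F-statistic" line and the "t value" column printed by summary(fit).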

b/ Code and explanation

Before fitting the linear regression models, we convert the categorical variables into numerical values: the binary variables by using “ifelse”, and “studytime”, “failures” and “absences” by using “as.numeric”.


### 4. Fitting linear regression models
## Convert binary data
higher <- ifelse(higher == "yes", 1, 0)
## Build linear models
## M1: G3 depends on all variables
M1 = lm(G3~G1+G2+studytime+failures+absences+higher+age+sex)
summary(M1)
print(summary(M1))

By using “print” and “summary”, M1 is summarized as below.

> M1=lm(G3~G1+G2+studytime+failures+absences+higher+age+sex)
> print(summary(M1))

Call:
lm(formula = G3 ~ G1 + G2 + studytime + failures + absences +
    higher + age + sex)

Residuals:

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.896 on 381 degrees of freedom
Multiple R-squared: 0.8313, Adjusted R-squared: 0.8278
F-statistic: 234.7 on 8 and 381 DF, p-value: < 2.2e-16

Given a significance level α, if the p-value of a variable is below α, we keep that variable since it contributes significantly to explaining the dependent variable. Otherwise, we omit that variable.
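The selection rule above can be sketched in code. The data set (toy), the level α = 0.05 and the names keep and reduced_fit are assumptions for illustration; the report's own model M1 and its coefficient table are not reproduced here.

```r
# Hypothetical data: y is close to 2 * x1, x2 carries no information
toy <- data.frame(x1 = 1:8,
                  x2 = c(1, -1, -1, 1, 1, -1, -1, 1),
                  y  = c(2.1, 4.1, 5.9, 7.9, 9.9, 11.9, 14.1, 16.1))
fit <- lm(y ~ x1 + x2, data = toy)

alpha <- 0.05

# p-values of the regressors (the intercept row is dropped)
p_values <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
print(p_values)

# Keep only the regressors whose p-value is below alpha
keep <- names(p_values)[p_values < alpha]
print(keep)

# Refit the model with the retained regressors only
reduced_fit <- lm(reformulate(keep, response = "y"), data = toy)
print(coef(reduced_fit))
```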


Title	Probability And Statistics Report On Project
Authors	Trần Xuân Trực, Nguyễn Hoàng Khánh, Phạm Cường Nghị, Châu Vĩnh An
Instructor	Dr. Nguyễn Tiến Dũng
University	Trường Đại Học Bách Khoa, Đại Học Quốc Gia TP Hồ Chí Minh
Major	Probability and Statistics
Type	Project
Year of publication	2021
City	Ho Chi Minh City
