Report CC03 Group 07: major assignment

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

REPORT TOPIC 2 Instructor: Phan Thi Huong

Class: CC003

prediction

and data visualizing

I Theoretical Basis

1 Definition

• Arithmetic mean: The arithmetic mean of a set of numbers $x_1, x_2, \ldots, x_n$ is their sum divided by the number of observations, or

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The arithmetic mean is usually denoted by $\bar{x}$ and is often called the average.

• Median: The sample median is a measure of central tendency that divides the data into two equal parts, half below the median and half above. If the number of observations is even, the median is halfway between the two central values.

• Standard deviation: Standard deviation is a statistic that measures the dispersion of a dataset relative to its mean. It is calculated as the square root of the variance, which is obtained from each data point's deviation from the mean.

• The minimum: The minimum is the smallest value in the data set.

• The maximum: The maximum is the largest value in the data set.

• Boxplot: The boxplot is a graphical display that simultaneously describes several important features of the data set, such as center, spread, departure from symmetry, and identification of unusual observations or outliers.

• Pair plot: A pair plot is used to find the set of features that best explains a relationship between two variables or forms the most separated clusters. It also helps to build simple classification models by drawing lines that linearly separate the data set.

• Linear regression: Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.
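The definitions of the mean, median and standard deviation can be checked by hand on a small sample. The following is a minimal sketch in R; the vector `x` is a hypothetical sample chosen for illustration, and the standard deviation uses the n - 1 denominator, which is what R's `sd()` computes.

```r
# Hypothetical sample of grades, used only for illustration.
x <- c(10, 12, 8, 15, 11, 14)

n <- length(x)
x_bar <- sum(x) / n                        # arithmetic mean
s <- sqrt(sum((x - x_bar)^2) / (n - 1))    # sample standard deviation

# The manual results agree with R's built-in functions.
stopifnot(isTRUE(all.equal(x_bar, mean(x))))
stopifnot(isTRUE(all.equal(s, sd(x))))

# Even number of observations: the median is halfway
# between the two central values (11 and 12 here).
sorted <- sort(x)
med <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
stopifnot(med == median(x))

print(c(mean = x_bar, sd = s, median = med, min = min(x), max = max(x)))
```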

II Codes used in R

1) read_csv( ): reads a file with the .csv extension into R (from the readr package)

2) head( ): displays the first n rows of the input data frame

3) which( ): returns the positions of the elements that satisfy a given condition

4) sum( ): returns the sum of all the values present in its arguments

5) na.omit( ): removes all incomplete cases (rows with missing values) from a data object

6) is.na( ): checks whether each value is missing (not available)

7) median( ): calculates the median of the data set

8) mean( ): finds the mean of the data set

9) max( ): determines the maximum value

10) min( ): determines the minimum value

11) sd( ): calculates the standard deviation

12) table( ): tabulates categorical data, giving each value of the variable and its frequency

13) hist( ): computes and draws the histogram of the given data values

14) boxplot( ): plots the boxplot of the data

15) pairs( ): returns a plot matrix consisting of scatter plots for each pair of variables in the data frame

16) View( ): opens the data set in a spreadsheet-style viewer

17) lm( ): fits a linear model

18) summary( ): lists the calculated values of the model
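As a rough illustration of how several of these functions combine, here is a minimal sketch in R. The data frame `toy` is hypothetical and only mimics two columns of the report's data; it is not the data set used in this report.

```r
# Hypothetical toy data frame; the column names mimic the report's data.
toy <- data.frame(
  sex = c("F", "M", "F", "M", "F"),
  G3  = c(10, NA, 14, 0, 12)
)

head(toy, 3)                            # first 3 rows
missing_rows <- which(is.na(toy$G3))    # positions of missing G3 values
n_missing <- sum(is.na(toy))            # total number of missing cells

clean <- na.omit(toy)                   # drop rows with any missing value
sex_counts <- table(clean$sex)          # frequency of each category

print(missing_rows)
print(sex_counts)
print(c(mean = mean(clean$G3), sd = sd(clean$G3)))
```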

III Activity 1

1.1 Topic

This data set addresses student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires.

Attribute Information:

• sex - student’s sex (binary: ’F’ - female or ’M’ - male)

• age - student’s age (numeric: from 15 to 22)

• studytime - weekly study time (1: < 2 hours, 2: 2 to 5 hours, 3: 5 to 10 hours, or 4: > 10 hours)

• failures - number of past class failures (numeric: n if 1 ≤ n < 3, else 4)

• higher - wants to take higher education (binary: yes or no)

• absences - number of school absences (numeric: from 0 to 93)

# these grades are related with the course subject, Math or Portuguese:

• G1 - first period grade (numeric: from 0 to 20)

• G2 - second period grade (numeric: from 0 to 20)

• G3 - final grade (numeric: from 0 to 20, output target)

- Creating a new file containing only the key variables given in the topic, saving it as “new_grade”, and checking the first 3 rows of the new file

Code:

new_grade<-grade[,c("sex","age","studytime","failures","higher","absences","G1","G2","G3")]

View(new_grade)

head(new_grade,3)

Result:

Figure 2: Creating the new file “new_grade”.

- Checking for missing data and calculating the proportion it accounts for in the total data

Since the missing values account for about 1.27% of the total data, we can eliminate them without concern that doing so will significantly affect the statistics of the total data.
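One possible way to compute such a proportion is sketched below in R. The data frame `toy` is hypothetical, and whether the report's figure was computed per cell or per row is an assumption, so both variants are shown.

```r
# Hypothetical toy data; the real `new_grade` comes from the report's CSV file.
toy <- data.frame(
  age = c(15, 16, NA, 17, 18),
  G1  = c(10, 12, 9, NA, 14),
  G3  = c(11, 13, 8, 10, 15)
)

# Share of missing cells among all cells.
cell_share <- sum(is.na(toy)) / (nrow(toy) * ncol(toy))

# Share of rows that contain at least one missing value.
row_share <- mean(!complete.cases(toy))

# Missing values per column.
per_column <- colSums(is.na(toy))

print(round(100 * c(cells = cell_share, rows = row_share), 2))
print(per_column)
```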

- Eliminating missing values and checking the first 3 rows of the new file

Code:

new_grade<-na.omit(new_grade)

head(new_grade,3)

Result:

1.2.3 Data visualization

a Descriptive statistics for each variable

- For the quantitative variables, we calculate the mean, standard deviation, median, Min and Max values, and the first and third quartiles, and collect them in a table named “info”

Code:

mean<-apply(new_grade[,c("age","absences","G1","G2","G3")],2,mean)
standard_deviation<-apply(new_grade[,c("age","absences","G1","G2","G3")],2,sd)
median<-apply(new_grade[,c("age","absences","G1","G2","G3")],2,median)
Min<-apply(new_grade[,c("age","absences","G1","G2","G3")],2,min)
Max<-apply(new_grade[,c("age","absences","G1","G2","G3")],2,max)
Q1<-apply(new_grade[,c("age","absences","G1","G2","G3")],2,quantile,probs=0.25)
Q3<-apply(new_grade[,c("age","absences","G1","G2","G3")],2,quantile,probs=0.75)
info<-t(data.frame(mean,standard_deviation,median,Min,Max,Q1,Q3))

Result:

Figure 5: table of quantitative variables

- For the qualitative variables, we build a frequency table for each variable

i Table for the variable “sex”

Code:

table(new_grade$sex)

Result:

Comment: The number of female students in the sample is higher than that of male students

ii Table for the variable “higher”

Code:

table(new_grade$higher)

Result:

Figure 7: table of the students’ idea.

Comment: The number of students who want to have a higher education is significantly higher than that of those who do not want a higher education

iii Table for the variable “studytime”

Code:

table(new_grade$studytime)

Result:

Figure 8: table of study time.

Comment: The largest group is the students who study about 2-5 hours a week (194 students), while the smallest group is the students who study more than 10 hours a week (27 students)

iv Table for the variable “failures”

Code:

table(new_grade$failures)

Result:

Figure 9: table of failing grade of each course.

Comment: The largest group is the students who have never failed (304 students), while the smallest group is the students with more than 3 past class failures (16 students)

Figure 10: histogram of final grade “G3”.

Comment: The graph shows that the students' final grades are concentrated between 6 and 16 points. The most populated interval is 8 to 10 points, with 84 students, and the least populated is 2 to 4 points, with at most two students. The distribution is irregular: 38 students, a sizable share, fall between 0 and 2 points, which makes it difficult to build a regression model
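A histogram of this kind could be produced along the following lines. This is a minimal sketch in R: the vector `g3` is hypothetical, and the bin width of 2 points is an assumption inferred from the intervals quoted in the comment, not a setting confirmed by the report.

```r
# Hypothetical final grades on the 0-20 scale, for illustration only.
g3 <- c(0, 0, 5, 8, 9, 9, 10, 10, 11, 12, 13, 14, 15, 16, 18)

# Bins of width 2 from 0 to 20; by default each bin is (a, b],
# and include.lowest = TRUE puts the value 0 in the first bin.
h <- hist(g3, breaks = seq(0, 20, by = 2), include.lowest = TRUE,
          main = "Histogram of G3", xlab = "G3", col = "pink")

# Counts per bin, the numbers one reads off the bars.
bin_counts <- setNames(h$counts,
                       paste(head(h$breaks, -1), h$breaks[-1], sep = "-"))
print(bin_counts)
```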

- We plot the boxplot graphs of variable “G3” relative to each qualitative variable

i Boxplot graph of variable “G3” relative to variable “sex”

Code:

boxplot(new_grade$G3~new_grade$sex,col="pink",xlab="sex",ylab="G3")

Result:

Figure 11: boxplot graph representing the distribution of Gender for G3.

Comment:

- Group of female students:

• The highest final grade is 19 points

• The lowest final grade is 0 points

• 25% of students have a final grade less than 8

• 75% of students have a final grade of less than 14

- Group of male students:

Result:

Study time for G3.

Comment:

- Group of students with less than 2 hours of self-study per week

• 25% of students have a final grade of 8 or less

- Group of students with 2-5 hours of self-study per week

- Group of students with 5-10 hours of self-study per week

- Group of students with more than 10 hours of self-study per week

• 75% of students have a final grade of 14.5 or less

Conclusion:

It can be predicted that the group with less than 2 hours of self-study time per week has worse test results than the other groups, because its test scores are distributed over a lower range. The group with 5-10 hours of self-study time per week outperforms the other groups, because its test scores are distributed over a higher range
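The quartiles quoted in comments like these can also be computed numerically instead of being read off the plot. The sketch below assumes a hypothetical data frame `toy` with the report's `studytime` coding; note that `quantile()` and the hinges drawn by `boxplot()` may differ slightly for small groups.

```r
# Hypothetical toy data; studytime uses the report's coding 1-4.
toy <- data.frame(
  studytime = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  G3        = c(0, 6, 8, 10, 8, 10, 12, 14, 10, 12, 14, 16)
)

# Five-number summary of G3 within each studytime group.
by_group <- tapply(toy$G3, toy$studytime, quantile,
                   probs = c(0, 0.25, 0.5, 0.75, 1))
print(by_group)

# Group medians, useful for a quick comparison between groups.
medians <- tapply(toy$G3, toy$studytime, median)
print(medians)
```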

iii Boxplot graph of variable “G3” relative to variable “failures”

Code:

boxplot(new_grade$G3~new_grade$failures,col="pink",xlab="failures",ylab="G3")

Result:

failures for G3.

Comment:

- The group of students who failed the subject once

- The group of students who failed the subject twice

- The group of students who failed the subject 3 times

• 25% of students have a final exam score of 0

- The group of students who failed the subject 4 or more times

• 75% of students have a final grade of 10.5 or less

Conclusion:

It can be predicted that the group that failed the subject once has higher test results than the remaining groups, because its test scores are distributed over a higher range. The group that failed the subject 4 or more times has lower test results than the remaining groups, because its test scores are distributed over a lower range. This shows that the more times a student fails the course, the lower the final score tends to be

iv Boxplot graph of variable “G3” relative to variable “higher”

Code:

boxplot(new_grade$G3~new_grade$higher,col="pink",xlab="higher",ylab="G3")

Result:

Higher for G3.

Comment:

- The group of students who want to take higher education:

• 25% of students have a final exam score of 8 or less

- The group of students who do not want to take higher education:

Conclusion:

The test scores of students who want to take higher education are higher than those of the remaining students, so we can predict that their final exam scores will also be higher

- We plot the pair graphs of the variable “G3” relative to each quantitative variable

Comment: In general, we can conclude that the variable “G3” has a linear relationship with the variable “G1”

ii The pair graph of the variable “G3” relative to the variable “G2”

Code:

pairs(G3~G2,data=new_grade,main="Do thi")

Result:

Figure 16: pair graph representing the distribution of G2 for G3.

Comment: In general, we can conclude that the variable “G3” has a linear relationship with the variable “G2”

iii The pair graph of the variable “G3” relative to the variable “absences”

Comment: In general, we can conclude that the variable “G3” does not have a linear relationship with the variable “absences”

iv The pair graph of the variable “G3” relative to the variable “age”

linear relationship with the variable “age”
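A visual judgement from a pair plot can be complemented with a correlation coefficient. The sketch below uses a hypothetical data frame `toy` in which `G1` tracks `G3` closely and `absences` does not; the numbers are made up for illustration and are not the report's results.

```r
# Hypothetical toy data: G1 tracks G3 closely, absences does not.
toy <- data.frame(
  G1       = c(5, 8, 10, 12, 15, 18),
  absences = c(4, 0, 10, 2, 6, 1),
  G3       = c(6, 8, 11, 12, 16, 18)
)

# Pearson correlation of G3 with each candidate predictor.
r <- cor(toy[, c("G1", "absences")], toy$G3)
print(round(r, 3))

# Scatter plot matrix of all three variables, as in a pair plot.
pairs(toy, main = "Pair plot of toy data")
```

A coefficient close to 1 or -1 supports a linear relationship, while a value near 0 suggests there is none.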

1.2.4 Fitting linear regression model

- We built the linear regression model with:

+ The dependent variable: G3

+ The independent variables: G1; G2; sex; age; studytime; failures; higher; absences

linear_lm<-lm(G3~sex+age+studytime+failures+higher+absences+G1+G2,data=new_grade)

The linear regression model:

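$$
\begin{aligned}
G3 = {} & 0.61310 + 0.19679 \times \text{sex} - 0.15235 \times \text{age} - 0.13924 \times \text{studytime} \\
& - 0.19862 \times \text{failures} + 0.26384 \times \text{higher} + 0.04208 \times \text{absences} \\
& + 0.16637 \times \text{G1} + 0.96039 \times \text{G2}
\end{aligned}
$$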
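An equation of this form is read directly from the coefficients of the fitted model. The following is a minimal sketch in R on simulated data (the data frame `toy` and the hypothetical student are assumptions, not the report's data) showing how `coef()` yields the coefficients and how a prediction computed by hand matches `predict()`. For categorical inputs such as sex and higher, `lm()` fits 0/1 indicator variables, so the corresponding terms apply to one category only.

```r
# Simulated data, for illustration only; not the report's data set.
set.seed(1)
n  <- 50
G1 <- round(runif(n, 0, 20))
G2 <- round(pmin(20, pmax(0, G1 + rnorm(n))))
G3 <- 0.2 * G1 + 0.8 * G2 + rnorm(n, sd = 0.5)
toy <- data.frame(G1, G2, G3)

fit <- lm(G3 ~ G1 + G2, data = toy)
b <- coef(fit)   # named vector: (Intercept), G1, G2

# Prediction for a hypothetical student with G1 = 12 and G2 = 13,
# first by hand from the coefficients, then with predict().
new_student <- data.frame(G1 = 12, G2 = 13)
by_hand  <- b[["(Intercept)"]] + b[["G1"]] * 12 + b[["G2"]] * 13
by_model <- unname(predict(fit, newdata = new_student))

print(round(b, 4))
print(c(by_hand = by_hand, by_model = by_model))
```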

- The residuals are the differences between the actual values of G3 and the values of G3 estimated by the linear regression model. As we can see from Figure 19, the residuals' Min value is -9.1217, the Max value is 3.6379, the first quartile is -0.4473, the third quartile is 0.9743 and the median is 0.3160

- The adjusted R-squared informs us that about 82.49% of the total variation in G3 is explained by the independent variables
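For reference, the adjusted R-squared reported by `summary()` can be reproduced from the residuals. The sketch below uses simulated data (all variable names are hypothetical) and the standard formula, with n observations and p predictors.

```r
# Simulated data, for illustration only.
set.seed(2)
x1 <- rnorm(40)
x2 <- rnorm(40)
y  <- 1 + 2 * x1 + rnorm(40)
fit <- lm(y ~ x1 + x2)

n <- length(y)
p <- 2                                   # number of predictors
rss <- sum(residuals(fit)^2)             # residual sum of squares
tss <- sum((y - mean(y))^2)              # total sum of squares
r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

s <- summary(fit)
print(c(r2 = r2, adj_r2 = adj_r2))
print(c(summary_r2 = s$r.squared, summary_adj_r2 = s$adj.r.squared))
```

The adjustment penalizes extra predictors, so the adjusted value is never larger than the plain R-squared.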

- The F-statistic informs us whether G3 depends linearly on the values of the inputs. Let us assume a significance level α = 0.05:

enough evidence to reject H0 and we conclude that it has no effect on G3. As we can see from Figure 19, only “absences”, “G1” and “G2” have an effect

summary(model_2)

Result:

Figure 20: R code and the result of the linear regression model model_2.

ii Model 3 without the variable “sex” from the model 2

Code:

model_3<-lm(G3~age+studytime+failures+absences+G1+G2,data=new_grade)

summary(model_3)

Result:

iii Model 4 without the variable “studytime” from the model 3

Code:

model_4<-lm(G3~age+failures+absences+G1+G2,data=new_grade)

summary(model_4)

Result:

iv Model 5 without the variable “failures” from the model 4

Code:

model_5<-lm(G3~age+absences+G1+G2,data=new_grade)

summary(model_5)
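The step-by-step removal of variables shown in these models can also be written with `update()`, which refits a model with one term dropped. The sketch below runs on simulated data (the variables `x1`, `x2`, `x3` and `y` are hypothetical) and compares the nested models with a partial F-test via `anova()`; it illustrates the general technique and is not the exact procedure used in this report.

```r
# Simulated data, for illustration only; x3 has no real effect on y.
set.seed(3)
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60))
toy$y <- 1 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(60)

full <- lm(y ~ x1 + x2 + x3, data = toy)

# p-values of the coefficients (4th column of the coefficient table).
p_values <- summary(full)$coefficients[, "Pr(>|t|)"]
print(round(p_values, 4))

# Refit without x3; update() reuses the original call and data.
reduced <- update(full, . ~ . - x3)

# Partial F-test comparing the two nested models.
print(anova(reduced, full))
print(c(full = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))
```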

Title	Boxplot Graph
Authors	Lê Trung Hiếu, Long Gia Hưng, Đỗ Minh Huy, Nguyễn Vĩnh Đạt
Instructor	Phan Thi Huong
School	Ho Chi Minh City University Of Technology
Major	Statistics
Type	Report
City	Ho Chi Minh City