Big Assignment, Group 4. Project title: House Price Prediction

VIETNAM NATIONAL UNIVERSITY, HANOI INTERNATIONAL SCHOOL

Students’ Name Students’ ID

Nguyen Anh Duc 20070921

Nguyen Danh Hai 21070277

Dang Tran Bao Ngoc 19071578

Dong Anh Tuan 21070550

Table of Contents

I Introduction

1 About our data

2 About our Project

II Data preprocessing

III Perform multiple linear regression and check collinearity

1 Perform multiple linear regression

2 Check collinearity

IV Interaction models:

V Check assumption & Transformation

1 Check assumption

2 Transformation

VI Build the best possible model

1 Backward search

2 Forward search

3 Stepwise search

4 Search procedures

VII Conclude

I Introduction

1 About our data

Link to the data: https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland?select=apartments_pl_2023_12.csv

Variables:

• city - the name of the city where the property is located

• type - type of the building

• squareMeters - the size of the apartment in square meters

• rooms - number of rooms in the apartment

• floor / floorCount - the floor where the apartment is located and the total number of floors in the building

• buildYear - the year when the building was built

• latitude, longitude - geographic coordinates of the property

• centreDistance - distance from the city centre in km

• poiCount - number of points of interest within 500 m of the apartment (schools, clinics, post offices, kindergartens, restaurants, colleges, pharmacies)

• [poiName]Distance - distance to the nearest point of interest (schools, clinics, post offices, kindergartens, restaurants, colleges, pharmacies)

• ownership - the type of property ownership

• condition - the condition of the apartment

• has[features] - whether the property has key features such as assigned parking space, balcony, elevator, security, storage room

• price - offer price in Polish Zloty (sale offers: sale price, rent offers: monthly rent)

2 About our Project

In this project, we use the R programming language to build a house price prediction model. Our goal is to develop an effective linear regression model that predicts house prices from independent variables such as area, number of rooms, year of construction, location, and other factors that may affect house prices.

II Data preprocessing

We use the read_excel function from the readxl library to read the data from an Excel file. The head function is then used to display the first few rows of the dataset.

We use the ‘select’ function from the ‘dplyr’ package to remove the column named 'id' from the dataset.

We calculate the frequency of each unique value in the 'city' column to select the city with the highest frequency.
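The loading and filtering steps above might look roughly like the following sketch. The toy data frame and its values are made up for illustration, the Excel file name in the comment is hypothetical, and base R is used in place of dplyr so that the snippet runs without extra packages.

```r
# Toy stand-in for the apartments data (values are invented for illustration).
data <- data.frame(
  id    = c("a1", "a2", "a3", "a4", "a5"),
  city  = c("warszawa", "krakow", "warszawa", "gdansk", "warszawa"),
  price = c(800000, 650000, 920000, 700000, 760000)
)

# With the real file one would instead call something like:
#   data <- readxl::read_excel("apartments.xlsx")   # hypothetical file name
head(data)

# Drop the 'id' column (base-R equivalent of dplyr::select(data, -id)).
data$id <- NULL

# Count how often each city occurs, then keep only the most frequent one.
city_counts <- table(data$city)
top_city    <- names(city_counts)[which.max(city_counts)]
subset_data <- data[data$city == top_city, ]

city_counts
top_city
```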

from the subset.

We check whether the dataset contains any null values (is.null) or missing values (is.na).

Based on the results, the dataset has missing values, so we calculate the missing ratio of each column.

Some columns have a fairly large missing ratio, so we remove the columns whose missing ratio exceeds 15% from the dataset.

We remove rows with any missing values in the remaining columns.
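A minimal sketch of the missing-value handling described above, on a small invented data frame. The 15% cut-off comes from the text; the column values are assumptions chosen so that one column exceeds the cut-off.

```r
# Toy data: 'condition' is mostly missing, 'squareMeters' has one missing value.
data <- data.frame(
  squareMeters = c(50, 62, NA, 45, 80, 71, 55, 90, 38, 66),
  condition    = c(NA, NA, "premium", NA, NA, "low", NA, NA, NA, NA),
  price        = c(5, 6, 7, 4, 8, 7, 5, 9, 4, 6) * 1e5
)

# Is anything missing at all?
anyNA(data)

# Share of missing values per column.
missing_ratio <- colMeans(is.na(data))
missing_ratio

# Keep only columns with at most 15% missing values.
data <- data[, missing_ratio <= 0.15, drop = FALSE]

# Drop rows that still contain a missing value.
data <- na.omit(data)
```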

We convert the 'ownership' column values to numeric factors.

We convert the has[features] column values to numeric factors.
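One way these conversions could be written is sketched below. It assumes that 'ownership' holds text labels and that the has[features] columns hold "yes"/"no" strings; the report does not show the raw values, so both are assumptions, as are the example column names.

```r
# Toy data with text-valued categorical columns (assumed encoding).
data <- data.frame(
  ownership       = c("condominium", "cooperative", "condominium"),
  hasParkingSpace = c("yes", "no", "yes"),
  hasBalcony      = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

# ownership: replace labels by integer codes 1, 2, ... and keep it a factor.
# Levels are sorted alphabetically, so condominium -> 1, cooperative -> 2.
data$ownership <- factor(as.integer(factor(data$ownership)))

# has[features]: map "yes"/"no" to 1/0 for every column whose name starts with "has".
has_cols <- grep("^has", names(data), value = TRUE)
data[has_cols] <- lapply(data[has_cols], function(x) factor(ifelse(x == "yes", 1, 0)))

str(data)
```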

We create a new column 'averageDistance' by calculating the average of the centreDistance column and the [poiName]Distance columns.
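As an illustration, the row-wise average could be computed as follows. The three POI distance columns are example names following the [poiName]Distance pattern, and the numbers are invented.

```r
# Toy data with the city-centre distance and three POI distances (in km).
data <- data.frame(
  centreDistance     = c(2.0, 5.0),
  schoolDistance     = c(0.4, 0.2),
  clinicDistance     = c(0.6, 1.0),
  restaurantDistance = c(0.2, 0.2)
)

# Select every column whose name ends in "Distance" (done before the new
# column exists, so the average never includes itself).
dist_cols <- grep("Distance$", names(data), value = TRUE)

# Row-wise mean over the selected distance columns.
data$averageDistance <- rowMeans(data[dist_cols])
data
```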

We remove outliers based on the Z-score to make the data more stable and reduce the impact of outliers on subsequent analyses.

We do:

• Loop Through Columns: Use a ‘for’ loop to iterate through all columns in the dataset.

• Check Numeric Column and Not 'price': Use ‘is.numeric’ to check if the column contains numeric data, and ensure that the column is not 'price', because we do not want to remove outliers from the price column.

• Compute Z-Score: Use ‘scale’ to compute the Z-score for each numeric column and store it in a new column 'z_score'.

• Threshold and Remove Outliers: Set the threshold (‘threshold’) to 3. Rows whose absolute Z-score exceeds this threshold (more than 3 standard deviations from the column mean) will be removed.

• Filter Out Outliers: Use the condition ‘abs(data$z_score) <= threshold’ to keep only the rows that are not outliers.

• Remove 'z_score' Column: After removing outliers, the 'z_score' column is no longer needed and is removed.
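The steps above can be sketched as the loop below. The data are simulated, with one deliberately extreme apartment size added so that the filter has something to remove; only the threshold of 3 and the function names come from the text.

```r
set.seed(1)
# Simulated data: 50 ordinary rows plus one extreme size (400 square meters).
data <- data.frame(
  squareMeters = c(rnorm(50, mean = 60, sd = 10), 400),
  price        = c(rnorm(50, mean = 7e5, sd = 1e5), 5e6)
)

threshold <- 3

for (col in names(data)) {
  # Only numeric predictors are screened; 'price' is left untouched.
  if (is.numeric(data[[col]]) && col != "price") {
    # Z-score of the current column.
    data$z_score <- as.numeric(scale(data[[col]]))
    # Keep rows within 3 standard deviations of the column mean.
    data <- data[abs(data$z_score) <= threshold, ]
    # The helper column is no longer needed.
    data$z_score <- NULL
  }
}

nrow(data)
```

Because rows are dropped column by column, the Z-scores of later columns are computed on the already filtered data.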

Finally, we remove some variables that are not of interest for the analysis.

III Perform multiple linear regression and check collinearity

1 Perform multiple linear regression

We build the model with “price” as the dependent variable.

R-squared (R²): The R² value is 0.7508, which indicates that our model explains about 75.08% of the variation in the dependent variable (house price) using the independent variables in the model. The independent variables have very low p-values (< 0.05), which shows that all the selected independent variables have a statistically significant influence on house prices.
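For reference, a fit of this kind and the quantities discussed above (R² and coefficient p-values) can be obtained as in this sketch. The data are simulated and the three predictors are only a small illustrative subset, so the numbers will not match the 0.7508 reported here.

```r
set.seed(42)
n <- 200
# Simulated predictors (an illustrative subset of the real variables).
data <- data.frame(
  squareMeters = runif(n, 25, 120),
  rooms        = sample(1:5, n, replace = TRUE),
  buildYear    = sample(1950:2023, n, replace = TRUE)
)
# Simulated price with noise; the coefficients are arbitrary.
data$price <- 15000 * data$squareMeters + 20000 * data$rooms +
  1000 * (data$buildYear - 1950) + rnorm(n, sd = 80000)

# Regress price on all remaining columns.
model <- lm(price ~ ., data = data)
s <- summary(model)

s$r.squared              # share of price variation explained by the model
coef(s)[, "Pr(>|t|)"]    # p-value of each coefficient
```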

2 Check collinearity

We use the variance inflation factor (VIF) to evaluate collinearity between the independent variables.

All VIF values were acceptable, with no variable showing strong collinearity, so we do not need to remove any variables.

However, the two variables "squareMeters" and "rooms" have much higher VIF values than the remaining variables, so we should check these variables.
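To make the VIF concrete: for a numeric predictor it equals 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated data; the report does not say which function was used (the `vif` function from the car package is a common choice), so `vif_manual` is a hypothetical helper written for illustration.

```r
# VIF of each predictor: regress it on the other predictors, then 1 / (1 - R^2).
vif_manual <- function(df, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  })
}

set.seed(7)
n <- 300
squareMeters <- runif(n, 25, 120)
# Simulated predictors: 'rooms' is built to correlate with 'squareMeters'.
toy <- data.frame(
  squareMeters = squareMeters,
  rooms        = round(squareMeters / 25 + rnorm(n, sd = 0.5)),
  buildYear    = sample(1950:2023, n, replace = TRUE)
)

vifs <- vif_manual(toy, names(toy))
vifs
```

In this toy setup squareMeters and rooms get clearly larger VIF values than buildYear, which mirrors the pattern described above.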

Model “model”, which includes the squareMeters variable, has better predictive performance than model 1. Retaining this variable may provide more accurate predictions of house prices.

Model “model”, which includes the rooms variable, has better predictive performance than model 1. Retaining this variable may provide more accurate predictions of house prices.

Model “model”, which includes both variables, has better predictive performance than model 1. Retaining both variables may provide more accurate predictions of house prices.
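A comparison of this kind could be sketched as follows, assuming that "model 1" denotes the fit with the variable left out and that adjusted R² is the performance measure; the report states neither explicitly. The data are simulated.

```r
set.seed(3)
n <- 200
toy <- data.frame(squareMeters = runif(n, 25, 120))
toy$rooms <- round(toy$squareMeters / 25 + rnorm(n, sd = 0.5))
toy$price <- 15000 * toy$squareMeters + 20000 * toy$rooms + rnorm(n, sd = 80000)

# Full model versus a reduced model that drops squareMeters.
full    <- lm(price ~ squareMeters + rooms, data = toy)
reduced <- lm(price ~ rooms, data = toy)

# Adjusted R^2 penalises extra predictors, so it is comparable across the two fits.
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
adj_r2
```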

IV Interaction models:

The variable squareMeters: Has a very significant impact on the price, with a high F value and a p-value below the threshold of statistical significance (p < 0.001). This indicates that the property's area is an important factor influencing the price.

The variable ownership: Also significantly affects the price, with a very low p-value (p < 0.001), meaning that ownership status (condominium or cooperative) is also an important factor.

Interaction squareMeters:ownership: Has a significant impact (p < 0.05), indicating that the relationship between the area and price may vary depending on the property's ownership status.
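An interaction model of this form, and the ANOVA table that yields the three tests above, can be produced as in the following sketch. The data are simulated with a smaller price-per-square-meter slope for ownership level 2, so the output only illustrates the structure of the results.

```r
set.seed(11)
n <- 300
toy <- data.frame(
  squareMeters = runif(n, 25, 120),
  ownership    = factor(sample(1:2, n, replace = TRUE))
)
# Simulated prices: level 2 has a flatter slope than level 1.
slope <- ifelse(toy$ownership == "1", 17500, 14400)
toy$price <- 6400 + slope * toy$squareMeters + rnorm(n, sd = 100000)

# 'a * b' expands to a + b + a:b (both main effects plus the interaction).
fit <- lm(price ~ squareMeters * ownership, data = toy)

anova(fit)   # one F test per term: squareMeters, ownership, squareMeters:ownership
coef(fit)    # intercept, slope, level-2 shift, and slope difference for level 2
```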

Intercept (6402.181): This is the estimated price when squareMeters is zero and ownership2 (which seems to represent cooperative ownership) is zero. It is the base price for the reference category, which is likely to be condominiums.

squareMeters (17519.252): This coefficient suggests that for each additional square meter, the price increases by approximately 17,519 PLN for the reference ownership category.

ownership2 (32859.822): The positive coefficient for ownership2 indicates that cooperatives are priced approximately 32,860 PLN higher than the base category (condominiums) when the size is zero. This is somewhat counterintuitive given the plot, suggesting the need to carefully interpret this in the context of the data and the model used.

squareMeters:ownership2 (-3089.111): The negative interaction term suggests that the price increase per additional square meter is approximately 3,089 PLN smaller for cooperatives than for condominiums.
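A short worked calculation with the four reported coefficients may help reconcile the positive ownership2 term with the plot. Under the fitted model, the cooperative line starts higher at size zero but rises more slowly, so the two lines cross at roughly 10.6 square meters; beyond that size the model predicts lower prices for cooperatives. The function name `predict_price` is a hypothetical helper.

```r
# Coefficients as reported in the text.
b0     <- 6402.181    # intercept (reference category, size 0)
b_sqm  <- 17519.252   # slope for the reference category
b_own2 <- 32859.822   # shift for ownership2 at size 0
b_int  <- -3089.111   # slope difference for ownership2

# Price per additional square meter in each category.
slope_condo <- b_sqm
slope_coop  <- b_sqm + b_int      # 14430.141

# Predicted price; cooperative is 1 for ownership2 and 0 otherwise.
predict_price <- function(sqm, cooperative) {
  b0 + b_sqm * sqm + cooperative * (b_own2 + b_int * sqm)
}

# Size at which both categories have the same predicted price.
crossing_sqm <- -b_own2 / b_int

slope_coop
crossing_sqm
predict_price(50, cooperative = 0) - predict_price(50, cooperative = 1)
```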

The red line (condominiums) has a steeper slope than the blue line (cooperatives), which aligns with the negative interaction term in the model. This indicates that the price of condominiums increases faster with each additional square meter than the price of cooperatives.

The red line is generally above the blue line, especially as the square meters increase, suggesting that condominiums tend to be more expensive than cooperatives for the same size.

The intersection of the lines suggests that at smaller sizes, cooperatives may be priced similarly or slightly higher than condominiums, but as size increases, condominiums become significantly more expensive.

In summary, size has a positive effect on the price for both types of properties, but the increase in price with size is more pronounced for condominiums than for cooperatives.

There is a clear positive correlation between property size and price for both categories of ownership.

The solid line representing ownership category 1 generally has higher prices across most sizes compared to the dashed line of ownership category 2.

Ownership category 1 shows a greater spread in prices, especially for larger properties, indicating more variability in price at higher square meters for this category.

Category 2, while following the same positive trend, seems to have a less steep slope, suggesting a slower rate of price increase per square meter compared to category 1.

The slope of the trend line for ownership type 1 is steeper, indicating a larger increase in price with each additional room compared with ownership type 2.

Ownership also significantly affects price (p < 0.001), suggesting differences in price based on ownership status.

The interaction term squareMeters:ownership is significant (p < 0.05), which implies the effect of property size on price varies by ownership type.

ANOVA Table Insights:

buildYear: The variable is statistically significant (p < 0.001), indicating that the year a property was built affects its price. Properties built in different years may have varying prices, likely due to factors like design, materials, and building standards evolving over time.

ownership: This is also a significant predictor (p < 0.001), suggesting that the ownership status of a property (whether it is type 1 or type 2) has a substantial impact on its price.

buildYear:ownership Interaction: The interaction term is significant (p < 0.05), implying that the effect of the build year on price differs depending on the ownership status.

The plot shows a dispersion of prices across different build years for two ownership categories.

There is a general trend where newer properties (those built closer to 2025) seem to have higher prices, especially for ownership type 1 (blue points).

Ownership type 2 (orange points) shows a more clustered trend, with less variation in price across the build years.

The interaction effect is visible as the divergence between the two trend lines for different ownership types, indicating that the build year's impact on price varies by ownership status.

Key Points from the ANOVA Table:

poiCount: Significant effect on price (p < 0.001), suggesting that the number of POIs near a property is associated with its price.

ownership: Also a significant predictor (p < 0.001), meaning that the ownership type influences the property price.

poiCount:ownership Interaction: Not significant (p = 0.6755), indicating that the effect of POI count on price does not differ significantly between the two ownership statuses.

There is a general trend where an increase in POI count is associated with an increase in price for both ownership types.

Ownership type 1 (blue points) displays a wide spread in prices across POI counts, suggesting a varied influence of POIs on price.

Ownership type 2 (orange points) shows a trend line with a flatter slope, indicating a less pronounced effect of POI count on price compared to ownership type 1.

Model 1:

This is a more complex model with many interaction terms.

It considers the effects of square meters, the number of rooms, build year, point of interest count (poiCount), average price distance, and advertising count (count_adv), along with all their two-way interactions.

The interaction terms suggest that the relationship between any of these predictors and the property price is not constant but varies depending on the level of another predictor.

The complexity of the model indicates an attempt to capture a more nuanced relationship between the predictors and the property price.

Model 2:

This is a simpler model compared to Model 1, as it includes fewer interaction terms.

It still captures the main effects and some interaction effects among

The F statistic (2.869) and the associated p-value (< 0.001) indicate that there is a statistically significant difference in the fit of the two models, with Model 2 providing a better fit to the data.

Model 1:

Incorporates interaction terms, suggesting an attempt to capture the nuanced effects of these variables on property prices. For instance, it includes the interaction between squareMeters, rooms, buildYear, poiCount, ownership, and average price distance with count_adv (advertising count).

The complexity of Model 1 indicates that it tries to model the price by considering how these factors might influence each other in addition to their individual effects.

Model 2:

Appears to be a more straightforward model without interaction terms.

It includes variables like squareMeters, rooms, and buildYear, along with distances to various points of interest (e.g., schools, clinics, restaurants), and amenities (e.g., parking, balcony, elevator, security).

This model seems to assume that each factor contributes independently to the property price.

ANOVA Comparison:

The ANOVA results show a significant difference between the models (F = 4.6338, p < 0.001). This indicates that the additional complexity in Model 1 (with interaction terms) provides a statistically better fit to the data than Model 2.
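The comparison above is a partial F test between nested models, which `anova()` performs when given two `lm` fits. The sketch below uses simulated data and two predictors only, so `model1` and `model2` here are small stand-ins for the report's models rather than reproductions of them.

```r
set.seed(5)
n <- 300
toy <- data.frame(
  squareMeters = runif(n, 25, 120),
  buildYear    = sample(1950:2023, n, replace = TRUE)
)
# Simulated price that genuinely contains an interaction effect.
toy$price <- 10000 * toy$squareMeters + 500 * (toy$buildYear - 1950) +
  100 * toy$squareMeters * (toy$buildYear - 1950) + rnorm(n, sd = 80000)

model2 <- lm(price ~ squareMeters + buildYear, data = toy)   # main effects only
model1 <- lm(price ~ squareMeters * buildYear, data = toy)   # adds the interaction

# Smaller model first; the second row reports F and its p-value.
cmp <- anova(model2, model1)
cmp
```

A small p-value in the second row indicates that the extra terms reduce the residual sum of squares by more than chance alone would explain.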

V Check assumption & Transformation

1 Check assumption

Durbin-Watson Test for Autocorrelation:

• The Durbin-Watson statistic tests the null hypothesis that there is no autocorrelation among the residuals.

• A value close to 2 suggests there is no autocorrelation.

• The statistic value here is 1.940218 with a p-value of 0.084.

• Since the p-value is greater than the common alpha level of 0.05, we fail to reject the null hypothesis, indicating there is not enough evidence of autocorrelation in the residuals.
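The statistic itself is simple to compute from the residuals: the sum of squared successive differences divided by the sum of squared residuals. The base-R sketch below does this on simulated data with independent errors. It does not produce a p-value; packaged tests such as `durbinWatsonTest` in car or `dwtest` in lmtest do, and the report does not state which one was used.

```r
set.seed(9)
n <- 200
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)    # independent errors, so no autocorrelation
fit <- lm(y ~ x)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares. It lies between 0 and 4.
e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.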

Breusch-Pagan Test for Homoscedasticity:

• This test checks if the variance of errors from a regression is constant (homoscedastic).

• A low p-value (here practically zero) leads us to reject the null hypothesis of constant variance, indicating the presence of heteroscedasticity in the model residuals.
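As an illustration of what the test computes, the studentized (Koenker) form of the Breusch-Pagan test regresses the squared residuals on the predictors and compares n times the R² of that auxiliary regression with a chi-squared distribution. The sketch below implements this by hand on simulated data whose error spread grows with x; whether the report used this variant or a packaged function such as `bptest` from lmtest is not stated.

```r
set.seed(13)
n <- 300
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)   # error spread grows with x
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor.
e2  <- residuals(fit)^2
aux <- lm(e2 ~ x)

# LM statistic n * R^2, chi-squared with df = number of predictors (here 1).
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

lm_stat
p_value
```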

Fitted vs Residuals Plot:

• This is a scatter plot of the residuals against the values predicted by the model. The points are colored blue, and an orange dashed line runs across the middle of the plot at y = 0, marking where the observed value equals the predicted value.

• For fitted values of 1,000,000 or less the residuals are unevenly spread, while for fitted values above 1,000,000 they appear more evenly distributed.

From the Shapiro-Wilk Test:

• The test statistic (`W`) is 0.96381.

• The p-value is less than 2.2e-16, which is a very small number, essentially zero for practical purposes.

• Given that the p-value is less than common significance levels (e.g., 0.05), we reject the null hypothesis of normality. This suggests that the residuals do not follow a normal distribution.

From the Q-Q Plot (not shown but the command is provided):

• The `qqnorm()` function would create a normal Q-Q plot, which is a graphical tool to assess if the residuals follow a normal distribution.

• Points following a straight line (usually compared against a line created by `qqline()`) indicate normality.

• The `col = "red"` argument in `qqline()` would draw this reference line in red on the plot.
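The normality checks described above could be run as in this sketch. The data are simulated with skewed (exponential) errors so that the test rejects normality, as it does in the report; the numbers themselves are not the report's.

```r
set.seed(21)
x <- runif(300)
y <- 1 + 2 * x + rexp(300)     # skewed errors, so residuals are not normal
fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test (base R accepts between 3 and 5000 observations).
sw <- shapiro.test(res)
sw$statistic
sw$p.value

# Normal Q-Q plot with a red reference line.
if (!interactive()) pdf(NULL)            # draw to a null device when run as a script
qqnorm(res)
qqline(res, col = "red")
if (!interactive()) invisible(dev.off())
```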

2 Transformation

Title	House Price Prediction
Authors	Nguyen Anh Duc, Nguyen Danh Hai, Dang Tran Bao Ngoc, Dong Anh Tuan
Supervisor	PhD. Pham Thi Viet Huong
University	Vietnam National University
Major	International School
Type	Big Assignment
City	Hanoi
