VIETNAM NATIONAL UNIVERSITY, HANOI INTERNATIONAL SCHOOL
Students’ Name Students’ ID
Nguyen Anh Duc 20070921
Nguyen Danh Hai Dang 21070277
Tran Bao Ngoc 19071578
Dong Anh Tuan 21070550
Table of Contents
I Introduction:
1 About our data
2 About our Project
II Data preprocessing
III Perform multiple linear regression and check collinearity
1 Perform multiple linear regression
2 Check collinearity
IV Interaction models:
V Check assumption & Transformation
1 Check assumption
2 Transformation
VI Build the best possible model
1 Backward search
2 Forward search
3 Stepwise search
4 Search procedures
VII Conclude
I Introduction:
1 About our data
Link to the data:
https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland?select=apartments_pl_2023_12.csv
Variables:
city - the name of the city where the property is located
type - type of the building
squareMeters - the size of the apartment in square meters
rooms - number of rooms in the apartment
floor / floorCount - the floor where the apartment is located and the total number of floors in the building
buildYear - the year when the building was built
latitude, longitude - geographic coordinates of the property
centreDistance - distance from the city centre in km
poiCount - number of points of interest within a 500 m range of the apartment (schools, clinics, post offices, kindergartens, restaurants, colleges, pharmacies)
[poiName]Distance - distance to the nearest point of interest (schools, clinics, post offices, kindergartens, restaurants, colleges, pharmacies)
ownership - the type of property ownership
condition - the condition of the apartment
has[features] - whether the property has key features such as assigned parking space, balcony, elevator, security, storage room
price - offer price in Polish Zloty (sale offers: sale price, rent offers: monthly rent)
2 About our Project
In this project, we use the R programming language to build a house price prediction model based on influencing factors. Our goal is to develop an effective linear regression model that predicts house prices from independent variables such as area, number of rooms, year of construction, location, and other factors that may affect house prices.
II Data preprocessing
We use the read_excel function from the readxl library to read data from an Excel file. The head function is then used to display the first few rows of the dataset.
We use the ‘select’ function from the ‘dplyr’ package to remove the column named 'id' from the dataset.
We calculate the frequency of each unique value in the 'city' column to select the city with the highest frequency.
[...] from the subset.
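The city-selection step can be sketched in base R as follows. This is only an illustration: the data frame `apartments` and its values are invented, and the report's own code may differ.

```r
# Toy stand-in for the apartments data (values are invented for illustration)
apartments <- data.frame(
  city  = c("warszawa", "krakow", "warszawa", "gdansk", "warszawa", "krakow"),
  price = c(800000, 650000, 920000, 700000, 760000, 610000)
)

# Frequency of each unique city, sorted from most to least common
city_counts <- sort(table(apartments$city), decreasing = TRUE)

# Name of the most frequent city, then the subset for that city only
top_city <- names(city_counts)[1]
city_subset <- apartments[apartments$city == top_city, ]
```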
We check whether the dataset contains any null values (is.null) or missing values (is.na).
Based on the results, the dataset has missing values, so we calculate the missing ratio of each column.
We see that some columns have a fairly large missing rate, so we remove the columns with high missing ratios (>15%) from the dataset.
We remove rows with any missing values in the remaining columns.
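A minimal base R sketch of the missing-value handling described above, run on an invented data frame (the name `df` and all values are assumptions for illustration):

```r
# Toy data with missing values (invented for illustration)
df <- data.frame(
  squareMeters = c(50, 62, NA, 48, 75, 80, 55, 43, 66, 90),
  buildYear    = c(NA, NA, 1990, NA, 2005, 2010, NA, 1975, NA, 2001),
  price        = c(5e5, 6e5, 5.5e5, 4.8e5, 7.5e5, 8e5, 5.2e5, 4e5, 6.4e5, 9e5)
)

# Share of missing values in each column
missing_ratio <- colMeans(is.na(df))

# Keep only the columns whose missing ratio is at most 15%
df_kept <- df[, missing_ratio <= 0.15, drop = FALSE]

# Drop the rows that still contain any missing value
df_clean <- df_kept[complete.cases(df_kept), , drop = FALSE]
```

In this toy example buildYear is dropped (half of its values are missing), and one row is then removed because squareMeters is missing.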
We convert the 'ownership' column values to numeric factors.
We convert the has[features] column values to numeric factors.
We create a new column 'averageDistance' by calculating the average of the centreDistance column and the [poiName]Distance columns.
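The factor conversion and the averaged distance column might look like the sketch below. The column names hasBalcony, schoolDistance and clinicDistance are assumed examples of the has[features] and [poiName]Distance patterns, and the values are invented.

```r
# Toy data (column names follow the patterns in the variable list; values invented)
df <- data.frame(
  ownership      = c("condominium", "cooperative", "condominium"),
  hasBalcony     = c("yes", "no", "yes"),
  centreDistance = c(2.0, 5.0, 8.0),
  schoolDistance = c(0.4, 0.2, 1.0),
  clinicDistance = c(0.6, 0.8, 3.0)
)

# Convert categorical columns to factors, then to their integer codes
# (levels are ordered alphabetically, so "condominium" becomes 1)
df$ownership  <- as.numeric(factor(df$ownership))
df$hasBalcony <- as.numeric(factor(df$hasBalcony))

# Average centreDistance together with every other column ending in "Distance"
distance_cols <- grep("Distance$", names(df), value = TRUE)
df$averageDistance <- rowMeans(df[, distance_cols])
```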
We remove outliers based on Z-Score to make the data more stable and reduce the impact of outliers on subsequent analyses.
The procedure is:
Loop Through Numeric Columns: Use a ‘for’ loop to iterate through all columns in the dataset.
Check Numeric Column and Not 'price': Use ‘is.numeric’ to check whether the column contains numeric data, and ensure that the column is not 'price', because we do not want to remove outliers from the price column.
Compute Z-Score: Use ‘scale’ to compute the Z-Score for each numeric column and store it in a new column 'z_score'.
Threshold and Remove Outliers: Set the threshold (‘threshold’) to 3. Rows whose absolute Z-Score exceeds this threshold (that is, values more than 3 standard deviations from the mean) will be removed.
Filter Out Outliers: Use the condition ‘abs(data$z_score) <= threshold’ to keep only the rows that fall within the threshold.
Remove 'z_score' Column: After removing outliers, the 'z_score' column is no longer needed and is removed.
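The steps above can be sketched as a short loop. The data here are simulated, with one deliberately extreme row appended, so the names and numbers are illustrative assumptions rather than the project's data.

```r
set.seed(1)
# Simulated data; the last row is a deliberately extreme observation
data <- data.frame(
  squareMeters = c(rnorm(50, mean = 60, sd = 10), 400),
  price        = c(rnorm(50, mean = 7e5, sd = 1e5), 3e6)
)
threshold <- 3

for (col in names(data)) {
  # Only numeric predictors are screened; 'price' is left untouched
  if (is.numeric(data[[col]]) && col != "price") {
    # Z-Score of the current column, stored temporarily
    data$z_score <- as.numeric(scale(data[[col]]))
    # Keep rows within 3 standard deviations of the column mean
    data <- data[abs(data$z_score) <= threshold, ]
    # The helper column is no longer needed
    data$z_score <- NULL
  }
}
```

Note that the Z-Scores are recomputed on the already filtered data for each subsequent column, so the order of the columns can influence which rows are removed.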
Finally, we remove some variables that are not of interest for the analysis.
III Perform multiple linear regression and check collinearity
1 Perform multiple linear regression
We build the model with “price” as the dependent variable.
R-squared (R²): The R² value is 0.7508, which indicates that our model explains about 75.08% of the variation in the dependent variable (house price) based on the independent variables used in the model.
The independent variables all have very low p-values (< 0.05), which shows that each selected independent variable has a statistically significant effect on house prices.
2 Check collinearity
We use the variance inflation factor (VIF) to evaluate the collinearity between independent variables.
All VIF values are acceptable, with no variable showing strong collinearity, so we do not need to remove any variables.
However, the two variables "squareMeters" and "rooms" have much higher VIF values than the remaining variables, so we should check these variables.
Model “model”, which includes the squareMeters variable, has better predictive performance than model 1. Retaining this variable may give more accurate house price predictions.
Model “model”, which includes the rooms variable, has better predictive performance than model 1. Retaining this variable may give more accurate house price predictions.
Model “model”, which includes both variables, has better predictive performance than model 1. Retaining both variables may give more accurate house price predictions.
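The report does not show which function produced the VIF values (the `vif()` function in the car package is a common choice). As a hedged illustration of what a VIF measures, the sketch below computes it directly from its definition on simulated data in which squareMeters and rooms are correlated by construction; all names and values are assumptions.

```r
set.seed(42)
n <- 200
rooms        <- sample(1:5, n, replace = TRUE)
squareMeters <- 20 * rooms + rnorm(n, sd = 8)    # deliberately correlated with rooms
buildYear    <- sample(1950:2023, n, replace = TRUE)
price        <- 10000 * squareMeters + 500 * (buildYear - 1950) + rnorm(n, sd = 50000)
df <- data.frame(price, squareMeters, rooms, buildYear)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
predictors <- c("squareMeters", "rooms", "buildYear")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
  1 / (1 - r2)
})
```

In this simulation the two correlated predictors get clearly larger VIF values than buildYear, which mirrors the pattern described for squareMeters and rooms.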
IV Interaction models:
The variable squareMeters: Has a very significant impact on the price, with a high F value and a p-value below the threshold of statistical significance (p < 0.001). This indicates that the property's area is an important factor influencing the price.
The variable ownership: Also significantly affects the price, with a very low p-value (p < 0.001), meaning that the ownership type (condominium or cooperative) is also an important factor.
Interaction squareMeters:ownership: Has a significant impact (p < 0.05), indicating that the relationship between the area and the price may vary depending on the property's ownership type.
Intercept (6402.181): This is the estimated price when squareMeters and ownership2 (which seems to represent cooperative ownership) are both zero. It is the base price for the reference category, which is likely to be condominiums.
squareMeters (17519.252): This coefficient suggests that, for the reference category, each additional square meter increases the price by approximately 17,519 PLN.
ownership2 (32859.822): The positive coefficient for ownership2 indicates that cooperatives are, on average, priced about 32,860 PLN higher than the base category (condominiums) when the size is zero. This is somewhat counterintuitive given the plot, so it needs to be interpreted carefully in the context of the data and the model used.
squareMeters:ownership2 (-3089.111): The negative interaction term suggests that the positive effect of square meters on price is smaller for cooperatives than for condominiums, by approximately 3,089 PLN per square meter.
The red line (condominiums) has a steeper slope than the blue line (cooperatives), which aligns with the negative interaction term in the model. This indicates that the price per square meter for condominiums increases faster than for cooperatives.
The red line is generally above the blue line, especially as the square meters increase, suggesting that condominiums tend to be more expensive than cooperatives for the same size.
The intersection of the lines suggests that at smaller sizes, cooperatives may be priced similarly or slightly higher than condominiums, but as size increases, condominiums become significantly more expensive.
In summary, size has a positive effect on the price for both types of properties, but the increase in price with size is more pronounced for condominiums than for cooperatives.
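To make the interpretation concrete, the reported coefficients can be combined arithmetically. The sketch below uses only the four estimates quoted above; the variable names and the helper `predict_price()` are hypothetical and introduced here for illustration.

```r
# Coefficients as reported in the text above
b_intercept   <- 6402.181
b_sqm         <- 17519.252
b_coop        <- 32859.822     # ownership2
b_interaction <- -3089.111     # squareMeters:ownership2

# Price increase per square meter for each ownership type
slope_condo <- b_sqm
slope_coop  <- b_sqm + b_interaction

# Size at which the two fitted lines cross:
# b_coop + b_interaction * x = 0
crossover_sqm <- -b_coop / b_interaction

# Fitted price for a given size; cooperative is 0 (condominium) or 1 (cooperative)
predict_price <- function(sqm, cooperative) {
  b_intercept + b_sqm * sqm + cooperative * (b_coop + b_interaction * sqm)
}
```

Under these estimates the cooperative slope is about 14,430 per square meter, and the two fitted lines cross at roughly 10.6 square meters, above which the condominium line is the higher one.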
There is a clear positive correlation between property size and price for both categories of ownership.
The solid line representing ownership category 1 generally has higher prices across most sizes compared to the dashed line of ownership category 2.
Ownership category 1 shows a greater spread in prices, especially for larger properties, indicating more variability in price at higher square meters for this category.
Category 2, while following the same positive trend, seems to have a less steep slope, suggesting a slower rate of price increase per square meter compared to category 1.
The slope of the trend line for ownership type 1 is steeper, indicating a larger increase in price with each additional room compared with ownership type 2.
Ownership also significantly affects price (p < 0.001), suggesting differences in price based on ownership type.
The interaction term squareMeters:ownership is significant (p < 0.05), which implies that the effect of property size on price varies by ownership type.
ANOVA Table Insights:
buildYear: The variable is statistically significant (p < 0.001), indicating that the year a property was built affects its price. Properties built in different years may have varying prices, likely due to factors like design, materials, and building standards evolving over time.
ownership: This is also a significant predictor (p < 0.001), suggesting that the ownership type of a property (type 1 or type 2) has a substantial impact on its price.
buildYear:ownership Interaction: The interaction term is significant (p < 0.05), implying that the effect of the build year on price differs depending on the ownership type.
The plot shows a dispersion of prices across different build years for two ownership categories.
There is a general trend where newer properties (those built closer to 2025) seem to have higher prices, especially for ownership type 1 (blue points).
Ownership type 2 (orange points) shows a more clustered trend, with less variation in price across the build years.
The interaction effect is visible as the divergence between the two trend lines for different ownership types, indicating that the build year's impact on price varies by ownership type.
Key Points from the ANOVA Table:
poiCount: Significant effect on price (p < 0.001), suggesting that the number of POIs near a property has a positive correlation with its price.
ownership: Also a significant predictor (p < 0.001), meaning that the ownership type influences the property price.
poiCount:ownership Interaction: Not significant (p = 0.6755), indicating that the effect of POI count on price does not differ significantly between the two ownership types.
There is a general trend where an increase in POI count is associated with an increase in price for both ownership types.
Ownership type 1 (blue points) displays a wide spread in prices across POI counts, suggesting a varied influence of POIs on price.
Ownership type 2 (orange points) shows a trend line with a flatter slope, indicating a less pronounced effect of POI count on price compared to ownership type 1.
Model 1:
This is a more complex model with many interaction terms.
It considers the effects of square meters, the number of rooms, build year, point of interest count (poiCount), average price distance, and advertising count (count_adv), along with all their two-way interactions.
The interaction terms suggest that the relationship between any of these predictors and the property price is not constant but varies depending on the level of another predictor.
The complexity of the model indicates an attempt to capture a more nuanced relationship between the predictors and the property price.
Model 2:
This is a simpler model compared to Model 1, as it includes fewer interaction terms.
It still captures the main effects and some interaction effects among the predictors.
The F statistic (2.869) and the associated p-value (< 0.001) indicate that there is a statistically significant difference in the fit of the two models, with Model 2 providing a better fit to the data.
Model 1:
Incorporates interaction terms, suggesting an attempt to capture the nuanced effects of these variables on property prices. For instance, it includes the interaction between squareMeters, rooms, buildYear, poiCount, ownership, and average price distance with count_adv (advertising count).
The complexity of Model 1 indicates that it is trying to model the price by considering how these factors might influence each other in addition to their individual effects.
Model 2:
Appears to be a more straightforward model without interaction terms.
It includes variables like squareMeters, rooms, and buildYear, along with distances to various points of interest (e.g., schools, clinics, restaurants), and amenities (e.g., parking, balcony, elevator, security).
This model seems to consider that each factor independently contributes to the property price.
ANOVA Comparison:
The ANOVA results show a significant difference between the models (F = 4.6338, p < 0.001). This indicates that the additional complexity in Model 1 (with interaction terms) provides a statistically better fit to the data than Model 2.
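A comparison of this kind is a partial F-test between nested models, which base R performs with `anova()`. The sketch below runs it on simulated data with a built-in squareMeters-by-ownership interaction; the model names and all values are assumptions and do not reproduce the report's models.

```r
set.seed(7)
n <- 300
squareMeters <- runif(n, 25, 120)
rooms        <- pmax(1, round(squareMeters / 25 + rnorm(n, sd = 0.5)))
ownership    <- factor(sample(c(1, 2), n, replace = TRUE))
# Simulated price with a true squareMeters-by-ownership interaction
price <- 6000 + 17500 * squareMeters -
  3000 * squareMeters * (ownership == 2) + rnorm(n, sd = 80000)
df <- data.frame(price, squareMeters, rooms, ownership)

# Main effects only, versus main effects plus all two-way interactions
additive_model    <- lm(price ~ squareMeters + rooms + ownership, data = df)
interaction_model <- lm(price ~ (squareMeters + rooms + ownership)^2, data = df)

# Partial F-test: do the extra terms reduce the residual error significantly?
comparison <- anova(additive_model, interaction_model)
p_value <- comparison[["Pr(>F)"]][2]
```

A small p-value favours the larger model. The test is only valid when the smaller model is nested in the larger one, as it is here.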
V Check assumption & Transformation
1 Check assumption
Durbin-Watson Test for Autocorrelation:
The Durbin-Watson statistic tests the null hypothesis that there is no autocorrelation among the residuals.
A value close to 2 suggests there is no autocorrelation.
The statistic value here is 1.940218 with a p-value of 0.084.
Since the p-value is greater than the common alpha level of 0.05, we fail to reject the null hypothesis, indicating there is not enough evidence of autocorrelation in the residuals.
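For reference, the Durbin-Watson statistic itself is simple to compute from the residuals. The sketch below does so by hand on simulated data with independent errors; it does not compute the p-value, for which functions such as `dwtest()` in the lmtest package are typically used. All names and values are illustrative assumptions.

```r
set.seed(3)
x <- 1:200
y <- 5 + 2 * x + rnorm(200, sd = 10)   # independent errors by construction
fit <- lm(y ~ x)

e <- residuals(fit)
# DW = sum of squared successive differences / sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
```

The statistic always lies between 0 and 4; values near 2 are what independent residuals produce, which is why 1.94 in the report is read as no evidence of autocorrelation.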
Breusch-Pagan Test for Homoscedasticity:
This test checks whether the variance of the errors from a regression is constant.
A low p-value (here practically zero) leads us to reject the null hypothesis of constant variance, indicating the presence of heteroscedasticity in the model residuals.
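The idea behind the test can be sketched by hand: regress the squared residuals on the predictors and check whether they explain any of the variation. The example below implements the studentized (Koenker) form, which is also the default of `bptest()` in the lmtest package, on simulated data whose error spread grows with the predictor. Names and values are assumptions for illustration.

```r
set.seed(11)
n <- 400
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = x)      # error spread grows with x (heteroscedastic)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized (Koenker) LM statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution (df = number of predictors)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
```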
Fitted vs Residuals Plot:
This is a scatter plot of the residuals against the values predicted by the model. The points are colored blue, and there is an orange dashed line across the middle of the plot at y = 0, indicating where the residual is zero, that is, where the observed value equals the predicted value.
For fitted values of 1,000,000 or less the residuals are unevenly distributed, while for fitted values above 1,000,000 they appear evenly distributed.
From the Shapiro-Wilk Test:
The test statistic (`W`) is 0.96381.
The p-value is less than 2.2e-16, which is a very small number, essentially zero for practical purposes.
Given that the p-value is less than common significance levels (e.g., 0.05), we reject the null hypothesis of normality. This suggests that the residuals do not follow a normal distribution.
From the Q-Q Plot (not shown but the command is provided):
The `qqnorm()` function would create a normal Q-Q plot, which is a graphical tool to assess if the residuals follow a normal distribution.
Points following a straight line (usually compared against a line created by `qqline()`) indicate normality.
The `col = "red"` argument in `qqline()` would draw this reference line in red on the plot.
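The two normality checks can be run together as in the sketch below. The data are simulated with deliberately skewed errors, so the names and values are assumptions and the output will not match the report's figures.

```r
set.seed(5)
x <- runif(300, 0, 10)
y <- 1 + 2 * x + (rexp(300) - 1) * 3   # skewed, non-normal errors by construction
fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)

# Normal Q-Q plot with a red reference line
pdf(NULL)                              # discards the plot when run as a script;
                                       # omit this line in an interactive session
qqnorm(res)
qqline(res, col = "red")
invisible(dev.off())
```

Note that `shapiro.test()` accepts between 3 and 5000 observations, so for larger datasets the residuals would need to be sampled first.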
2 Transformation