VIET NAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
LAM HA TUAN CANH
GRADUATION THESIS
APPLYING PREDICTION MODELS TO
FORECAST REAL ESTATE PRICES
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
HO CHI MINH CITY, 2021
LAM HA TUAN CANH - 15520056
GRADUATION THESIS
APPLYING PREDICTION MODELS TO
FORECAST REAL ESTATE PRICES
IN HO CHI MINH CITY
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
THESIS ADVISOR
Dr CAO THI NHAN
ASSESSMENT COMMITTEE
The Assessment Committee is established under the Decision ............ dated ............ by the Rector of the University of Information Technology.
- Chairman
- Secretary
- Member
First of all, I would like to express my appreciation to Dr Cao Thi Nhan for being my thesis advisor, not only during the time I worked on this graduation thesis but also as my consultant since I joined the University of Information Technology. With patience, motivation, and immense knowledge, she helped me keep track of the direction of the research and gave me a great deal of advice to complete the thesis.
I also express my sincere thanks to Dr Do Trong Hop for the very careful review of my thesis, and for all the insightful comments, suggestions, and corrections.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
ABSTRACT
Chapter 1: INTRODUCTION
1.1 Background
1.2 Objective and scope
1.2.1 Objectives
1.2.2 Scope
Chapter 2: DATA SET AND METHODOLOGY
2.1 Related works
2.2 Data Collection and Data Set Generating
2.2.1 Data Collection
2.2.2 Data Description
2.2.3 Exploratory Data Analysis Result
2.2.4 Data pre-processing
2.3 Regression Model and Evaluation Metrics Used
2.3.1 Linear Regression Models - Stochastic Dual Coordinate Ascent Regression
2.3.2 Decision Tree - Fast Forest Regression
2.3.3 Gradient Boosting Regression - LightGBM and Fast Tree
2.3.4 Performance Metrics
Chapter 3: IMPLEMENTATION
3.4 Demo
3.4.1 Introduction to ...
3.4.2 Main functions
Chapter 4: CONCLUSIONS
4.1 Conclusions
4.2 Limitations and challenges
4.3 Future works
REFERENCES
LIST OF FIGURES
Figure 2-1 Data Preparation Process
Figure 2-2 Process Flow of Prediction Model [12]
Figure 2-3 Website UI [3]
Figure 2-4 A raw data set example [3]
Figure 2-5 Processed Data Set
Figure 2-6 SellPrice Scatterplot in HCMC
Figure 2-7 SellPrice Distribution of HCMC
Figure 2-8 Average price for types of real estate
Figure 2-9 Relationship between SellPrice and Legal Document
Figure 2-10 Average Price in each District
Figure 3-1 Options to Add ML Model
Figure 3-2 Scenarios Choosing UI
Figure 3-3 Information of training environment
Figure 3-4 Data Preview
Figure 3-5 Data types settings of the variables
Figure 3-6 Set time for training data
Figure 3-7 Recommended time for training data
Figure 3-8 1st experiment's R-squared
Figure 3-9 1st experiment sample no.1
Figure 3-10 Actual price of alike property of 1st experiment sample no.1 [13]
Figure 3-11 Housing prices Reference of Lac Long Quan Street from Mogi.vn
Figure 3-12 1st experiment sample no.2
Figure 3-13 Actual alike property of 1st experiment sample no.2 [13]
Figure 3-14 2nd experiment's R-squared
Figure 3-15 2nd experiment sample
Figure 3-16 Actual alike property of 2nd experiment sample [13]
Figure 3-17 Housing prices Reference of Linh Dong from Mogi.vn [11]
Figure 3-18 3rd experiment's R-squared
Figure 3-19 3rd experiment sample no.1
Figure 3-20 Actual alike property of 3rd experiment sample no.1 [13]
Figure 3-21 Housing prices Reference of My Hue Street from Mogi.vn [11]
Figure 3-22 3rd experiment sample no.2
Figure 3-23 UI of the website
Figure 3-24 Filters available
Figure 3-25 UI when choosing a property
Figure 3-26 Displayed results
Figure 3-27 Actual average price of the property on location
LIST OF TABLES
Table 2-1 Raw Data Summary
Table 2-2 Data Description
Table 2-3 Processed Data Summary
Table 3-1 1st experiment dataset
Table 3-2 2nd experiment dataset
Table 3-3 3rd experiment dataset
Table 3-4 1st Experimental results
Table 3-5 Testing examples of 1st experiment
Table 3-6 2nd Experimental Metrics
Table 3-7 Testing Examples of 2nd experiment
Table 3-8 3rd Experimental results
Table 3-9 Testing Examples of 3rd experiment
LIST OF ABBREVIATIONS
EDA Exploratory Data Analysis
LASSO Least absolute shrinkage and selection operator
VAR Vector autoregressive
ADL Autoregressive distributed lag
XGBoost Extreme Gradient Boosting
SVR Support Vector Regression
SGD Stochastic Gradient Descent
GBR Gradient Boosting Regression
SDCA Stochastic Dual Coordinate Ascent
HCMC Ho Chi Minh City
MAE Mean absolute error
MSE Mean squared error
ABSTRACT
Different models used in house price forecasting are tested for their prediction accuracy, using detailed house price data for Ho Chi Minh City, Viet Nam, from the third and fourth quarters of 2021. Several regression techniques, namely the Stochastic Dual Coordinate Ascent (SDCA) method, Fast Forest for Decision Tree models, and the Gradient Boosting algorithms Fast Tree Regression and Light Gradient Boosting Machine, are selected to forecast house prices. These models are used to build a predictive model, and the best-performing one is picked through a comparative analysis of the prediction errors obtained by the models. The data set used for this report was downloaded from the website www.laydulieu.com. It consists of nearly 2,000 observations and 18 variables. The target variable of the data set is Price. The results of the experiments are illustrated through a website built with .NET Core and Angular to demonstrate the performance of the algorithms used.
KEYWORDS: House Price Prediction, Linear Regression, Gradient Boosting Regression, Decision Tree
Chapter 1: INTRODUCTION
1.1 Background
Vietnam is developing into a rapidly growing and prosperous real estate market in Southeast Asia. It is considered one of the hotspots among the most developed real estate markets in Asia, with a growing economy and laws that have made it easier for foreigners to buy property. As of 2017, the increase in the number of investors, including national and international players in the respective sub-markets, has led to the development of new housing companies, green buildings, and so on. Several mega-projects in major cities are fundamental to the growth of the residential real estate market, in both the basic and the luxury segments.
As the number of people who want to own a house or land has increased significantly, the real estate market in Viet Nam has become more and more appealing to investors, while prices fluctuate continuously. This leaves customers unsure whether they have purchased a property at a proper price, and it is inevitable that a large number of scammers take advantage of the situation. This is the main reason why predictive models are needed. In short, predictive modeling is an applied mathematics technique that uses machine learning and data processing to forecast possible future outcomes with the help of historical and existing information. It works by analyzing current and historical data and projecting what it learns onto a model generated to forecast likely outcomes. In this thesis, a forecast model is used because of its popularity in working with numerical values based on training data [1]. The data used in this thesis is available from the website laydulieu.com [3] by Nguyễn Đức Nam and mainly focuses on Ho Chi Minh City. In particular, 4 typical areas of the city are investigated: District 1, the heart of the city; Thu Duc District, which has newly emerged as the area with the most potential; Tan Binh District, where laborers from other provinces come to settle down; and Hoc Mon District, on the outskirts of the metropolis, where the infrastructure has not yet received much attention. The full list of data variables is given in Section 2.2.2.
There are various considerations influencing the price of properties. According to [6],[7], the price of real estate is influenced by several factors such as:
• Property-related factors
• Locational factors
• Environmental factors
The purpose of this thesis is first to examine the influence of various variables on real estate prices in Ho Chi Minh City by using EDA. Secondly, the main target is to study the regression algorithms, in both theory and operation, and to identify the best predictor through experiments. Three models based on Linear Regression, Decision Tree and Gradient Boosting respectively are proposed and demonstrated through a website.
1.2 Objective and scope
1.2.1 Objectives
• Understand how business data analysis and machine learning are implemented to produce results.
• Cover real estate in 4 typical areas of HCMC, which are:
o District 1: The downtown
o Hoc Mon District: The suburb
o Tan Binh District: The stable area
o Thu Duc District: The developing area
• The reasons why these 4 areas are selected are explained in Section 2.2.2 below.
1.2.2 Scope
Linear Regression algorithms and Decision Tree models are used to predict real estate prices. R-squared is the main metric used to evaluate their efficiency.
A demonstration website is built to illustrate the results of the experiments and the performance of the algorithms used.
Chapter 2: DATA SET AND METHODOLOGY
2.1 Related works
In recent decades, there has been growing demand for house price prediction services that help investors and settlers make the right decision. This section describes previous work by several researchers in the domain of housing price prediction. Their contributions are as follows:
In 2016, Martijn Duijster [4] used ARIMA, ADL and VAR models to forecast changes in the Dutch house price index. The experimental analysis was based on Netherlands house price data for the period 1995-2016. The paper shows that ADL had the best performance among the algorithms used and claims that forecasts 1 to 6 quarters ahead are feasible. Its major limitations are the outdated data and the broad spectrum it covers.
In 2019, Nebojša Dubošanini, Jan Eric Bühlmann and Pauline Offeringa [2] applied Linear Regression, RIDGE Regression, LASSO Regression and Random Forest to predict prices on a data set of Melbourne, Australia. This project indicated that the Decision Tree model can be simple and still have the best performance compared to RIDGE and LASSO Regression. Moreover, this model also provides an explicit view of the scheme and of how the target variable was computed. However, its drawbacks are the inability to predict above a threshold and the unspecified time period of the data, so it would be difficult to get a precise forecast given the fluctuation of prices in this area.
In 2020, Yichen Zhou [8] put Linear Regression, LASSO Regression, Random Forest and XGBoost into practice to predict house prices in Ames, Iowa, USA, with 79-variable data covering the 5-year period from 2005 to 2010. The experimental results showed that the values predicted by the XGBoost model were highly accurate in comparison with the actual figures, at more than 94%. Nevertheless, with such a large number of variables, the EDA for feature extraction would take a great deal of time, and the recency of the data is also a concern.
Even in 2021, LASSOLARS Regression, Bayesian Ridge Regression, SVR, SGD and GBR are still used for price prediction on a house data set of Islamabad, the capital of Pakistan [9], by Imran, Umar Zaman, Muhammad Waqar and Atif Zaman. The results show that SVR performs better than the rest of the machine learning algorithms.
From the information collected, many studies use methods such as ARIMA, ADL and XGBoost alongside Linear Regression, and these remain applicable to prediction models as of 2021. In this thesis, however, I use upgraded versions of Linear Regression, the Decision Tree model and Gradient Boosting Regression, namely Stochastic Dual Coordinate Ascent (SDCA), Fast Forest Regression, and Light Gradient Boosting Machine and Fast Tree Tweedie Regression respectively, for faster training, lower memory usage and minimization of convex loss functions, combined with a visual demonstration on a website running at localhost. Additionally, the data set is downloaded from laydulieu.com because it is updated weekly. There are 2 basic processes related to this problem:
• Data preparation: Raw data is not suitable for direct use, so it needs to be pre-processed before it can be used to fit the parameters of the models. Redundant attributes must be removed and the remaining ones converted into appropriate data types. Figure 2-1 shows the data preparation process after the data is downloaded.
• Building models: Once the data is ready, choosing a feasible model should not be overlooked. Multiple algorithms are trained on the data and compared to find the best-performing one for making predictions. Figure 2-2 shows the fundamental process for a prediction model.
Figure 2-2 Process Flow of Prediction Model [12]
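The model-building step can be illustrated with a minimal sketch: several candidate predictors are fitted on a training split and compared by R-squared on a held-out split. The two candidates (a mean baseline and a one-variable least-squares line), the toy area and price numbers, and all function names are assumptions made for illustration; they are not the models or the data used in this thesis.

```python
from statistics import mean


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def fit_mean_baseline(xs, ys):
    """Baseline model: always predict the training mean."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """One-variable least-squares line y = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    return lambda x: a + b * x


def pick_best(candidates, train, test):
    """Fit every candidate on train, score it on test, return (best name, all scores)."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = r_squared(test[1], [model(x) for x in test[0]])
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: area in m2 -> price in billion VND.
train = ([30, 45, 60, 80, 100], [2.1, 3.0, 4.2, 5.4, 6.9])
test = ([50, 70, 90], [3.4, 4.8, 6.1])
candidates = {"mean_baseline": fit_mean_baseline, "linear": fit_line}
best_name, scores = pick_best(candidates, train, test)
print(best_name, scores)
```

Scoring on data that was not used for fitting is what makes the comparison meaningful: a model that only memorizes the training records would not be rewarded.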
2.2 Data Collection and Data Set Generating
2.2.1 Data Collection
This thesis concentrates mainly on the Vietnamese real estate market, HCMC in particular. Although there are plenty of websites where users can look up information, retrieving their data is difficult. The data was manually downloaded from laydulieu.com [3], a website specializing in collecting data and statistics in the fields of real estate, cars, electronics, home appliances, jobs, finance and fruits. Figure 2-3 shows the UI of the website.
Figure 2-3 Website UI [3]
Because the time to finish the thesis is limited, it is not feasible to cover all of the districts in HCMC, so I took only 4 typical ones to examine the fitness of the models used. The reasons for selecting these regions are given in Section 2.2.2.
Each downloaded data set is exported into an Excel (.xls) file containing only 50 records, which means that collecting about 2,000 records required at least 40 downloads. In every downloaded file, the raw data contained 18 variables with a lot of information, several missing records, and false information such as an area of 18,000 m² in District 1. First, I gathered all of the data sets into one and sorted the records alphabetically by district. Then all of the redundant records were removed to clean the data. In the last step, some variables were renamed and some were added to better fit the models, and the file was converted into a csv file suitable for training; a thorough explanation is available in Section 2.2.3. Figure 2-4 shows the raw data as downloaded and Figure 2-5 shows the data after processing.
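The merging and cleaning described above could look roughly like the following sketch. It assumes the exports have already been saved as CSV text (the site itself exports .xls), and the column names, the sample records and the 10,000 m² sanity threshold are hypothetical values chosen for illustration.

```python
import csv
import io

# Two hypothetical exports, already saved as CSV text (the real site exports .xls).
EXPORT_1 = """District,Area,Price
Quan 1,85,12.5
Tan Binh,60,6.2
Quan 1,18000,9.0
"""
EXPORT_2 = """District,Area,Price
Hoc Mon,120,2.8
Tan Binh,60,6.2
Thu Duc,,4.1
"""

MAX_AREA_M2 = 10_000  # assumed sanity threshold, not a value from the thesis


def load_rows(text):
    """Parse one export into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def is_valid(row):
    """Reject rows with a missing area or an implausibly large one."""
    if not row["Area"]:
        return False
    return float(row["Area"]) <= MAX_AREA_M2


def merge_and_clean(exports):
    """Concatenate exports, drop invalid rows and exact duplicates, sort by district."""
    seen, cleaned = set(), []
    for text in exports:
        for row in load_rows(text):
            key = tuple(row.items())
            if is_valid(row) and key not in seen:
                seen.add(key)
                cleaned.append(row)
    return sorted(cleaned, key=lambda r: r["District"])


rows = merge_and_clean([EXPORT_1, EXPORT_2])
for row in rows:
    print(row)
```

In this toy input, the 18,000 m² record, the record with no area and the repeated Tan Binh record are all dropped before sorting.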
Figure 2-4 A raw data set example [3]
Figure 2-5 Processed Data Set
2.2.2 Data Description
The data set covers 4 different districts of Ho Chi Minh City, namely:
• District 1: District 1 is the central district of Ho Chi Minh City. House prices in District 1 are always attractive to investors because the district is home to many government agencies, the consulates of many countries and long-standing historical sites. The district has many commercial centers, buildings for rent, office buildings, etc. It is also a gathering place for many companies, from Vietnamese companies to foreign-invested ones. The district has arterial and famous roads such as Nguyen Hue pedestrian street and Nguyen Van Binh book street. In addition, it is close to the canal system, with easy access to the Mekong Delta provinces. Another plus point is that the Metro system is in its finishing stage. Therefore, housing prices in District 1 always increase strongly. The district also has many quality schools, from public schools to private schools with international standards, as well as many large hospitals, spas, clinics, beauty salons, etc., bringing many advantages to the people living here.
• Thu Duc District: Since the Prime Minister approved the establishment of Thu Duc City, which comprises District 2, District 9 and Thu Duc District, real estate prices in this area have rocketed, and most opinions are that the area will keep growing strongly. In addition, the prices of apartments and townhouses in Thu Duc, District 2 and District 9 are becoming more and more attractive in the eyes of investors. The completion and operation of the new Mien Dong bus station and the construction project of Long Thanh airport, which is in the process of being implemented, also make investors more interested in land prices in Thu Duc District. For those who intend to buy a house or land as a place to live and settle down with a small amount of capital, Thu Duc housing prices are worth considering. The current Thu Duc land price has increased, but it is still at a level that can be considered a reasonable investment.
• Tan Binh District: Tan Binh is an inner-city district located in the northwest of Ho Chi Minh City. This is where many large companies, factories and industrial parks gather. Many laborers come to the district to live and do business, so many people need to search for real estate in Tan Binh. The district has major roads of the city such as Nguyen Van Troi, Cong Hoa, Hoang Sa, Truong Sa and Hoang Van Thu streets. Tan Son Nhat airport is also in Tan Binh District. With these advantages in traffic and transportation, housing prices in Tan Binh District are always at a high level. In addition, the district has many schools, shopping malls and large hospital systems, so its amenities are considered adequate. As a result, house prices in Tan Binh District are constantly fluctuating.
• Hoc Mon District: Hoc Mon District is a suburban district in the northwest of Ho Chi Minh City. Although the Hoc Mon housing market is attracting a lot of attention, housing prices in the district are still cheap. Transport infrastructure is planned synchronously. Because the district is being developed in later years, its infrastructure is well planned. The planning here is long-term and strategic, with synchronized urban construction. Hoc Mon real estate has potential for future development. As prices in the central area increase and the land fund there dwindles, housing in the Hoc Mon area has received more attention.
This data set consists of about 2,000 records with 18 variables. It was downloaded from laydulieu.com, and its reliability is assured on the basis of Decision No 02/2020/QD-UBND, which promulgates regulations on the land price list in Ho Chi Minh City for the period 2020-2024 and was issued by the People's Committee of Ho Chi Minh City on January 16, 2020 [7]. Moreover, Mogi.vn [11] is the representative website of Dinh Anh Joint Stock Company, an online site specializing in real estate listings. It provides information on current real estate accompanied by market data, area information and a tool to calculate the cost as well as the installment period of each property. It applies artificial intelligence (AI) technology, through a survey of over 2,000,000 real estate listings on Mogi.vn and Muaban.net, combined with a specialized calculation engine. Mogi updates the monthly housing price list of each area to help users keep up to date with prices. This is a continuously updated data set with the latest information on the properties, so it can keep the predictors up to date with current prices and give satisfying results. Table 2-1 summarizes the raw data by district, and a detailed data description is shown in Table 2-2.
Table 2-1 Raw Data Summary
District          Record quantity   Attribute
Quận 1            712               18
Huyện Hóc Môn     553               18
Thủ Đức           589               18
Tân Bình          360               18
Total             2,214             18
Table 2-2 Data Description
Attribute     Description                        Data Type
STT           The order number of the property   Numeric
Chuyên mục    The form of the ...                String
Hướng         The direction of the ...           String
2.2.3 Exploratory Data Analysis Result
Figure 2-6 SellPrice Scatterplot in HCMC
Figure 2-6 gives the scatter plot of the sell price. It can be clearly seen that most of the points are clustered near the bottom.
Figure 2-7 SellPrice Distribution of HCMC
The graph above shows that the price distribution is right-skewed: most properties are priced under 10 billion VND, which is reasonable because few people can afford the exorbitant ones (greater than 10 billion VND). From Figure 2-6 and Figure 2-7, most real estate prices in Ho Chi Minh City range from 1 to 10 billion VND, accounting for more than half of the data records, while only a small number of properties are appraised at a high price. This means that affordable real estate is still the most popular even in one of the metropolises of Viet Nam.
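The two observations above, right skew and the share of listings between 1 and 10 billion VND, can be checked numerically. The sketch below is only an illustration: the price list is hypothetical rather than taken from the thesis data, and the function names are made up for this example.

```python
from statistics import mean, median

# Hypothetical sell prices in billion VND (not the thesis data).
prices = [1.2, 1.8, 2.5, 3.0, 3.6, 4.2, 5.5, 7.0, 9.5, 14.0, 32.0, 60.0]


def sample_skewness(values):
    """Fisher-Pearson moment coefficient of skewness: m3 / m2 ** 1.5."""
    avg = mean(values)
    n = len(values)
    m2 = sum((v - avg) ** 2 for v in values) / n
    m3 = sum((v - avg) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def share_in_range(values, low, high):
    """Fraction of values with low <= value <= high."""
    return sum(low <= v <= high for v in values) / len(values)


skew = sample_skewness(prices)
share = share_in_range(prices, 1.0, 10.0)
print(f"skewness={skew:.2f}, mean={mean(prices):.2f}, median={median(prices):.2f}")
print(f"share of listings priced 1-10 billion VND: {share:.0%}")
```

A positive skewness together with a mean well above the median is the usual numerical signature of a right-skewed distribution, in which a few very expensive properties pull the average upward.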
It is crucial to identify the variables which have a strong correlation with the target attribute (Price). According to [5,6], the factors influencing real estate prices can be classified into 3 categories: Property-Related, Location, and Environment.
2.2.3.1 Property-related Factors
There are 11 variables related to this section, which are: Chuyên mục, Nhu cầu, Người đăng, Điện thoại, Ngày đăng, Diện tích, Hướng, Số tầng, Số phòng, Nhà vệ sinh and Giấy tờ pháp lý.
Figure 2-8 Average price for types of real estate
Firstly, buyers tend to give serious consideration to the form and the size of a property according to their needs [6]. Figure 2-8 shows that the average price of “Nhà” (house) is the highest (13.31 billion VND), so the form of property can seriously affect the price. Particularly in Viet Nam, the structure matters to someone looking for a house, because some buyers will move in immediately while others will partly rebuild the house sooner or later.
In addition, legality is also an important factor affecting a property's value: if there is no problem with the documents, such as mutual ownership, the procedure for transferring the owner's name will be much simpler. Figure 2-9 gives a clearer visualization.
Figure 2-9 Relationship between SellPrice and Legal Document
So, it can be derived that the “Price” variable correlates most with “Chuyên mục”, “Diện tích”, “Hướng”, “Số tầng”, “Số phòng”, “Nhà vệ sinh” and “Giấy tờ pháp lý”. These variables will be renamed as follows:
• “Type”: This represents the forms of property available, such as “Đất” (land), “Nhà” (house) or “Căn hộ, Chung cư” (apartment). The data type of this attribute is “String”.
• “Area”: This represents the size of the property, with the “Numeric” data type.
• “Direction”: This represents the direction of the property. There are 9 possible values: the 8 directions given in the data set (“S”, “W”, “E”, “N”, “SW”, “NW”, “SE”, “NE”) and “blank” for missing data.
• “No Room”: This shows the number of rooms of the house or apartment.
• “No Floor”: This shows the number of floors of the house.
• “Legal Document”: This tells whether the real estate has the required documents or not, so the data type is “Boolean”, with “1” for “true” and “0” for “false” to simplify the process.
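The renaming and encoding of the property-related variables can be sketched as follows. The mapping follows the attribute list above, while the rule that any non-empty legal-document text counts as “true”, the sample record and the function name `encode_row` are assumptions made for illustration only.

```python
# Mapping from the original Vietnamese column names to the renamed attributes.
RENAME = {
    "Chuyên mục": "Type",
    "Diện tích": "Area",
    "Hướng": "Direction",
    "Số phòng": "No Room",
    "Số tầng": "No Floor",
    "Giấy tờ pháp lý": "Legal Document",
}
DIRECTIONS = {"S", "W", "E", "N", "SW", "NW", "SE", "NE"}


def encode_row(raw):
    """Keep and rename the selected columns, then normalise Direction and Legal Document."""
    row = {new: raw.get(old, "").strip() for old, new in RENAME.items()}
    # Missing or unknown directions are stored as the ninth value, "blank".
    if row["Direction"] not in DIRECTIONS:
        row["Direction"] = "blank"
    # Assumption: any non-empty legal-document text means the papers exist.
    row["Legal Document"] = 1 if row["Legal Document"] else 0
    row["Area"] = float(row["Area"])
    return row


raw_example = {  # hypothetical raw record
    "Chuyên mục": "Nhà",
    "Diện tích": "64",
    "Hướng": "",
    "Số phòng": "3",
    "Số tầng": "2",
    "Giấy tờ pháp lý": "Sổ hồng",
    "Điện thoại": "0900000000",
}
encoded = encode_row(raw_example)
print(encoded)
```

Columns that are not in the mapping, such as the phone number, are dropped simply because they are never copied into the new record.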
2.2.3.2 Locational Factors
Location is considered to be the most significant feature in determining house prices. After examining the data set, it is observed that “Tỉnh/Thành phố”, “Quận/Huyện”, “Phường/Xã” and “Đường, khu vực” are the variables representing the administrative regions.
Figure 2-10 Average Price in each District
From Figure 2-10, it is obvious that the closer the location is to the center, the higher the price. The “Tỉnh/Thành phố” variable is eliminated because all of the regions are in HCMC, while all of the remaining variables are strongly correlated with “Price”.
“Quận/Huyện” gives the district of the property, whether downtown or suburban, which identifies the general price level. “Phường/Xã” gives the ward of the property, and there are various differences between wards. “Đường, khu vực” gives the specific location of the property in the format “Street's name, Ward, District, City” and helps the model fit better because there are significant differences between streets or wards within one district [7], especially in HCMC. Thus, these variables are kept while the others are removed. “Quận/Huyện”, “Phường/Xã” and “Đường, khu vực” are renamed “District”, “Ward” and “Address” respectively, with the data type “String”.
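Given the “Street's name, Ward, District, City” format described above, an address string can be split into its parts with a few lines of code. This is a sketch under the assumption that the four parts are always separated by commas; the sample address and the function name are hypothetical.

```python
def split_address(address):
    """Split 'Street, Ward, District, City' into its four parts."""
    parts = [p.strip() for p in address.split(",")]
    if len(parts) != 4:
        raise ValueError(f"unexpected address format: {address!r}")
    street, ward, district, city = parts
    return {"Street": street, "Ward": ward, "District": district, "City": city}


# Hypothetical address following the described format.
fields = split_address("Lac Long Quan, Phuong 10, Tan Binh, Ho Chi Minh")
print(fields)
```

Raising an error on an unexpected number of parts makes malformed records visible during cleaning instead of silently shifting values between columns.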
2.2.3.3 Environment Factors
“Mô tả” (description) is the variable that contains all of the information about the surroundings of a property, such as the neighborhood, transport or amenities, and this can play a part in deciding the price. Efficiency of public education, community social status and proximity to shopping malls typically improve the worth of a property [6]. Since “Mô tả” is a long character string that is difficult for the model to handle, it is separated into 4 primary criteria, namely “Townhouse”, “Amenity”, “Centrality” and “Transportation”:
• “Townhouse”: There are myriad alleys in Viet Nam, especially in HCMC, so if a property is located on a road that is convenient for business, its price will be higher than that of properties in small, rough alleys.
• “Amenity”: As mentioned above, proximity to educational institutions or shopping malls can be beneficial.
• “Centrality”: Owning a property in the center of a developed city like Ho Chi Minh City is a fortune because there are many ways to profit from it.
• “Transportation”: Many buyers opt for a tranquil neighborhood that is still not too inconvenient for commuting, especially those who use cars. So, an alley that is wide enough for cars to turn around freely is worth mentioning.
The data type of all 4 new attributes is “Boolean”, the same as “LegalDocument”.
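One possible way to turn the free-text description into the 4 Boolean attributes is keyword matching, sketched below. The thesis does not state how the flags were derived, so the keyword lists, the accent-stripping approach and the sample description are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical keyword lists; the thesis does not state how the flags were derived.
KEYWORDS = {
    "Townhouse": ["mat tien", "mat pho"],
    "Amenity": ["truong hoc", "sieu thi", "benh vien"],
    "Centrality": ["trung tam"],
    "Transportation": ["hem xe hoi", "xe hoi"],
}


def strip_accents(text):
    """Lower-case and remove Vietnamese diacritics so matching is accent-insensitive."""
    text = text.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def description_flags(description):
    """Return a 0/1 flag per criterion depending on whether any keyword occurs."""
    plain = strip_accents(description)
    return {
        name: int(any(word in plain for word in words))
        for name, words in KEYWORDS.items()
    }


# Hypothetical listing description.
flags = description_flags("Bán nhà hẻm xe hơi, gần trường học và siêu thị")
print(flags)
```

Keyword matching is crude, since it cannot tell "near a school" from "far from a school", so flags produced this way would still need manual spot checks.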
2.2.4 Data pre-processing
Steps to pre-process data:
• Step 1: Research the influential attributes through the Government Decrees to identify the features of the data set.
• Step 2: Eliminate unnecessary fields in the data set, such as “Tiêu đề”, “Nhu cầu” or “Số điện thoại”, to avoid redundancy, and rename several fields, such as “Mô tả” or “Giấy tờ pháp lý”, to fit the models used in training.
• Step 3: Give the remaining fields standard formats with no NULL values, to avoid data types unsupported by the algorithms. For example, Legal Document should be of Boolean type, and the value of Price should be converted into billions.
• Step 4: Convert the data set into a csv file. Table 2-3 shows the summary of the data after processing.
Table 2-3 Processed Data Summary
District Record quantity Attribute
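Steps 3 and 4 can be sketched as below: prices are converted to billions of VND, records with missing values are skipped, and the result is written as CSV. The raw price format (“6.2 tỷ”, “850 triệu”), the sample records and the function names are assumptions; the thesis does not show the raw format of the price field.

```python
import csv
import io

UNIT_TO_BILLION = {"tỷ": 1.0, "triệu": 0.001}  # billion and million VND


def price_to_billion(text):
    """Convert strings such as '3.5 tỷ' or '850 triệu' to billions of VND."""
    if not text or not text.strip():
        return None
    amount, unit = text.strip().split()
    return float(amount.replace(",", ".")) * UNIT_TO_BILLION[unit]


def to_csv(rows, fieldnames):
    """Write the rows that have no missing values to CSV text and return it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        if all(row[name] is not None for name in fieldnames):
            writer.writerow(row)
    return buffer.getvalue()


raw = [  # hypothetical records
    {"District": "Tan Binh", "Price": "6.2 tỷ"},
    {"District": "Hoc Mon", "Price": "850 triệu"},
    {"District": "Thu Duc", "Price": ""},
]
processed = [
    {"District": r["District"], "Price": price_to_billion(r["Price"])} for r in raw
]
csv_text = to_csv(processed, ["District", "Price"])
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same writer would target a file opened with `newline=""`.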
2.3 Regression Model and Evaluation Metrics Used
2.3.1 Linear Regression Models - Stochastic Dual Coordinate Ascent Regression
The Linear Regression model is a common machine learning algorithm that models the relationship between two or more variables by fitting a linear equation to observed data, in which one variable is considered to be an explanatory variable and the other a dependent variable. This gives the model the ability to predict outputs for inputs it has never seen before. A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept.
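For reference, the intercept and slope are usually estimated by ordinary least squares. For n observed pairs (x_i, y_i), with sample means x̄ and ȳ, the standard closed-form estimates are given below; the symbols n, x_i, y_i, x̄ and ȳ are notation introduced here and do not appear in the original text.

```latex
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```

These values minimize the sum of squared differences between the observed Y and the fitted line a + bX.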
SDCA [10] has recently been considered a state-of-the-art primal-dual optimization algorithm for large-scale machine learning problems. It processes the training examples sequentially in random order and performs iterative coordinate updates to maximize the dual objective.
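To make the coordinate updates concrete, the following is a minimal sketch of SDCA for the special case of squared loss with L2 regularization (ridge regression), where each update has a closed form. It is a textbook-style illustration, not the trainer used in this thesis; the function name, the regularization value and the toy data are assumptions.

```python
import random


def sdca_ridge(xs, ys, lam=0.1, epochs=500, seed=0):
    """SDCA for min_w (1/n) * sum(0.5 * (w.x_i - y_i)**2) + (lam / 2) * |w|**2."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    alpha = [0.0] * n  # one dual variable per training example
    w = [0.0] * d      # primal weights, kept equal to sum(alpha_i * x_i) / (lam * n)
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in random order
        for i in order:
            x, y = xs[i], ys[i]
            pred = sum(wj * xj for wj, xj in zip(w, x))
            sq_norm = sum(xj * xj for xj in x)
            # Closed-form maximiser of the dual objective along coordinate i.
            delta = (y - pred - alpha[i]) / (1.0 + sq_norm / (lam * n))
            alpha[i] += delta
            for j in range(d):
                w[j] += delta * x[j] / (lam * n)
    return w


# Hypothetical one-feature data without intercept, generated from y = 2x.
xs = [[0.25], [0.5], [0.75], [1.0]]
ys = [0.5, 1.0, 1.5, 2.0]
w = sdca_ridge(xs, ys)
# Closed-form ridge solution for one feature, used as a check.
w_exact = sum(x[0] * y for x, y in zip(xs, ys)) / (
    sum(x[0] ** 2 for x in xs) + 0.1 * len(xs)
)
print(w[0], w_exact)
```

Each step touches a single example and needs no step-size tuning, which is part of what makes the method attractive for large data sets.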
2.3.2 Decision Tree - Fast Forest Regression
A Decision Tree builds regression models in the form of a tree structure. It breaks the data set down into smaller and smaller subsets, which are represented as nodes. A completed tree has decision nodes and leaf nodes. A decision node has two or more branches, each representing values of the attribute tested. A leaf node represents a decision on the numerical target.
an ensemble of decision trees Each tree in a decision forest generates a
Gaussian distribution as a prediction Aggregation is performed over the
set of trees to find a Gaussian distribution that most closely approximatesthe combined distribution for all trees in the model
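One standard way to approximate a combination of Gaussians with a single Gaussian is moment matching: the equal-weight mixture of the per-tree distributions is replaced by the Gaussian with the same mean and variance. The sketch below assumes this approach and equal tree weights; it is not taken from the Fast Forest implementation, and the per-tree numbers are hypothetical.

```python
from statistics import mean


def combine_gaussians(tree_predictions):
    """Moment-match one Gaussian to an equal-weight mixture of per-tree Gaussians.

    tree_predictions: list of (mean, variance) pairs, one per tree.
    """
    mu = mean(m for m, _ in tree_predictions)
    second_moment = mean(v + m * m for m, v in tree_predictions)
    return mu, second_moment - mu * mu


# Hypothetical per-tree predictions for one property, in billion VND.
trees = [(4.0, 0.25), (5.0, 0.25), (6.0, 0.25)]
mu, var = combine_gaussians(trees)
print(mu, var)
```

The combined variance contains both the average per-tree variance and the spread of the tree means, so disagreement between trees widens the final distribution.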