
Graduation Thesis in Information Systems: Applying Prediction Models to Forecast the Real Estate Price in Ho Chi Minh City


DOCUMENT INFORMATION

Title: Applying Prediction Models to Forecast Real Estate Prices in Ho Chi Minh City
Author: Lam Ha Tuan Canh
Advisor: Dr. Cao Thi Nhan
University: University of Information Technology
Major: Information Systems
Type: Graduation Thesis
Year of publication: 2021
City: Ho Chi Minh City
Pages: 67


VIET NAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

ADVANCED PROGRAM IN INFORMATION SYSTEMS

LAM HA TUAN CANH

GRADUATION THESIS

APPLYING PREDICTION MODELS TO

FORECAST REAL ESTATE PRICES

IN HO CHI MINH CITY

BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS

HO CHI MINH CITY, 2021

LAM HA TUAN CANH - 15520056

GRADUATION THESIS

APPLYING PREDICTION MODELS TO

FORECAST REAL ESTATE PRICES

IN HO CHI MINH CITY

BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS

THESIS ADVISOR

Dr. CAO THI NHAN

ASSESSMENT COMMITTEE

The Assessment Committee is established under Decision ......, dated ......, by the Rector of the University of Information Technology.

...... - Chairman

...... - Secretary

...... - Member

First of all, I would like to express my appreciation to Dr. Cao Thi Nhan for being my thesis advisor, and for having been my academic consultant ever since I joined the University of Information Technology. With patience, motivation, and immense knowledge, she helped me keep the research on track and gave me a great deal of advice for completing the thesis.

I also express my sincere thanks to Dr. Do Trong Hop for the very careful review of my thesis, and for all the insightful comments, suggestions, and corrections.

TABLE OF CONTENTS

TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
ABSTRACT
Chapter 1: INTRODUCTION
1.1 Background
1.2 Objective and scope
1.2.1 Objectives
1.2.2 Scope
Chapter 2: DATA SET AND METHODOLOGY
2.1 Related works
2.2 Data Collection and Data Set Generating
2.2.1 Data Collection
2.2.2 Data Description
2.2.3 Exploratory Data Analysis Result
2.2.4 Data pre-processing
2.3 Regression Model and Evaluation Metrics Used
2.3.1 Linear Regression Models - Stochastic Dual Coordinate Ascent Regression
2.3.2 Decision Tree - Fast Forest Regression
2.3.3 Gradient Boosting Regression - LightGBM and Fast Tree
2.3.4 Performance Metrics
Chapter 3: IMPLEMENTATION
3.4 Demo
3.4.1 Introduction to ...
3.4.2 Main functions
Chapter 4: CONCLUSIONS
4.1 Conclusions
4.2 Limitations and challenges
4.3 Future works
REFERENCES

LIST OF FIGURES

Figure 2-1 Data Preparation Process
Figure 2-2 Process Flow of Prediction Model [12]
Figure 2-3 Website UI [3]
Figure 2-4 A raw data set example [3]
Figure 2-5 Processed Data Set
Figure 2-6 SellPrice Scatterplot in HCMC
Figure 2-7 SellPrice Distribution of HCMC
Figure 2-8 Average price for types of real estate
Figure 2-9 Relationship between SellPrice and Legal Document
Figure 2-10 Average Price in each District
Figure 3-1 Options to Add ML Model
Figure 3-2 Scenarios Choosing UI
Figure 3-3 Information of training environment
Figure 3-4 Data Preview
Figure 3-5 Data types settings of the variables
Figure 3-6 Set time for training data
Figure 3-7 Recommended time for training data
Figure 3-8 1st experiment's R-squared
Figure 3-9 1st experiment sample no.1
Figure 3-10 Actual price of alike property of 1st experiment sample no.1 [13]
Figure 3-11 Housing prices Reference of Lac Long Quan Street from Mogi.vn
Figure 3-12 1st experiment sample no.2
Figure 3-13 Actual alike property of 1st experiment sample no.2 [13]
Figure 3-14 2nd experiment's R-squared
Figure 3-15 2nd experiment sample
Figure 3-16 Actual alike property of 2nd experiment sample [13]
Figure 3-17 Housing prices Reference of Linh Dong from Mogi.vn [11]
Figure 3-18 3rd experiment's R-squared
Figure 3-19 3rd experiment sample no.1
Figure 3-20 Actual alike property of 3rd experiment sample no.1 [13]
Figure 3-21 Housing prices Reference of My Hue Street from Mogi.vn [11]
Figure 3-22 3rd experiment sample no.2
Figure 3-23 UI of the website
Figure 3-24 Filters available
Figure 3-25 UI when choosing a property
Figure 3-26 Displayed results
Figure 3-27 Actual average price of the property on location

LIST OF TABLES

Table 2-1 Raw Data Summary
Table 2-2 Data Description
Table 2-3 Processed Data Summary
Table 3-1 1st experiment dataset
Table 3-2 2nd experiment dataset
Table 3-3 3rd experiment dataset
Table 3-4 1st Experimental results
Table 3-5 Testing examples of 1st experiment
Table 3-6 2nd Experimental Metrics
Table 3-7 Testing Examples of 2nd experiment
Table 3-8 3rd Experimental results
Table 3-9 Testing Examples of 3rd experiment

LIST OF ABBREVIATIONS

EDA Exploratory Data Analysis

LASSO Least absolute shrinkage and selection operator

VAR Vector autoregressive

ADL Autoregressive distributed lag

XGBoost Extreme Gradient Boosting

SVR Support Vector Regression

SGD Stochastic Gradient Descent

GBR Gradient Boosting Regression

SDCA Stochastic Dual Coordinate Ascent

HCMC Ho Chi Minh City

MAE Mean absolute error

MSE Mean squared error

ABSTRACT

Different models used in house price forecasting are tested for their prediction accuracy, using detailed house price data for Ho Chi Minh City, Viet Nam, from the third and fourth quarters of 2021. Several regression techniques are selected to forecast house price changes: the Stochastic Dual Coordinate Ascent (SDCA) method, Fast Forest for Decision Tree models, and two Gradient Boosting algorithms, namely Fast Tree Regression and Light Gradient Boosting Machine. These techniques are used to build predictive models, and the best-performing model is picked through a comparative analysis of the prediction errors obtained by these models. The data set used for this report was downloaded from the website www.laydulieu.com. It consists of nearly 2,000 observations and 18 variables, and the target variable is Price. The results of the experiments are illustrated through a website built with .NET Core and Angular, which demonstrates the performance of the algorithms used.

KEYWORDS: House Price Prediction, Linear Regression, Gradient Boosting Regression, Decision Tree

Chapter 1: INTRODUCTION

1.1 Background

Vietnam is developing into a rapidly growing and prosperous real estate market in Southeast Asia. It is considered one of the hotspots among the developed real estate markets in Asia: the economy is growing, and some laws have made it easier for foreigners to buy property. As of 2017, the growing number of investors, including national and international players in the respective sub-markets, has led to the development of new housing companies, green buildings, and so on. Several mega-projects in major cities are fundamental to the growth of the residential real estate market, in both the basic and the luxury segment.

As the number of people who want to own a house or land has increased significantly, the real estate market in Viet Nam has become more and more appealing to investors, while prices fluctuate continuously. As a result, customers are often unsure whether they have purchased a property at a proper price, and it is inevitable that a massive number of scammers take advantage of this situation. This is the main reason why predictive models are needed. In short, predictive modeling is an applied mathematics technique that exploits machine learning and data processing to predict and forecast possible future outcomes with the help of historical and existing information. It works by analyzing current and historical data and projecting what it learns onto a model generated to forecast likely outcomes. In this thesis, the Forecast Model is used because of its popularity in working with numerical values based on training data [1]. The data used in this thesis is available from the website laydulieu.com under the link of [3] by Nguyễn Đức Nam and mainly focuses on Ho Chi Minh City. In particular, 4 typical areas of the city are investigated, namely District 1, the city's heart; Thu Duc District, which has newly emerged as the area with the most potential; Tan Binh District, where laborers from other provinces come and settle down; and Hoc Mon District, the outskirts of the metropolis, where the infrastructure is still not noteworthy. The full list of data variables is given in Section 2.2.2.

Various considerations influence the price of properties. According to [6], [7], the price of real estate is influenced by several kinds of factors:

• Property-related factors

• Locational factors

• Environmental factors

The purpose of this thesis is first to examine the influence of various variables on real estate prices in Ho Chi Minh City by using EDA. Secondly, the main target is to study the theory and operation of the regression algorithms and to identify the best predictor through experiments. Three models, based on Linear Regression, Decision Tree and Gradient Boosting respectively, are proposed and demonstrated through a website.

1.2 Objective and scope

1.2.1 Objectives

• Understand how business data analysis and machine learning are applied to produce results.

• Cover real estate in 4 typical areas of HCMC, which are:

o District 1: The downtown

o Hoc Mon District: The suburb

o Tan Binh District: The stable area

o Thu Duc District: The developing area

• The reasons why these 4 areas are selected are explained in Section 2.2.2 below.

1.2.2 Scope

Linear Regression algorithms and Decision Tree models are used to predict real estate prices. R-squared is the main metric used to evaluate their efficiency.

A demonstration website is built to illustrate the results of the experiments and the performance of the algorithms used.
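The R-squared metric named above can be computed directly from actual and predicted prices. The following is a minimal Python sketch, not the thesis implementation; the function name `r_squared` and the sample prices (in billion VND) are illustrative assumptions.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: spread of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices in billion VND, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
score = r_squared(actual, predicted)
```

A value close to 1 means that the predictions explain most of the variance in the actual prices.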


Chapter 2: DATA SET AND METHODOLOGY

2.1 Related works

In recent decades, there has been a growing demand for house price prediction services that help investors and settlers make the right decisions. This section describes previous work by several researchers in the domain of housing price prediction. Their contributions are as follows.

In 2016, Martijn Duijster [4] used ARIMA, ADL and VAR to forecast changes in the Dutch house price index. The experimental analysis was based on Netherlands house price data for the period from 1995 to 2016. The paper shows that ADL had the best performance among the algorithms used and claims that 1-to-6-quarter forward prediction is possible; its major limitations are the outdated data and the broad spectrum covered.

In 2019, Nebojša Dubošanini, Jan Eric Bühlmann and Pauline Offeringa [2] applied Linear Regression, Ridge Regression, LASSO Regression and Random Forest to predict the index for a data set of Melbourne, Australia. This project indicated that the Decision Tree model can be simple and still perform better than Ridge and LASSO Regression. Moreover, this model also provides an explicit look at the scheme and at how the target variable was computed. However, its drawbacks are the inability to predict above a threshold and the unspecified time span of the data, so it would be difficult to get a precise forecast because of the price fluctuation in this area.

In 2020, Yichen Zhou [8] put Linear Regression, LASSO Regression, Random Forest and XGBoost into practice to predict house prices in Ames, Iowa, USA, with 79-variable data covering the 5-year period from 2005 to 2010. The experimental results showed that the values predicted by the XGBoost model were more than 94% accurate in comparison with the actual figures. Nevertheless, with such a considerable number of variables, the EDA for feature extraction would take a lot of time, and the recency of the data is also a concern.

Even in 2021, LASSO-LARS Regression, Bayesian Ridge Regression, SVR, SGD and GBR were still used for price prediction on a housing data set of Islamabad, the capital of Pakistan [9], by Imran, Umar Zaman, Muhammad Waqar and Atif Zaman. The results show that SVR performs better than the rest of the machine learning algorithms.

From the information collected, there are many studies with various methods such as ARIMA, ADL and XGBoost combined with Linear Regression, which were still applicable for prediction models as late as 2021. However, I take upgraded versions of Linear Regression, the Decision Tree model and Gradient Boosting Regression, namely Stochastic Dual Coordinate Ascent (SDCA), Fast Forest Regression, and Light Gradient Boosting Machine and Fast Tree Tweedie Regression respectively, for a faster training runtime, less memory usage and minimization of convex loss functions, combined with a visual demonstration on a website running at localhost. Additionally, the data set is downloaded from laydulieu.com because it is updated regularly every week. There are 2 basic processes related to this problem:

• Data preparation: Raw data is not suitable to be used directly, so it needs to be pre-processed before it can be used to fit the parameters of the models. Redundant attributes must be removed and the remaining ones converted into appropriate data types. Figure 2-1 shows the process of data preparation after the data is downloaded.

• Building models: After the data is ready, choosing a feasible model should not be overlooked. Multiple algorithms are trained on the data and compared to find the best-performing one for making predictions. Figure 2-2 shows the fundamental process for a prediction model.

Figure 2-2 Process Flow of Prediction Model [12]

2.2 Data Collection and Data Set Generating

2.2.1 Data Collection

The thesis mainly concentrates on the Vietnamese real estate market, in particular HCMC. Although there are plenty of websites where users can look up information, retrieving their data is a tough problem. The data was manually downloaded from laydulieu.com [3], a website specializing in collecting data and statistics in the fields of real estate, cars, electronics, home appliances, jobs, finance and fruits. Figure 2-3 shows the UI of the website.

Figure 2-3 Website UI [3]

Because the time to finish the thesis is limited, it is not feasible to cover all of the districts in HCMC, so I took only 4 typical ones to examine the fitness of the models used. The reasons for the selection of these regions are given in Section 2.2.2.

Each downloaded data set is exported into an Excel (.xls) file containing only 50 records, which means that for about 2,000 records I had to download at least 40 times. In every downloaded file, the raw data contained 18 variables with a lot of information, several missing records, and false information such as an area of 18,000 m² in District 1. At first, I gathered all of the data sets into one and sorted them alphabetically by district. Then all of the redundant records were removed to clean the data. In the last step, some variables were renamed, some were added to better fit the models, and the file was converted into a csv file to suit the training format; a thorough explanation is available in Section 2.2.3. Figure 2-4 shows the raw data as downloaded and Figure 2-5 shows the data after processing.
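The merging step described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the thesis does this work manually, and the record layout (dictionaries with the hypothetical keys `district` and `title`) as well as the definition of a redundant record (an exact duplicate) are illustrative.

```python
def merge_batches(batches):
    """Combine downloaded batches, drop exact duplicates, sort by district."""
    seen = set()
    merged = []
    for batch in batches:
        for record in batch:
            # Assumption: two records are duplicates when all fields match.
            key = tuple(sorted(record.items()))
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return sorted(merged, key=lambda r: r["district"])


# Two tiny hypothetical batches (the real batches hold 50 records each).
batch_1 = [{"district": "Tân Bình", "title": "A"},
           {"district": "Quận 1", "title": "B"}]
batch_2 = [{"district": "Quận 1", "title": "B"},
           {"district": "Huyện Hóc Môn", "title": "C"}]
records = merge_batches([batch_1, batch_2])
```

With 50 records per download, about 2,000 records correspond to 2,000 / 50 = 40 batches passed to such a function.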


Figure 2-4 A raw data set example [3]


Figure 2-5 Processed Data Set

2.2.2 Data Description

The data set covers 4 different districts of Ho Chi Minh City, namely:

• District 1: District 1 is the central district of Ho Chi Minh City. House prices in District 1 are always attractive to investors because the district is home to many government agencies, consulates of many countries and long-standing historical sites. In the district there are many commercial centers, buildings for rent, office buildings, etc. It is also a gathering place for many companies, from Vietnamese companies to foreign-invested companies. The district has arterial roads and famous streets such as Nguyen Hue pedestrian street and Nguyen Van Binh book street. In addition, it is close to the canal system, with easy access to the Mekong Delta provinces. Another plus point is that the Metro system is in the finishing stage. Therefore, housing prices in District 1 always increase strongly. The district also concentrates many quality schools, from public schools to private schools with international standards, as well as many large hospitals, spas, clinics, beauty salons, etc., which bring many advantages to the people living here.

• Thu Duc District: Since the Prime Minister approved the establishment of Thu Duc City, which includes District 2, District 9 and Thu Duc District, real estate prices in this area have rocketed, and most opinions are that the area will grow strongly. Not only that, the prices of apartments and townhouses in Thu Duc, District 2 and District 9 are also becoming more and more attractive in the eyes of investors. The completion and operation of the new Mien Dong bus station and the Long Thanh airport construction project, which is being implemented, also make investors more interested in land prices in Thu Duc District. For those who intend to buy a house or land as a place to live and settle down with a small amount of capital, Thu Duc housing prices are worth considering. The current Thu Duc land price has increased, but it is still at a threshold that can be considered a reasonable investment.

• Tan Binh District: Tan Binh is an inner-city district located in the northwest of Ho Chi Minh City. This is where many large companies, factories and industrial parks gather. Many laborers concentrate in the district to live and do business, so many people need to search for real estate in Tan Binh. The district has major roads of the city such as Nguyen Van Troi, Cong Hoa, Hoang Sa, Truong Sa and Hoang Van Thu streets. Tan Son Nhat airport is also in Tan Binh District. With these advantages in traffic and transportation, housing prices in Tan Binh District are always at a high level. In addition, the district has many schools, shopping malls and large hospital systems, so its amenities are considered adequate. As a result, house prices in Tan Binh District are constantly fluctuating.

• Hoc Mon District: Hoc Mon District is a suburban district in the northwest of Ho Chi Minh City. Although the Hoc Mon housing market is attracting a lot of attention, housing prices in Hoc Mon District are still cheap. Because the district is being developed later than the others, its transport infrastructure and urban construction are planned synchronously, and the planning is long-term and strategic. Hoc Mon real estate has potential for future development: as prices in the central area increase and the land fund dwindles, housing in the Hoc Mon area has received more attention.

This data set consists of about 2,000 records with 18 variables. It was downloaded from laydulieu.com, and its reliability is assured on the basis of Decision No. 02/2020/QD-UBND promulgating regulations on the land price list in Ho Chi Minh City for the period 2020-2024, issued by the People's Committee of Ho Chi Minh City on January 16, 2020 [7]. Moreover, Mogi.vn [11], the representative website of Dinh Anh Joint Stock Company, is an online site specializing in real estate postings. It provides information on current real estate accompanied by market data, area information and a tool to calculate the cost as well as the installment period of each property. It applies artificial intelligence (AI) technology, through a survey of over 2,000,000 real estate listings of Mogi.vn and Muaban.net combined with a specialized calculation engine. Mogi updates the monthly housing price list of each area to help users keep up to date with prices. This is a continuously updated data set with the latest information on the properties, so it can keep the predictors up to date with current prices and give satisfactory results. Table 2-1 summarizes the raw data by district, and a detailed data description is shown in Table 2-2.

Table 2-1 Raw Data Summary

District         Record quantity   Attribute
Quận 1           712               18
Huyện Hóc Môn    553               18
Thủ Đức          589               18
Tân Bình         360               18
Total            2,214             18

Table 2-2 Data Description

Attribute      Description                         Data Type
STT            The order number of the property    Numeric
Chuyên mục     The form of the property            String
Hướng          The direction of the property       String

Figure 2-6 SellPrice Scatterplot in HCMC

Figure 2-6 gives the scatter plot of the sell price. It can be clearly seen that most of the points are concentrated at the bottom.

Figure 2-7 SellPrice Distribution of HCMC

The graph above shows that the price distribution is right-skewed, which is reasonable: most properties are priced under 10 billion VND because few people can afford the exorbitant ones (more than 10 billion VND). From Figure 2-6 and Figure 2-7, most real estate prices in Ho Chi Minh City range from 1 to 10 billion VND, accounting for more than half of the data records, while only a small number of properties are appraised at a high price. This means that affordable real estate is still the most popular, even in one of the metropolises of Viet Nam.
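The two observations above (a right-skewed distribution, and most records lying between 1 and 10 billion VND) can be checked numerically. The sketch below is illustrative only: the function name `summarize_prices` and the sample prices are assumptions, not data from the thesis. A mean that sits well above the median is a simple indicator of right skew.

```python
import statistics


def summarize_prices(prices, low=1.0, high=10.0):
    """Return (share of prices within [low, high], mean, median)."""
    in_range = sum(1 for p in prices if low <= p <= high)
    return in_range / len(prices), statistics.mean(prices), statistics.median(prices)


# Hypothetical sell prices in billion VND, for illustration only.
prices = [1.5, 2.0, 3.2, 4.8, 6.0, 7.5, 9.0, 25.0, 60.0, 0.8]
share, mean_price, median_price = summarize_prices(prices)
```

In this made-up sample, a few very expensive properties pull the mean far above the median while most records stay inside the 1 to 10 billion VND band.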

It is crucial to identify the variables which have a strong correlation with the target attribute (Price). According to [5,6], the factors influencing real estate prices can be classified into 3 categories: Property-Related, Location, and Environment.

2.2.3.1 Property-related Factors

There are 11 variables related to this section: Chuyên mục, Nhu cầu, Người đăng, Điện thoại, Ngày đăng, Diện tích, Hướng, Số tầng, Số phòng, Nhà vệ sinh and Giấy tờ pháp lý.

Figure 2-8 Average price for types of real estate

Firstly, buyers tend to consider seriously the form and the size of a property against their demands [6]. Figure 2-8 shows that the average price of “Nhà” (house) is the highest (13.31 billion VND), so the form of property can seriously affect the price. Particularly in Viet Nam, when someone is looking for a house, the structure is a real concern because some buyers will move in immediately while others will slightly rebuild the house sooner or later.

In addition, legality is also an important factor affecting a property's value: if there is no problem with the documents, such as shared ownership, the procedure for transferring the owner's name will be much simpler. Figure 2-9 gives a clearer visualization.

Figure 2-9 Relationship between SellPrice and Legal Document

So, it can be derived that the “Price” variable correlates most with “Chuyên mục”, “Diện tích”, “Hướng”, “Số tầng”, “Số phòng”, “Nhà vệ sinh” and “Giấy tờ pháp lý”. These variables will be renamed as follows:

• “Type”: This represents the forms of property available, such as “Đất”, “Nhà” or “Căn hộ, Chung cư”. The data type of this attribute is “String”.

• “Area”: This represents the size of the property, with the “Numeric” data type.

• “Direction”: This represents the direction of the property. There are 9 possible values in the data set: the 8 directions “S, W, E, N, SW, NW, SE, NE” and “blank” for missing data.

• “No Room”: This shows the number of rooms of the house or apartment.

• “No Floor”: This shows the number of floors of the house.

• “Legal Document”: This tells whether the real estate has the required documents or not, so the data type is “Boolean”, with “1” for “true” and “0” for “false” to simplify the process.
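The renaming and encoding listed above can be sketched in Python as follows. This is not the thesis code. The pairing of Vietnamese and English names is inferred from their meanings, and the rule that any non-empty legal-document text counts as “true”, together with the sample value “Sổ hồng”, is an assumption made for illustration.

```python
# Assumed pairing of source columns with the names used in the thesis.
RENAME = {
    "Chuyên mục": "Type",
    "Diện tích": "Area",
    "Hướng": "Direction",
    "Số phòng": "No Room",
    "Số tầng": "No Floor",
    "Giấy tờ pháp lý": "Legal Document",
}


def rename_and_encode(raw):
    """Rename the selected columns and encode Legal Document as 1/0."""
    row = {new: raw.get(old, "") for old, new in RENAME.items()}
    # Assumption: any non-empty legal-document text means the documents exist.
    row["Legal Document"] = 1 if str(row["Legal Document"]).strip() else 0
    # Missing directions are kept as an explicit "blank" category.
    row["Direction"] = row["Direction"].strip() or "blank"
    return row


# Hypothetical raw record, for illustration only.
raw = {"Chuyên mục": "Nhà", "Diện tích": "56", "Hướng": "", "Số phòng": "3",
       "Số tầng": "2", "Giấy tờ pháp lý": "Sổ hồng"}
row = rename_and_encode(raw)
```

Keeping “blank” as its own category, rather than dropping such rows, matches the 9 values listed for “Direction”.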

2.2.3.2 Locational Factors

Location is considered to be the most significant feature in determining house prices. An examination of the data set shows that “Tỉnh/Thành phố”, “Quận/Huyện”, “Phường/Xã” and “Đường, khu vực” are the variables representing the administrative regions.

Figure 2-10 Average Price in each District

From Figure 2-10, it is obvious that the closer the location is to the center, the higher the price. The “Tỉnh/Thành phố” variable is eliminated because all of the regions are in HCMC, while all of the remaining ones are strongly correlated with “Price”.

“Quận/Huyện” gives the district of the property, whether downtown or suburb, which identifies the general price level. “Phường/Xã” gives the ward of the property, and there are various differences between wards. “Đường, khu vực” gives the specific location of the property in the format “Street's name, Ward, District, City” and helps the model fit better because there are significant differences between streets or wards within one district [7], especially in HCMC. Thus, these are taken into account while the others are removed. “Quận/Huyện”, “Phường/Xã” and “Đường, khu vực” are renamed “District”, “Ward” and “Address” respectively, with the data type “String”.

2.2.3.3 Environment Factors

“Mô tả” is the variable that contains all of the information about the surroundings of the real estate, such as neighborhood, transport or amenities, and this can be included when deciding the price of a property. Efficiency of public education, community social status and proximity to shopping malls typically improve the worth of a property [6]. Since “Mô tả” is a long character string that the model would struggle with, it is separated into 4 primary criteria, namely “Townhouse”, “Amenity”, “Centrality” and “Transportation”:

• “Townhouse”: There are myriad alleys in Viet Nam, especially in HCMC, so if a property is located on a road that is convenient for business, its price will be higher than that of properties in small and rough alleys.

• “Amenity”: As mentioned above, proximity to educational institutions or shopping malls can be beneficial.

• “Centrality”: Owning a property in the center of a developed city like Ho Chi Minh City is a fortune because there are various options for making a profit.

• “Transportation”: Many buyers opt for a tranquil neighborhood that still does not hinder commuting, especially those who use cars. So, an alley that is wide enough for cars to turn around freely is worth mentioning.

The data type of all 4 new attributes is “Boolean”, the same as “Legal Document”.
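The thesis does not state how the four Boolean attributes were derived from “Mô tả”. One possible approach, shown here purely as an assumption, is keyword matching on the description text. The keyword lists below are hypothetical examples and would need to be tuned against real listings.

```python
# Hypothetical keyword lists; the thesis does not state how the flags were derived.
KEYWORDS = {
    "Townhouse": ["mặt tiền"],
    "Amenity": ["gần trường", "gần chợ", "siêu thị"],
    "Centrality": ["trung tâm"],
    "Transportation": ["hẻm xe hơi"],
}


def description_flags(description):
    """Turn a free-text description into four 0/1 attributes."""
    text = description.lower()
    return {name: int(any(word in text for word in words))
            for name, words in KEYWORDS.items()}


# Made-up description, for illustration only.
flags = description_flags("Nhà mặt tiền, gần chợ, hẻm xe hơi quay đầu")
```

Manual labelling would give more reliable flags than keyword matching; the sketch only illustrates how a long string can be reduced to four Boolean columns.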


2.2.4 Data pre-processing

Steps to pre-process data:

• Step 1: Research the influential attributes through the Government Decrees to identify the features of the data set.

• Step 2: Eliminate the unnecessary fields in the data set, such as “Tiêu đề”, “Nhu cầu” or “Số điện thoại”, to avoid redundancy, and rename several fields, like “Mô tả” or “Giấy tờ hợp pháp”, to fit the models used in training.

• Step 3: The remaining fields need to have standard formats with no NULL values, to avoid unsupported data types in the algorithms. For example, Legal Document should be of Boolean type, and the value of Price should be converted into billions of VND.

• Step 4: Convert the data set into a csv file. Table 2-3 shows the summary of the data after processing.
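Steps 2 to 4 above can be sketched with the Python standard library. This is a minimal illustration, not the procedure used in the thesis: the function names, the assumption that the raw price is a plain number in VND, and the sample rows are all hypothetical.

```python
import csv
import io

DROP_FIELDS = {"Tiêu đề", "Nhu cầu", "Số điện thoại"}  # fields named in Step 2


def preprocess(rows, price_field="Price"):
    """Drop redundant fields, skip rows with missing values, express price in billion VND."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in DROP_FIELDS}
        if any(v is None or str(v).strip() == "" for v in kept.values()):
            continue  # Step 3: no NULL values may remain
        # Assumption: the raw price is a plain number of VND.
        kept[price_field] = float(kept[price_field]) / 1e9
        cleaned.append(kept)
    return cleaned


def to_csv(rows):
    """Step 4: serialise the cleaned rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows, for illustration only.
rows = [
    {"Tiêu đề": "Bán nhà", "District": "Quận 1", "Price": "5200000000"},
    {"Tiêu đề": "Bán đất", "District": "", "Price": "1800000000"},
]
cleaned = preprocess(rows)
csv_text = to_csv(cleaned)
```

The second sample row is skipped because its district is empty, which mirrors the requirement that no NULL values reach the training algorithms.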

Table 2-3 Processed Data Summary

District Record quantity Attribute


2.3 Regression Model and Evaluation Metrics Used

2.3.1 Linear Regression Models — Stochastic Dual Coordinate Ascent

Regression

The linear regression model is a common machine learning algorithm that models the relationship between two or more variables by fitting a linear equation to observed data, in which one variable is considered to be an explanatory variable and the other a dependent variable. It gives the model the ability to predict outputs for inputs it has never seen before. A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept.
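As a worked illustration of the equation Y = a + bX, the sketch below fits a and b by ordinary least squares. The function name `fit_line` and the sample data (area in square metres against price in billion VND) are assumptions made for the example, not values from the thesis.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: area in square metres against price in billion VND.
areas = [40.0, 60.0, 80.0, 100.0]
prices = [3.0, 4.0, 5.0, 6.0]
a, b = fit_line(areas, prices)
predicted = a + b * 70.0  # prediction for an unseen 70 m2 property
```

For this made-up data the fit gives a = 1.0 and b = 0.05, so a 70 m² property is predicted at 4.5 billion VND.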

SDCA has recently been considered a state-of-the-art primal-dual optimization algorithm for large-scale machine learning problems [10]. It processes the training examples sequentially in random order and performs iterative coordinate updates to maximize the dual objective.
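To make the coordinate updates concrete, the following is a minimal sketch of SDCA for a regularized squared loss, minimizing (1/n) Σ (w·x_i − y_i)² + (λ/2)‖w‖². It is a plain-Python illustration under these assumptions, not the library implementation used in the thesis; the function name, the parameters `lam` and `epochs`, and the toy data are hypothetical, and there is no bias term.

```python
import random


def sdca_ridge(X, y, lam=1.0, epochs=100, seed=0):
    """SDCA for (1/n) * sum((w.x_i - y_i)^2) + (lam/2) * ||w||^2."""
    n, d = len(X), len(X[0])
    alpha = [0.0] * n  # one dual variable per training example
    w = [0.0] * d      # primal weights, kept equal to sum(alpha_i * x_i) / (lam * n)
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            xi = X[i]
            pred = sum(wj * xj for wj, xj in zip(w, xi))
            sq_norm = sum(xj * xj for xj in xi)
            # Closed-form maximiser of the dual along coordinate i (squared loss).
            delta = (y[i] - pred - 0.5 * alpha[i]) / (0.5 + sq_norm / (lam * n))
            alpha[i] += delta
            for j in range(d):
                w[j] += delta * xi[j] / (lam * n)
    return w


# Toy one-feature data with y = 3x, for illustration only.
X = [[(i - 10) / 10.0] for i in range(20)]
y = [3.0 * row[0] for row in X]
w = sdca_ridge(X, y, lam=1.0, epochs=100)
```

Each step touches a single training example and solves its one-dimensional dual problem exactly, which is why no learning rate has to be tuned. For this objective the result should agree with the ridge solution w = Σ x_i y_i / (Σ x_i² + λn/2) in the one-feature case.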

2.3.2 Decision Tree — Fast Forest Regression

A Decision Tree builds regression models in the form of a tree structure. It breaks the data set down into smaller and smaller subsets, which are denoted as nodes. The completed tree has decision nodes and leaf nodes. A decision node has two or more branches, each representing values for the attribute tested. A leaf node represents a decision on the numerical target.

Fast Forest is a random forest implementation. The model consists of an ensemble of decision trees. Each tree in a decision forest generates a Gaussian distribution as a prediction. Aggregation is performed over the set of trees to find the Gaussian distribution that most closely approximates the combined distribution for all trees in the model.
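One common way to approximate an equally weighted mixture of Gaussians by a single Gaussian is moment matching, that is, keeping the mixture's mean and variance. The sketch below illustrates that idea only; whether Fast Forest aggregates its trees in exactly this way is an assumption, and the function name and sample values are hypothetical.

```python
def aggregate_gaussians(tree_predictions):
    """Moment-match an equally weighted mixture of Gaussians.

    tree_predictions: list of (mean, variance) pairs, one per tree.
    Returns the (mean, variance) of the single matching Gaussian.
    """
    n = len(tree_predictions)
    mean = sum(m for m, _ in tree_predictions) / n
    # Second moment of the mixture: average of (variance + mean^2) over trees.
    second_moment = sum(v + m * m for m, v in tree_predictions) / n
    return mean, second_moment - mean * mean


# Hypothetical per-tree predictions (price in billion VND, variance).
mean, variance = aggregate_gaussians([(1.0, 1.0), (3.0, 1.0)])
```

The combined variance (2.0 here) exceeds each tree's own variance (1.0) because disagreement between the trees adds to the uncertainty.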


Date posted: 23/10/2024, 00:54
