Vietnam National University Ho Chi Minh City
University of Technology
Faculty of Computer Science and Engineering

GRADUATION THESIS

A MICROSERVICE-BASED DATA CRAWLING AND ANALYZING FOR REAL ESTATE WEBSITES IN VIETNAM USING MACHINE LEARNING

Major: Computer Science
Council: Computer Science (English Program)
Instructor: Assoc Prof Quan Thanh Tho
Reviewer: Assoc Prof Bui Hoai Thang
Students: Pham Thi Mai - 1752335, Nguyen Ngo Chi Khang - 1752275, Nguyen Huu Nguyen - 1752036

Ho Chi Minh City, October 2021

Declaration Of Authenticity

We declare that this research is our own work, conducted under the supervision and guidance of Assoc Prof Quan Thanh Tho. The results of our research are legitimate and have not been published in any form prior to this. All materials used in this research were collected by ourselves from various sources and are appropriately listed in the references section. In addition, within this research we also used the results of several other authors and organizations; they have all been aptly referenced. In any case of plagiarism, we stand by our actions and will be responsible for them. Ho Chi Minh City University of Technology is therefore not responsible for any copyright infringement committed within our research.

Ho Chi Minh City, July 2021
Authors
Pham Thi Mai
Nguyen Ngo Chi Khang
Nguyen Huu Nguyen

Acknowledgment

We are using this opportunity to express our gratitude to everyone who supported us during our study and life. We are thankful for their aspiring guidance, invaluably constructive criticism and friendly advice. We offer our sincerest and deepest gratitude to our supervisor, Assoc Prof Quan Thanh Tho, for his support and guidance. We would like to thank Ho Chi Minh City University of Technology for giving us the opportunity to learn great lessons of theory and practical experience, as well as the many teachers and professors who accompanied us throughout the curriculum. Finally, we recognize that this research would not have been possible without the support from our families, and from the bottom of our hearts we acknowledge our parents, without whose love, encouragement and sacrifice we would not have finished this thesis.

Abstract

Transactional applications have been regarded as a major contributor to the success of many websites. A typical application contains three main components: frontend, backend, and database. The database is designed to record transactional data and represents elements of the real world. The logic is then implemented through mappings and data objects, to name but a few. Finally, a visual interface is placed on top of the application to present what users need.

Apart from this, for the purpose of analysis, a Data Warehouse is used as the solution. A Data Warehouse builds on operational data, which it extracts, transforms, and integrates. It is a business decision support system that helps users gain comprehensive knowledge of business-affecting factors through business reporting. In data analysis, query performance is one of the most important metrics. As historical data grows, a host of techniques is used to run complex queries with short response times.

Another fact taken into account is that, in a decision-making system, prediction and recommendation are used to give users the best estimate for the data they provide. Algorithms and mathematics are the principal knowledge needed to implement any machine learning model, and data and its quality are the key factors in the output of any prediction. Various well-known models have been tried and compared to achieve the highest accuracy.

With regard to this thesis and the above motivation, we decided to select the topic A Microservice-based Data Crawling and Analyzing for Real Estate Websites in Vietnam using Machine Learning, to implement a web application that reports real estate status and offers real estate forecasts and real estate items emerging in the
market at present.

List of Figures

2.1 ETL from sources to Data Warehouse
2.2 Three processes in ETL
2.3 Data Warehouse Concepts
2.4 Example of Dimension table
2.5 Example of Fact table
2.6 Example of Star schema
2.7 Steps in Dimensional Modeling
2.8 Example of point outliers in a time series
2.9 Example of contextual outliers in a time series
2.10 Example of collective outliers in a time series
2.11 Z-score in the normal distribution
2.12 Symmetric distribution and two types of skewed data
2.13 Example of Label encoding and One-hot encoding in Food Name column
2.14 Example of a decision tree for regression of playing hours based on weather
2.15 Bootstrap and Aggregation
2.16 Bootstrap and Aggregation
2.17 ANN architecture
2.18 Example of K-fold cross validation with test data is validation data in our project
2.19 Microservices Architecture
3.1 Use-case diagram for the whole system
3.2 Get real estate dashboard activity diagram
3.3 Search Post Activity Diagram
3.4 Get Predicted Real Estate Price Activity Diagram
4.1 High-level system design
4.2 Data flow in Scrapy
4.3 Real estate posts
4.4 Metadata after integration
4.5 Class diagram for integration process
4.6 Dashboard System
4.7 Data Warehouse schema design
4.8 Class Diagram for Dashboard service
4.9 Sequence Diagram for Dashboard service
4.10 Listings Workflow
4.11 Entity Relationship Diagram
4.12 Class Diagram for Listings Service
4.13 Sequence Diagram for Listings Service
4.14 Price Prediction Service Workflow
4.15 Price Prediction Service Class Diagram
4.16 Price Prediction Service Sequence Diagram
4.17 Summary of dataset used for evaluation
4.18 Missing values in columns of dataset
4.19 Example of label encoded column after applying label encoding
4.20 Positive distribution and skewness in target in housing data, land data, renting data respectively
4.21 Distribution and skewness after log transformation in price column in housing data, land data, renting data respectively
4.22 Positive distribution and skewness in area feature in housing data, land data, renting data respectively
4.23 Distribution and skewness after log transformation in area column in housing data, land data, renting data respectively
4.24 Example of extracting features from created_date
4.25 10 rows with some columns of one-hot encoding
4.26 Data splits for training, validation and testing
4.27 Component Diagram of Frontend Layer
5.1 Asynchronous Programming in Vert.x
5.2 Structure of Vue component
5.3 The use of declarative rendering in template syntax of Vue.js
5.4 Code example of Vue.js
5.5 Code example of TypeScript in Vue.js
5.6 Warning in VSCode
5.7 Docker Engine
5.8 Docker Architecture
5.9 Layered Image
5.10 Kubernetes Components in Node
5.11 Kubernetes Replication Mechanism
5.12 Kubernetes Architecture
6.1 Dashboard service web interface
6.2 Real estate listing service web interface
6.3 Price prediction service web interface

List of Tables

2.1 Comparison between Full load and Incremental load
2.2 Comparison between Database and Data Warehouse
2.3 Comparison between Microservices and Monolithic Architecture
4.1 Table of skewness and kurtosis before and after pre-processing of price column
4.2 Table of skewness and kurtosis before and after pre-processing of area column
6.1 Result of K-fold cross validation with k=10 for house data
6.2 Result of K-fold cross validation with k=10 for land data
6.3 Result of K-fold cross validation with k=10 for renting data
6.4 Performance of XGBoost finalized model on testing data
6.5 Performance of Star Schema versus Flat Table in BigQuery
6.6 Performance of Spring Boot versus Flask

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Goals and Scopes
  1.3 Scientific Significance
  1.4 Practical Significance
2 Theoretical Background
  2.1 Extract Transform Load (ETL)
  2.2 Data Warehouse
    2.2.1 Introduction to Data Warehouse
    2.2.2 Data Warehouse versus Database
    2.2.3 Components and Architecture of Data Warehouse
  2.3 Dimensional Modeling
    2.3.1 Introduction to Dimensional Modeling
    2.3.2 Elements in Dimensional Data Model
    2.3.3 Star Schema from Dimensional Modeling
    2.3.4 Steps of Dimensional Modeling
  2.4 Feature Engineering
    2.4.1 Numerical imputation
    2.4.2 Outliers removal
    2.4.3 Log transformation
    2.4.4 Label encoding and One-hot encoding
    2.4.5 Date extraction
    2.4.6 Binning
  2.5 Machine Learning
    2.5.1 Decision Tree Regression
    2.5.2 Random Forest Regression
    2.5.3 Gradient Tree Boosting
    2.5.4 Extreme Gradient Boosting (XGBoost)
    2.5.5 K-Nearest Neighbors (KNN)
    2.5.6 Bayesian Ridge Regression
    2.5.7 Linear Regression
    2.5.8 Lasso Regression
    2.5.9 Ridge Regression
    2.5.10 Artificial Neural Networks
  2.6 Cross validation and evaluation metrics
    2.6.1 K-fold cross validation
    2.6.2 Evaluation metrics
  2.7 Database normalization
  2.8 Microservices
3 System analysis and requirement
  3.1 General System Features
    3.1.1 Real Estate Dashboard Service
    3.1.2 Listings Service
    3.1.3 Price Prediction Service
  3.2 Functional requirements
    3.2.1 Real Estate Dashboard Service
    3.2.2 Listings service
    3.2.3 Price prediction service
  3.3 Nonfunctional requirements
    3.3.1 Real Estate Dashboard Service
    3.3.2 Listings service
    3.3.3 Price prediction service
  3.4 Diagrams
    3.4.1 Use case description
    3.4.2 Activity diagrams

Kubernetes worker nodes handle the actual work and run the following processes:
• Container Runtime: the underlying software used to run containers, typically Docker. It runs the encapsulated application in a relatively isolated, lightweight operating environment.
• Kubelet: responsible for starting the pod with a container inside. It
communicates with the control plane, where cluster state is stored in etcd, to read configuration details, receive commands, and report and maintain the state of its work.
• Kube Proxy: a proxy service that runs on each node and helps make services available to external hosts. It ensures that requests are forwarded to the correct containers and performs primitive load balancing.

Chapter 6
Result and Evaluation

6.1 General System Information
The system is implemented across multiple platforms and servers:
• Crawling, integration, and all processes related to data processing run on a Google virtual machine with a Linux operating system (e2-medium with vCPUs, GB memory, 20 GB disk size), located in us-central1-a.
• The database is Postgres 9.6 (1 vCPU, GB memory, 34 GB disk size) with a built-in backup mechanism, located in us-central1-f.
• The data repository is Google Cloud Storage, in both multi-region and region location types, with three separate buckets serving the data processing of the three services.
• Images of the microservices are stored in Google Container Registry.
• Google Kubernetes Engine is responsible for the containers of the three microservices. Machine type: e2-medium, nodes, vCPUs, GB memory in total, located in us-central1-c.

6.2 Models Evaluation
• Every 15 days, new machine learning models are trained and uploaded to Cloud Storage.
• As the amount of data grows, the models are expected to perform better because they can learn more aspects of the data.
• XGBoost is lightweight and performs well, so our team uses it for development.
• Models are occasionally re-evaluated to confirm their performance; this is considered the maintenance stage.

(a) Cross validation for housing data

Table 6.1: Result of K-fold cross validation with k=10 for house data

Models             R2 score  RMSE    MSE     MAE
XGBoost            0.8221    0.3358  0.1128  0.2402
Random Forest      0.8424    0.3169  0.1004  0.2094
ANN                0.7070    0.4320  0.1867  0.3179
Gradient Boosting  0.7277    0.4165  0.1735  0.3035
Bayesian Ridge     0.5162    0.5551  0.3082  0.4120
KNN                0.7490    0.3998  0.1599  0.2750
Decision Tree      0.7056    0.4331  0.1876  0.2769
Ridge              0.5162    0.5551  0.3082  0.4120
Lasso              0.5162    0.5551  0.3082  0.4120

Tree-based models outperform linear models on the housing data. Random Forest and XGBoost are the best algorithms, with RMSE of 0.3169 and 0.3358 and R2 scores of 0.8424 and 0.8221 respectively, which suggests that our data pre-processing stage is suitable and good enough for the application. The ANN with layers, 50 epochs of training and the Adam optimizer shows a modest result of 0.7070 in R2 and 0.4320 in RMSE.

(b) Cross validation for land data

Table 6.2: Result of K-fold cross validation with k=10 for land data

Models             R2 score  RMSE    MSE     MAE
XGBoost            0.8059    0.6277  0.3941  0.3963
Random Forest      0.8200    0.5433  0.2953  0.3484
ANN                0.6361    0.7724  0.5970  0.5788
Gradient Boosting  0.6901    0.7130  0.5084  0.5304
Bayesian Ridge     0.4286    0.9683  0.9378  0.7518
KNN                0.7410    0.6519  0.4251  0.4346
Decision Tree      0.6728    0.7326  0.5369  0.4507
Ridge              0.4286    0.9683  0.9377  0.7517
Lasso              0.4284    0.9685  0.9381  0.7522

Tree-based models also outperform linear models on the land data. Random Forest and XGBoost are again the top algorithms, with RMSE of 0.5433 and 0.6277 and R2 scores of 0.8200 and 0.8059 respectively, while the ANN shows a rather poor result.

(c) Cross validation for renting data

Table 6.3: Result of K-fold cross validation with k=10 for renting data

Models             R2 score  RMSE    MSE     MAE
XGBoost            0.6930    0.3217  0.1035  0.2348
Random Forest      0.7019    0.3170  0.1005  0.2203
ANN                0.5191    0.4026  0.1622  0.3071
Gradient Boosting  0.5771    0.3776  0.1426  0.2892
Bayesian Ridge     0.3973    0.4508  0.2032  0.3496
KNN                0.5136    0.4050  0.1640  0.2885
Decision Tree      0.4367    0.4357  0.1899  0.2915
Ridge              0.3971    0.4509  0.2033  0.3497
Lasso              0.3971    0.4509  0.2033  0.3497

Again, tree-based models outperform linear models on the renting data. Random Forest and XGBoost are the top algorithms for the third time, with RMSE of 0.3170 and 0.3217 and R2 scores of 0.7019 and 0.6930 respectively, while the ANN
shows a rather poor result. However, all algorithms perform worse on the renting data than on the datasets above because many rows were deleted during the pre-processing stage.

Random Forest is the best algorithm, but we use XGBoost models in the backend of the price prediction service because Random Forest is too heavy to deploy to Kubernetes.

(d) Testing data
XGBoost performs well and is chosen for the backend, so we finalize the models using XGBoost with the train/validation data and evaluate their predictions on the testing data, which is split as described in the data split strategy part. The final assessment on the testing data is:

Table 6.4: Performance of XGBoost finalized model on testing data

Data          R2 score  RMSE    MSE     MAE
Land data     0.8005    0.5695  0.3243  0.3966
Housing data  0.8233    0.3332  0.1110  0.2397
Renting data  0.6926    0.3207  0.1028  0.2348

The out-of-fold results are good enough, with all R2 scores above 0.69. Therefore, XGBoost with the feature engineering steps explained above can be applied to the website.

6.3 Data Warehouse Evaluation
• Each daily transformation run into the Data Warehouse takes five minutes and is triggered on a time schedule.
• Every day, approximately 200,000 real estate records are loaded incrementally into BigQuery.
• As the amount of data grows, scaling the server is not an issue.
• Logs of errors, output and update time are kept for every daily load.
• All actions and processes are executed by a service account managed by our team.
• Star schema is preferred because it outperforms the other schema type we considered, the flat table.
• BigQuery is suitable for complex analytic queries on a specific table. This data warehouse also supports a built-in cache, which takes approximately milliseconds on each repeated query.
• As the data warehouse is updated daily, BigQuery returns the cached results and avoids executing the same query again during the day.

Table 6.5: Performance of Star Schema versus Flat Table in BigQuery

Query description:
1. Average price in terms of transaction type, property type, and month per year
2. Number of posts based on transaction type with known property type
3. Number of posts based on property type
4. Number of real estate posts and average price based on province/city with known transaction type
5. Total area and average price of metropolis
6. Number of posts and average price of projects with known transaction type

Star Schema (s), Flat Table (s): 0.2 1.2 0.3 0.7 0.4 0.4 0.4 0.7 1.1 0.6 0.6

6.4 Listing Service Evaluation
• Each daily normalization run into the operational database takes five minutes and is triggered on a time schedule.
• Every day, approximately 1000 real estate records are loaded incrementally into the operational database.
• Logs of errors, output and update time are kept for every daily load.
• The database consumes less storage after normalization.

Table 6.6: Performance of Spring Boot versus Flask

Query description                                    Spring Boot (s)  Flask (s)
Select all posts                                     5.7              7.5
Select posts with known province                     6.1              6.3
Select posts with known province and district        4.6              5.3
Select posts with known province, district and ward  4.2              4.2
Select posts with known transaction type             5.7              7.1
Select posts with known property type                6.3              7.1
Select posts with known project                      3.7              3.3

6.5 Web application

(a) Dashboard page
Figure 6.1: Dashboard service web interface

(b) Post listing page
Figure 6.2: Real estate listing service web interface

(c) Price prediction page
Figure 6.3: Price prediction service web interface

Chapter 7
Summary

7.1 Achievement
In this thesis, to meet the scalability and processing demands of big data collected from real estate websites, we have researched and proposed an approach that uses Big Data technologies as the foundation for storing, querying and processing data. From there, we have developed a complete data collection and analysis system
which includes all components for data collection, extraction, preprocessing, storage, transactional application, analysis, and prediction of real estate data. The microservices in the system are developed independently of each other, interacting only through common data in the database and data warehouse. Along with them is a powerful data recovery mechanism ensuring that when a component fails, it does not affect the remaining components. Furthermore, all components related to data processing run daily and incrementally so that users have a well-rounded perspective on both historical and present records. We have also taken a variety of steps to integrate and standardize data and to remove outliers from multiple sources in order to enhance the quality and preserve the diversity of the data.

Besides, we use the data to develop an application that helps users keep up with real estate products, decide which real estate types to invest in, and gain an overall view of the market. The application has three highlights: it displays charts of real estate status, lets the user see a suggested price for entered real estate information, and helps users keep up with the latest prominent real estate items.

7.2 Thesis Assessment
Up to now, real estate data has been collected and integrated daily. The first version of the Data Warehouse has been designed. The raw data amounts to more than 650k rows after three months, and this number keeps increasing over time.

This thesis proposes a basis for building a system that meets business requirements and can collect and store data and support market research for business decisions:
• Quick updates from multiple sources
• A wide range of query options according to business purposes
• Fast responses to system queries even though the data is large

7.3 Future Development

(a) Limitations
Our thesis has many limitations:
• Although all models perform well, they are not really
excellent, so there is still room to improve them.
• Noisy data still exists although the data is pre-processed carefully, so it affects the performance of the models and the statistical dashboard.
• The website has many features, from an analytical dashboard and listings of real estate posts by location and area to price estimation, which are enough to support users in analyzing the real estate market in Vietnam. Nonetheless, the missing pictures of real estate in posts need to be addressed.

(b) Future Development
To address the above downsides, in the near future we need to develop our thesis as follows:
• Continue to research and improve the machine learning models through hyperparameter tuning and more feature engineering techniques. Besides, we can use deep learning models for the price prediction service.
• Research more data pre-processing techniques to clean the data in the database and data warehouse.
• Crawl more data from more real estate websites to enrich our system.
• Add new features to our website.
• Treat maintenance as an important stage of the project, checking the database, server, backend, frontend and models frequently.
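
As a companion to the model evaluation in Section 6.2, the following is a minimal sketch of how K-fold cross validation with k=10 and the four reported metrics (R2, RMSE, MSE, MAE) can be computed. It is not the thesis code: the helper names (`k_fold_indices`, `regression_metrics`, `fit_line`, `cross_validate`), the one-feature least-squares model standing in for XGBoost, the synthetic data, and the choice to average each metric over the folds are all assumptions made for illustration. Only the Python standard library is used.

```python
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [indices[i::k] for i in range(k)]


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MSE and MAE for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot,  # assumes y_true is not constant
        "rmse": math.sqrt(mse),
        "mse": mse,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


def fit_line(xs, ys):
    """One-feature ordinary least squares; a hypothetical stand-in for a real model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_validate(xs, ys, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average each metric."""
    folds = k_fold_indices(len(xs), k, seed)
    totals = {"r2": 0.0, "rmse": 0.0, "mse": 0.0, "mae": 0.0}
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores = regression_metrics(
            [ys[i] for i in held_out], [predict(xs[i]) for i in held_out]
        )
        for name, value in scores.items():
            totals[name] += value
    return {name: total / k for name, total in totals.items()}


# Synthetic data for illustration only: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]
averages = cross_validate(xs, ys, k=10)
print({name: round(value, 4) for name, value in averages.items()})
```

Within a single fold MSE equals RMSE squared, but once the metrics are averaged over folds the averaged MSE is generally not the square of the averaged RMSE, so small mismatches between those two columns in a cross-validation table are expected.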