GRADUATION THESIS
DEVELOPMENT OF A WEB APPLICATION TO MONITOR
STATISTICAL DATA OF REAL ESTATE
PROPERTIES
DEPARTMENT OF SOFTWARE ENGINEERING
INSTRUCTOR: Assoc. Prof. Quan Thanh Tho
REVIEWER: Assoc. Prof. Bui Hoai Thang
STUDENT: Pham Minh Tuan 1752595
Ho Chi Minh City, July 2021
FACULTY: Computer Science & Engineering          GRADUATION THESIS ASSIGNMENT
DEPARTMENT: Computer Science
Note: Students must attach this sheet to the first page of the thesis report.

FULL NAME: Phạm Minh Tuấn          STUDENT ID: 1752595

1. Thesis title:
Development of a Web application to monitor statistical data of real estate properties

2. Tasks (required content and initial data):
✔ Investigate Selenium WebDriver to collect data from anti-crawling websites
✔ Investigate and build a price-predicting model using Linear Regression based on collected real estate data
✔ Crawl data every day and update new data into the database
✔ Investigate Python libraries to implement the model and train it with the collected data

3. Date of thesis assignment: 03/09/2015
4. Date of completion: 20/12/2015
5. Supervisor: Assoc. Prof. Dr. Quản Thành Thơ          Supervised part:

SECTION FOR THE FACULTY AND DEPARTMENT:
Reviewer (preliminary grading):
Date: ..../..../....

THESIS DEFENSE EVALUATION SHEET
(For the supervisor/reviewer)

1. Student's full name: Phạm Minh Tuấn
2. Topic: Development of a Web application to monitor statistical data of real estate properties
3. Supervisor/reviewer: Assoc. Prof. Dr. Quản Thành Thơ
4. Overview of the report:
6. Main strengths of the thesis:
The student has successfully developed a system as required, which consists of the following features: (i) crawling data from real estate websites; (ii) normalizing and preprocessing the data; and (iii) using AI techniques for data forecasting.
7. Main shortcomings of the thesis:
The student should have conducted more analysis of the business requirements of the project.
8. Recommendation: Approved for defense / Needs additional work before defense / Not approved for defense
9. Three questions the student must answer before the committee:
10. Overall assessment (in words: excellent, good, average):          Grade: 7.6/10

Signature (full name)
Assoc. Prof. Dr. Quản Thành Thơ
August 6, 2021

THESIS DEFENSE EVALUATION SHEET
(For the supervisor/reviewer)

1. Student's full name: Phạm Minh Tuấn
2. Topic: Development of a web application to monitor statistical data of real estate properties
3. Supervisor/reviewer: Bùi Hoài Thắng
4. Overview of the report:
6. Main strengths of the thesis:
- Showed an understanding of crawling data, especially real estate data, from some popular real estate websites
- Showed an understanding of some techniques for predicting prices based on the collected real estate prices
- Designed and implemented a web-based application to show statistics of the collected data, and performed some experiments on predicting prices
7. Main shortcomings of the thesis:
8. Recommendation: Approved for defense [X] / Needs additional work before defense / Not approved for defense
9. Three questions the student must answer before the committee:
a.
b.
c.
10. Overall assessment (in words: excellent, good, average): Good          Grade: 6.5/10

Signature (full name)
Bùi Hoài Thắng
I guarantee that the thesis's content is written and illustrated on my own. All the content that I read from the documentation of the technologies used in my web application is listed on the References page, and I promise that I do not copy or take the content of others' theses without permission.
First, I want to say thanks to my instructor, Assoc. Prof. Quan Thanh Tho, for guiding me on my topic as well as giving me the knowledge to implement the web application from the first step.

Besides, I am extremely happy because of my family's support during the time I did the thesis and throughout three years of my study at Ho Chi Minh City University of Technology. This work could not have been finished well if I had not received so much encouragement from my parents and my little brother.
The real estate market plays an important role in developing Vietnam's economy. This industry has started growing fast in recent years to respond to the needs of customers, and some real estate companies want to collect as much data in this area as possible to analyze the specific areas that attract the most investors as well as sellers. However, the data about real estate is enormous and spread over lots of websites, so it is hard for them to gather all the information at once. This thesis introduces a web application that collects information from real estate websites so that real estate agents can give a precise analysis of real property resources.
Contents

List of Figures

1.1 Introduction
1.2 Objectives and Scope

2 Background Knowledge
2.1 Frontend
2.1.1 Single-page Application (SPA)
2.1.2 JSX
2.1.3 Components and Props
2.1.4 States
2.1.5 Events
2.1.6 Code Splitting
2.1.7 Fragment
2.2 Backend
2.2.1 Model Layer
2.2.2 View Layer
2.2.3 Template Layer
2.3 AJAX Request - Response
2.4 PostgreSQL
2.5 Scrapy
2.5.1 Basic concept
2.5.2 Scrapy's Architecture
2.5.3 Working with Dynamically-loaded Content
2.5.4 Selenium
2.6 Linear Regression
2.6.1 Linear Model
2.6.2 Cost Function
2.7 Polynomial Features
2.8 Evaluation Metrics
2.8.1 Mean Squared Error (MSE)
2.8.2 R-Squared Score (R2)
2.8.3 Cross Validation Score
2.9 Underfitting and Overfitting
2.10 Regularization
2.10.1 Ridge Regression
2.10.2 Lasso Regression
2.11 Feature Engineering
2.11.1 Handling missing data
2.11.2 Outliers removal
2.11.3 Log Transformation
2.12 Supported Python Libraries
2.12.1 Numpy
2.12.2 Pandas
2.12.3 Matplotlib
2.12.4 Scikit-Learn

3 System Implementation
3.1 Use-case Diagram
3.2 Architecture Diagram
3.3 Database
3.4 Workflow
3.5 Crawling Data
3.5.1 Handling duplicated data
3.5.2 Formatting Item's name
3.5.3 Handling web pages with Selenium
3.6 Data Modeling
3.6.1 Data Preparation
3.6.2 Training Models
3.6.3 Model Evaluation Result
3.7 Web Application
3.7.1 Register page
3.7.2 Login page
3.7.3 Dashboard page
3.7.4 Data page
3.7.5 Price prediction page
3.7.6 Admin page

4 Summary
4.1 Achievement
4.2 Future Development
4.2.1 Thesis limitation
4.2.2 Further development
List of Figures

2.1 Single-page Application
2.2 Scrapy's architecture
2.3 How Selenium WebDriver works
2.4 Non-linear dataset
2.5 Five-fold cross-validation
2.6 Underfitting & Overfitting
2.7 Boxplot components
2.8 Matplotlib Boxplot
2.9 Skewed Data
2.10 Data before using Log Transform
2.11 Data after using Log Transform
3.1 Use-case diagram
3.2 Architecture diagram
3.3 CRED System Database Schema
3.4 Scrapy collected data sample
3.5 Cu Chi Selling Land dataset
3.6 Dataset in Ba Thien, Nhuan Duc, Cu Chi
3.7 Original dataset in Nguyen Huu Canh, 22, Binh Thanh
3.8 Dataset after using Log Transformation
3.9 Dataset Histogram before & after Log Transformation
3.10 Dataset after removing duplicates
3.11 Boxplot of area values
3.12 Divide dataset into equal parts
3.13 Outliers detection after applying on each part
3.14 Polynomial Regression degree chosen by train set's RMSE
3.15 Polynomial Regression degree chosen by validation set's RMSE
3.16 Register page
3.17 Login page
3.18 Dashboard page
3.19 Dashboard page
3.20 Data page
3.21 Price prediction page
3.22 Admin page - Manage users
3.23 Admin page - Retrain models
In this chapter, I am going to introduce my thesis topic, then I will present the objectives and the scope of my thesis for the web application.

1.1 Introduction
1.2 Objectives and Scope
1.1 Introduction
Crawling data is a common feature that appears in many web applications whose main function is collecting data. However, a web app that collects and monitors data in the real estate field for ordinary real estate agents is rare. Therefore, agents need software to help them synthesize data from other real estate websites and then start analyzing the collected data.

In the thesis phase, I will implement a web application that collects information from different real estate websites and displays it on my web page. Clients can register accounts, log in to the web app, and view the data. The web app also provides filter functionality to let users filter the data for custom viewing, and it helps users predict the price of a specific real estate property based on elements like area, street, ward, and district. Moreover, the web app gives users a general view of the real estate data through the charts displayed on the Dashboard page.
1.2 Objectives and Scope
The objectives of the thesis are:

- Implement a web application to show the collected data.
- Build a custom crawler to get the data from three main real estate websites: batdongsan.com.vn, homedy.com, and propzy.vn.
- Provide some real estate data statistics based on the collected data.
- Build machine learning models to predict the price of real estate properties based on the collected data.
For the topic scope:

- The web app is limited to a few users only (fewer than ten) and it works locally.
- The collected real estate posts are limited to Ho Chi Minh City only, and the post type is selling.
In this chapter, I am going to present the knowledge I researched to implement the web application, including some definitions of the technologies I will use in both the Frontend and Backend parts.
2.1 Frontend
2.2 Backend
2.3 AJAX Request - Response
2.4 PostgreSQL
2.5 Scrapy
2.6 Linear Regression
2.7 Polynomial Features
2.8 Evaluation Metrics
2.9 Underfitting and Overfitting
2.10 Regularization
2.11 Feature Engineering
2.12 Supported Python Libraries
2.1 Frontend
2.1.1 Single-page Application (SPA)
A single-page application is a web application that can interact with the client dynamically using only a single web page at a time. Whenever a user changes the content of the web page, the page is rewritten with the new content instead of loading a new page.

Each page of the web application usually has a JavaScript layer at the header or bottom to communicate with the web services on the server side. The content of the web page is loaded from the server side in response to events that the user triggers on the current web page, such as clicking a link or a button.
Figure 2.1: Single-page Application
Nowadays, with the development of the technologies used to build web applications, especially in the frontend area, there are three popular frameworks that most developers use: React JS, Angular JS, and Vue JS. I will state the comparison between them and why I chose React JS to implement the frontend of my web application.

Angular JS
Angular JS is a full-fledged MVC framework which provides a set of predefined libraries and functionality for building a web application. It is developed and maintained by Google, was first released in 2010, and is based on TypeScript rather than JavaScript itself.
Angular is unique with its built-in two-way data binding feature, unlike React, which uses one-way data binding. Both React JS and Angular JS use components to build the web application, which means that a component (or a model) rendered in the view can hold different data depending on the web page the user is seeing. In Angular, when the data in the view changes it also leads to a change of the data in the model, while in React it does not. This makes it more convenient for developers to build a web application; however, in my opinion, it would be harder to manage the data, as well as to debug, when the web application becomes larger.
React JS
React JS is actually a JavaScript library, not a framework like the other two, and it is developed and maintained by Facebook. It was released in March 2013 and is described as "A JavaScript library for building user interfaces".

Since React JS is mostly used to create interactive UIs in the client view, it is not really difficult to learn at first. However, once I got used to some of the concepts it provides, it became easier to maintain the content of the web page. Moreover, it comes with a big ecosystem of libraries to help developers create websites quickly with the help of available components. Developers have freedom in choosing suitable libraries to use in building the web application, because React JS only works at the client view, unlike Angular JS, which is built as an MVC framework, so to develop a full web application developers have to strictly follow the template it provides.
Vue JS
Vue JS is a frontend framework developed to make flexible web applications. It was created by the ex-Google employee Evan You in 2014. It shares many similarities with the two above: it uses components to build the web application, and it also has two-way data binding like Angular.

Vue is versatile and helps with multiple tasks; it comes with flexibility in designing the web application architecture. Vue's syntax is simple, so people just getting started with a JavaScript background can still learn it, and it supports TypeScript as well.
Why React?

The reason I chose React JS to build the frontend of the web application is its simplicity and the freedom it gives in building the web app. React JS lets developers freely build a web page using the libraries of their choice, unlike Angular. Besides, Angular requires developers to get used to TypeScript, and it is also hard for a beginner with frontend frameworks to start with, since Angular is considered the most complex of the three frontend web frameworks listed above. As for Vue JS, although it is simple and easy to approach at first, it is not as stable as the other two and its support community is small.
To conclude, each framework has its pros and cons depending on how programmers decide to implement their web application. For me, I found that React JS is a good framework to start with for building the web application.
2.1.2 JSX

Firstly, I want to introduce JSX before presenting some main concepts that are used in building the view of the web application in React JS.
const element = <h4>I am JSX!</h4>
The line above is neither HTML nor a string; it is called JSX, and this syntax is used in React JS to render elements. It is like a combination of JavaScript and XML (Extensible Markup Language). It is similar to a tag in HTML, but it can be customized. By using JSX, the developer can manipulate the data inside an element by embedding it inside the HTML tag.
const element = <h4>I am JSX!</h4>
const hello = <h1>Hello, {element}!</h1>
By default, when React DOM renders an element, it escapes any value embedded inside JSX to prevent XSS (cross-site scripting) attacks. Each element defined in JSX can then be rendered on the web page by React DOM:
const element = <h4>I am JSX!</h4>
const hello = <h1>Hello, {element}!</h1>
ReactDOM.render(
element,
document.getElementById('root')
);
Here, element is rendered inside a div tag with id="root".
2.1.3 Components and Props

Components in React JS let developers split a web page into independent, reusable pieces, each of which works in isolation from the others. A web page can contain lots of components; whenever a user changes the content of the web page, the current page is rewritten with a new component, or the data inside the current component is updated, without reloading the whole page.

The way components work is similar to how we use functions or methods in most programming languages. When we define a function we declare its parameters, which are like props in a component. Then we can call it inside ReactDOM.render to render that component on the web page we want to show.

There are two ways to define a component in React JS: as a JavaScript function or as an ES6 JavaScript class. A defined component can then be used like an HTML tag, with props passed as attributes:

const helloComponent = <HelloReact name="tuanminh" />
Since the UI is designed dynamically and its content can change over time, React JS comes with another concept to be used inside a component: state.
2.1.4 States
State is similar to props, but it is private to the component and it is changeable. State is introduced in React JS when we define a component as a JavaScript class. However, since v16.8, React has introduced a new concept called Hooks that lets developers use state inside a function without writing a class. During the time I read the React JS documentation, I only learned to define components as classes, because that is the basic way of building and customizing a component. Hooks are a newer concept, and most libraries that support React JS have switched to Hooks to build their components. When I start building the full web application in the next phase of the thesis, I will try to use them to build the components of the web application.
To define state in a React component, we first create a constructor and then create the state with its initial value:

class HelloReact extends React.Component {
  constructor(props) {
    super(props);
    this.state = {
      name: 'tuanminh',
    };
  }
}
One important concept related to using state in React is lifecycle methods. When loading content with many components into the web page, it is necessary to free up the resources taken by previous components when they are destroyed, or, after new content has loaded, we may want to run a function that operates the functionality of the current web page, such as getting data from the server. That is why React comes with componentDidMount() and componentWillUnmount() to control when a component is mounted or unmounted.
// This will change the state name of Component
this.setState({ name: 'tuanminh' })
2.1.5 Events

React lets components respond to user events such as clicks by passing a handler function to the element's event attribute:

<button onClick={handleClick}>Click me</button>

However, when we declare a function inside a JavaScript class, we need to use bind() to bind that function to the class when we use the this keyword. This is a JavaScript characteristic: the default binding of this in JavaScript is the global context, which is window, not the current class in which we declare our function.
class ReactComponent extends React.Component {
  constructor(props) {
    super(props);
    this.handleClick = this.handleClick.bind(this);
  }
}

In case you want to use a parameter in handleClick(), you can use a JavaScript arrow function or JavaScript bind() instead:
<button onClick={() => this.handleClick(param)}>Click me</button>
2.1.6 Code Splitting

Code splitting lets the application load only the code needed for the current view instead of one large bundle; components can be kept in separate files and imported where they are used:

import { Component } from 'react';
import NavButton from './NavButton';

class Navbar extends Component {
  // ...
}
2.1.7 Fragment

A component class in React JS allows developers to return only one element, which could be a div tag or a component imported from a library. However, with the help of Fragment, a component can return multiple elements.
2.2 Backend

Since React only provides the client view, we need to build a backend to communicate with the database and the crawler and get the data for the web app. React JS can easily integrate with most backend frameworks nowadays, as long as the backend provides an API for React JS to render the content in the view. Nowadays it is not hard to find a backend framework to create a full-stack web app. Yet we can also build the backend part from scratch: we can start by building the architecture, for example the common MVC architecture used in much software, not just web applications, then design the database and decide how to manipulate the data in it, and finally handle the routing and control of the view that will be rendered to the client.
These parts are now handled by most current backend web frameworks. Besides providing a way to manage the architecture more efficiently, with a framework we do not need to care much about the web service (localhost) that runs the server side or the security of our web application. It also lets developers deploy their web apps quickly, since frameworks already provide reusable pieces of code that most web applications need, for example cookie or session management for users when they log in to the web app; the framework handles that job immediately for you, yet we can customize that part to make it work independently.

One disadvantage of using a framework appears when we want to build a unique function based on an existing one it provides; the framework may not have the functionality or the API to let us customize it on our own. So we would need to rewrite that function entirely, and that piece of work is not simple if we have not looked at the way it is implemented. Eventually, there is still a way to change it, by overriding that function with our own function based on the one the framework provides.
The backend framework I chose to implement my web application is Django. It is recommended as one of the best frameworks on the Internet, since Django lets developers use Python to work with the server side of the web. Python is a high-level programming language, yet friendly to most programmers due to its straightforward syntax and support of various libraries and tools. Besides, Django comes with Django REST Framework, which provides RESTful APIs that work well with React JS.
Django follows the Model-View-Controller (MVC) architecture, in which the Model manages the database, the View is the client's view, and the Controller acts as the middleman between Model and View. When a client sends a request from the View to the server, the Controller handles the request and communicates with the Model to retrieve, update, or add new data to the database.
Django provides developers with three separate layers:

- The Model layer, which plays the role of the Model in the MVC architecture. This layer provides classes and modules for structuring and manipulating the data of the web application.
- The View layer, which is the Controller. This layer lets developers define request-response functions or classes that can be used to access data through the Model layer.
- The Template layer holds the templates, HTML files that are used to display the view to the client of the web application. However, in my web app React JS already handles this role, so I will not present much about this layer.
2.2.1 Model Layer

A model is defined as a Python class in models.py; for example:

from django.db import models

class RealEstateData(models.Model):
    url = models.CharField(max_length=100)
    post_title = models.CharField(max_length=50)

url and post_title are the two fields of the table named myapp_realestatedata.
To use models, we first need to add the name of the app to INSTALLED_APPS = [...] in settings.py. Then, when we define a model in models.py, we can save that model into the database as a table with two CLI commands, makemigrations and migrate (run as python manage.py makemigrations and python manage.py migrate). The first command is used when we first define a model or change some characteristics of the model, like adding attributes, changing the type of attributes, or deleting attributes. After that, we use migrate to apply those changes to the data inside the database.
Making queries
The models provided by Django give developers a database-abstraction API that lets you manipulate the data inside the database directly without writing any SQL queries. I will state some common methods:

- To create an object, we simply call the model's name with its available attributes, like instantiating a class in Python; then object.save() saves the object into the database:
data = RealEstateData(url='www.realestate.com', post_title='mua_ban_nha_dat')
data.save()
- To retrieve all objects, simply call the method objects.all().
- To retrieve only part of the objects, we can use filter(); the inputs of filter are attribute values.
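Putting these together, a minimal sketch of the query API against the RealEstateData model defined above could look like the following; the filter value is only an illustrative placeholder, and the import assumes the app is named myapp as in the earlier examples:

from myapp.models import RealEstateData

# Fetch every row of the table
all_posts = RealEstateData.objects.all()

# Fetch only the rows whose post_title matches a given value
selling_posts = RealEstateData.objects.filter(post_title='mua_ban_nha_dat')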
2.2.2 View Layer

In the View layer, incoming requests are routed to request-response functions (views) through URL patterns declared in urls.py:

from django.urls import path
from myapp import views
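A minimal sketch of how these imports are typically completed is shown below; the view name index and the URL path data/ are hypothetical placeholders rather than the project's real routes:

# urls.py -- minimal example; the view name and path are hypothetical
from django.urls import path
from myapp import views

urlpatterns = [
    # Map requests for /data/ to a view function defined in myapp/views.py
    path('data/', views.index, name='index'),
]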
2.3 AJAX Request - Response

AJAX stands for Asynchronous JavaScript And XML. It is the use of XMLHttpRequest to communicate with the backend; it allows the frontend to send and receive data in different formats like text, JSON, and HTML, and to update the data in the frontend view without having to refresh the current page.
React JS provides the interactive view for the client to browse the web page; however, it is just a view, so whenever the user triggers an event that makes a component's data change, the view should communicate with the server side to load the suitable data into the current component. To communicate with the backend, the changed component in React sends an AJAX request to the backend; the backend then responds with data to the requesting component, which updates the data it displays. This update does not refresh the whole page of the web application, so less data is loaded at a time, which improves the user's experience of the web app.
There are two common ways for React JS to integrate with the backend: using the fetch() API provided by the browser, or using the JavaScript library axios.

In my web app, I decided to use axios to handle AJAX requests and responses rather than fetch(), because axios is a well-known library for handling AJAX in most web applications developed with JavaScript, and it provides trusted security to protect the communication between the frontend and the backend. I would not recommend using jQuery's AJAX in a web application developed with React, since React and jQuery operate differently and will conflict if you put them in one place.
2.4 PostgreSQL
I chose PostgreSQL to set up the database of my web application. PostgreSQL is a powerful, open-source object-relational database system that uses and extends the SQL language, combined with many features that safely store and scale the most complicated data workloads. PostgreSQL is famous for its proven architecture, reliability, data integrity, robust feature set, extensibility, and the dedication of the open-source community that stands behind the software to consistently deliver performant and innovative solutions.

PostgreSQL comes with many features aimed at helping developers build applications and administrators protect data integrity and build fault-tolerant environments. Moreover, it helps developers manage the data no matter how big or small the dataset is. PostgreSQL is a free and open-source database, yet it is highly extensible. It is available on most operating systems and is used in many web applications as well as mobile and analytics applications.
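In a Django project, pointing the application at PostgreSQL is done through the DATABASES setting in settings.py. The sketch below is a minimal example; the database name, user, and password are placeholders, and the psycopg2 driver is assumed to be installed:

# settings.py -- minimal PostgreSQL configuration (credentials are placeholders)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'real_estate_db',
        'USER': 'postgres',
        'PASSWORD': 'password',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}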
2.5 Scrapy
The main function of the web application is to collect data from different real estate websites and display it in my web app at once, so I need to build a crawler to collect that information. Currently, I am researching Scrapy, a Python application framework for crawling websites.

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archiving. Although Scrapy was originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler. The main component of Scrapy is the spider.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
The code above is a sample spider that crawls the author and text of quotes on the sample website 'http://quotes.toscrape.com/tag/humor/'. To run the spider we use this command:

scrapy runspider quotes_spider.py -o quotes.jl
After the command finishes executing, it creates a file named quotes.jl, which contains the list of quotes in JSON Lines format:

quotes.jl
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
When we ran the command scrapy runspider quotes_spider.py, Scrapy looked for a spider definition inside the file and ran it through its crawler engine.
The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS selector, yield a Python dict with the extracted quote text and author, look for a link to the next page, and schedule another request using the same parse method as callback. Furthermore, I will go through some basic concepts and tools provided by Scrapy to perform crawling of real estate data.
2.5.1 Basic concept
Spiders
Spiders are classes defined to instruct Scrapy to perform a crawl on a certain website or list of websites and to process the crawled data into structured items. They are the place where you can customize how your data is crawled and control the number of websites involved.

To start writing your custom spider you need to inherit from the default Scrapy spider, class scrapy.spiders.Spider. This is the simplest spider that comes bundled with Scrapy; it provides some necessary methods and attributes for crawling:
- name: a string attribute used to define the name of the spider. It should be a unique name.
- allowed_domains: an attribute containing the list of valid domains defined by the user to control the requested links. If a requested link has a domain that does not belong to this list, it will not be allowed.
- start_urls: an attribute used to store the list of URLs where the spider will begin to crawl.
- start_requests: this method returns an iterable with the first Requests to crawl. It is called by Scrapy when the spider starts crawling. If the user uses start_urls, each URL in that list is passed to the start_requests method automatically.
- parse(response): the default callback method, called after Scrapy finishes downloading a response. Inside the parse method you can extract the data into smaller items and process them.

There are other attributes and methods, but since they are not necessary here I will skip them; you can refer to Scrapy's documentation. Besides the default Scrapy spider, there are other generic spiders, defined depending on the needs of the scraped site. By using generic spiders, users can easily customize the spider for their usage purpose:
- CrawlSpider: the most commonly used spider for crawling websites. This spider comes with specific rules to control how websites are crawled (a short sketch follows this list).
- XMLFeedSpider: a spider designed for parsing XML feeds.
- CSVFeedSpider: similar to XMLFeedSpider, except that it handles CSV files.
- SitemapSpider: allows users to crawl a site by discovering its URLs using Sitemaps.
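As an illustration of the rule mechanism mentioned above, a minimal CrawlSpider sketch is given below; the domain, URL patterns, and extracted fields are hypothetical and only show the shape of such a spider, not the actual crawler used in this thesis:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ListingSpider(CrawlSpider):
    name = 'listing'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/ban-nha-dat']

    # Follow pagination links, and send every detail page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'/page-\d+'), follow=True),
        Rule(LinkExtractor(allow=r'/tin-dang/'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),
        }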
Selectors
After Scrapy finishes downloading a response, it calls a callback function, the default being parse(response), or a custom one. At this point you have an HTML source and need to extract your data from it; to achieve this, Scrapy provides a mechanism called selectors. Selectors select parts of the HTML source using expressions such as CSS or XPath: CSS is a way to select components in HTML files, while XPath is used in XML documents. To query the response using XPath or CSS, we use response.xpath(<selector>) and response.css(<selector>), passing the component selector as a string. For example:
<div class="images">
  <a href="image1.png" alt="this is image">My image</a>
</div>

The CSS selector for the link is response.css('div.images a'), and the XPath will be:

response.xpath('//div[@class="images"]/a')
To get used to these kinds of selectors, you can refer to CSS selector and XPath usage guides on the Internet or other trusted resources for more details; Scrapy only uses them as a way to manipulate data once we already have the HTML source.

To extract textual data, we use the get() and getall() methods: get() returns a single result only, while getall() returns a list of results.
response.xpath('//div[@class="images"]/a').get()
response.xpath('//div[@class="images"]/a').getall()
In case you want to select an element attribute, for example the href attribute inside the <a> tag, Scrapy provides the attrib property of Selector to look up the attributes of an HTML element:

response.xpath('//div[@class="images"]/a').attrib["href"]
Selectors are often used inside the callback function parse(response); if you define a custom parse function, you can pass it in the callback attribute of Scrapy's request, which I will discuss in the Request & Response section.
Items
The term item in Scrapy represents a piece of structured information. When I scrape a real estate page, I try to get as much data as possible, like post type, address, email, phone, etc. However, that data is unstructured and may be hard to retrieve for later use, and items help me access this information conveniently. Scrapy supports four main types of items via itemadapter: dictionaries, Item objects, dataclass objects, and attrs objects.

- Dictionaries: similar to a Python dictionary.
- Item objects: Scrapy provides an Item class with a dict-like API (the user can access an Item like a dictionary). Besides, this kind of item provides the Field class to define field names, so the user can raise an error (like KeyError) when there is an exception to handle or to stop crawling the current page. This is the type of item I chose to use in Scrapy, since it is the easiest way to handle items for accessing and storing. By defining field names in an Item class, we can visually control which fields we want to scrape.
- Dataclass objects: dataclass() allows defining item classes with field names, so that item exporters can export all fields by default even if the first scraped object does not have values for all of them. Besides, dataclass() also lets each field declare its own type, like str, int, or float.
Working with Item Objects
To start using Item objects, I need to modify the items.py file created with the Scrapy project and create a class for the Item object:
import scrapy

class CrawlerItem(scrapy.Item):
    url = scrapy.Field()
    content = scrapy.Field()
    post_type = scrapy.Field()
Since an Item is a Python class, declaring an item is the same as declaring a class. However, modifying a field value of an item is done through dictionary-style access; for example, to change the value of post_type in an item:
item = CrawlerItem()
item['post_type'] = something
Besides the default item types provided by Scrapy, there is another item-related type, ItemAdapter. It is mainly used in the Item Pipeline or Spider Middleware; ItemAdapter is a wrapper class to interact with data container objects, providing a common interface to extract and set data without having to take the object's type into account.
Scrapy Shell
The Scrapy shell is an interactive shell for users to debug Scrapy code without running the spider. It is used to test XPath or CSS expressions to see whether they work successfully; the Scrapy shell also runs against the website I want to scrape, so I can know whether that page is scraped without errors or is intercepted. The Scrapy shell is similar to the Python shell if you have already worked with that. It also works well with IPython, a powerful interactive Python interpreter with syntax highlighting and many other modern features; if the user has IPython installed, the Scrapy shell will use it as the default shell.
Launch the Shell
To open the shell, we use the shell command like this:
scrapy shell <url>
Work with Scrapy Shell
The Scrapy shell provides some additional functions and objects to be used directly in the shell:
- shelp: print a help message with the list of available objects and functions.
- fetch(url[, redirect=True]): fetch a new response from the given URL and update all related objects.
- fetch(request): fetch a new response from the given request and update all related objects.
- view(response): open the given response in the browser; if the response is an HTML file, it is rendered from text into an HTML page.
- crawler: the current Crawler object.
- spider: the Spider used to handle the current URL.
- request: a Request object of the last fetched URL.
- response: the Response returned for the last fetched URL.
- settings: the current Scrapy settings.
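For example, a quick session against the sample site used earlier might look like this (outputs omitted; the selectors are only illustrative):

scrapy shell 'http://quotes.toscrape.com/tag/humor/'
>>> response.css('span.text::text').get()        # extract the first quote on the page
>>> fetch('http://quotes.toscrape.com/page/2/')  # download another page and update response
>>> view(response)                               # open the fetched response in the browser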
Item Pipeline
After an item is scraped in the spider, it is then sent to the Item Pipeline. This is where the item is manipulated, for example changing the format of the item or deciding whether to keep or drop it. To write an Item Pipeline, we need to create a class inside pipelines.py, in the same directory as the items.py file. The naming convention of the class is <Name of class>Pipeline; we can follow an existing class already implemented inside the file.
Work with Item Pipeline
The Item Pipeline provides some methods to process items; it is recommended to use ItemAdapter to work with items. To use ItemAdapter, you need to import it into the file:
from itemadapter import ItemAdapter
I will list some useful methods of the Item Pipeline below:

- process_item(self, item, spider): this method is called for every item pipeline component, and it must return an item object at the end. Inside this method, you can access the item by creating an ItemAdapter object:

adapter = ItemAdapter(item)
# Then you can access the item as usual:
adapter['post_type'] = something
To drop an item, I use the DropItem exception provided by scrapy.exceptions; you just need to import it like ItemAdapter:

from scrapy.exceptions import DropItem
# To drop an item, raise DropItem:
raise DropItem("drop item because post_type is missing: %s" % adapter['post_type'])
- open_spider(self, spider): this method is called when the spider is opened.
- close_spider(self, spider): this method is called when the spider is closed.
Activate Item Pipeline in the Scrapy Settings

To allow an Item Pipeline component to process items, you need to activate it in the Scrapy settings. The value assigned to each pipeline represents the order in which they run.
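Activation happens through the ITEM_PIPELINES setting in settings.py; a minimal sketch is shown below, where the module path and class name are hypothetical, and the integer (between 0 and 1000) decides the running order, lower values running first:

# settings.py -- enable a single pipeline; lower numbers run earlier
ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}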
Request & Response
A Scrapy Request is used to send a request to a given URL in the spider. If you define your crawled URLs in the start_urls list, Scrapy calls the default request for every URL in the list. However, in some cases you want to rewrite the request with additional arguments or a custom callback function instead of the default self.parse, so you need to make the request yourself to get the page source. The Response is the object returned when a request is executed; both Request and Response can be used by instantiating their objects. The Response is commonly used in middleware, for example to return a custom Response when a request is executed successfully.
Passing additional data to callback functions

Users may want to pass data to their callback function, which is called when the request is executed successfully. You can achieve this using the cb_kwargs attribute; the passed value should be a dictionary:
def parse(self, response):
    url = 'something'
    yield scrapy.Request(url=url, callback=self.parse_response,
                         cb_kwargs=dict(url=url))

def parse_response(self, response, url):
    yield {'url': url}  # the url passed from parse via cb_kwargs
2.5.2 Scrapy’s Architecture
Below is the picture illustrating the Data Flow in Scrapy:
Figure 2.2: Scrapy’s architecture
These flows are controlled by the execution engine and go in the order of the numbers shown in the picture:
1. The Spider sends a request to the Engine.
2. The Engine transfers it to the Scheduler for scheduling the request and continues listening for the next request to crawl.
3. The Scheduler returns the next request to crawl to the Engine.
4. The Engine now passes the request to the Downloader Middleware.
5. After the Downloader Middleware finishes downloading the requested item, it sends a Response to the Engine.
6. The Engine then sends the Response back to the Spider for item processing.
7. The Spider processes the Response and returns the scraped items and a new request to the Engine.
8. The Engine sends the scraped items to the Item Pipeline and schedules the new request.
9. The process repeats from step 1.
2.5.3 Working with Dynamically-loaded Content
Another practice with Scrapy's selectors that I want to demonstrate is working with dynamically loaded content (content created by JavaScript). The data can sit inside a script tag, which stores the JavaScript code or files used by the page, or it may only be retrieved when the page is loaded in a web browser, which Scrapy's downloader cannot handle. Scrapy's documentation presents a section discussing these cases: parsing JavaScript code and using a headless browser.

- Parsing JavaScript code is used when your data is inside the JavaScript code, usually inside a script tag.
- Using a headless browser is used when the web page requires running some JavaScript code before the HTML page is fully loaded; since Scrapy's spider only downloads the raw HTML source once, without running JavaScript, that source may not contain the data I need to capture.
Parsing JavaScript code
Since the desired data is hardcoded inside a script tag, to retrieve it I first need to get the JavaScript code:

- If the code is inside a file, I can read it using response.text.
- If the code is within a script tag, I can use Scrapy's selectors to extract the whole text of that tag, then try to retrieve the desired data inside.
Now, I want to get only the specific data, not the whole bunch of JS code. There are many ways demonstrated by Scrapy, but I prefer using a Python library, js2xml, to handle this problem. I use this library to convert the JS code into an XML document; in this way the text looks more like an HTML file in structure, so I can use Scrapy selectors to handle it efficiently. For example, in my thesis work I want to get the data about the location of a real estate post, its longitude and latitude, and this data is hardcoded on the page. By inspecting the elements of the page, I can locate the position of the script tag that contains that data.
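A minimal sketch of this conversion step is shown below; the XPath used to locate the script tag is an assumption for illustration, since the exact page structure is site-specific:

import js2xml
import lxml.etree
from scrapy.selector import Selector

# Grab the JavaScript source of the script tag that holds the coordinates
js_code = response.xpath('//script[contains(., "latitude")]/text()').get()

# Parse the JS code and serialize the resulting tree to XML text,
# so the usual XPath selectors can be applied to it
xml_text = lxml.etree.tostring(js2xml.parse(js_code), encoding='unicode')
selector = Selector(text=xml_text)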
Now I can use selectors to retrieve the data normally:
latitude = selector.xpath('//property[@name="latitude"]/number').attrib["value"]
item['latitude'] = float(latitude)
longitude = selector.xpath('//property[@name="longitude"]/number').attrib["value"]
item['longitude'] = float(longitude)
Splash - A JavaScript rendering service
Before discussing how to use a headless browser, I want to walk through Splash, a JavaScript rendering service written in Python 3. Splash is also supported by Scrapy, along with a library, scrapy-splash, for seamless integration between them. Splash executes scripts written in Lua to render custom JavaScript in the page context, and it works well on many web pages using JavaScript. The advantage of using Splash is that it reduces the CPU resources used compared to running a full browser to capture a specific page, since it is a lightweight web browser with an HTTP API.
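Assuming scrapy-splash has been installed and configured in the project settings as described in its documentation, and a Splash instance is running (by default on http://localhost:8050), a spider can ask Splash to render a page before parsing it; a minimal sketch with a placeholder URL is:

from scrapy_splash import SplashRequest

# Inside a Spider subclass: ask Splash to render the page, waiting 2 seconds
# for its JavaScript to finish, before calling self.parse on the result.
def start_requests(self):
    yield SplashRequest('https://example.com/listing', self.parse, args={'wait': 2})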
Splash runs on the Docker service, so you need to have Docker installed before using Splash. At this stage of my thesis, I have already worked a little with Splash; however, my problem could not be handled by this service, so I am going to use a real browser as my best solution. I will explain the details of this problem in the Problems and Solutions section.
Using a headless browser
The main idea of this solution is to send an actual request from a browser to the web page, download the HTML source, and then return it to Scrapy. Some websites have JavaScript code embedded inside that only executes when the page loads, while Scrapy captures the raw HTML source only and does not run JavaScript. This results in some significant data missing, or Scrapy's spider may be prevented from downloading the page because the site considers this behavior to be a bot. Besides, trying to get past this protection is difficult, since the website does not want its information to be scraped; it uses an external service that stands between the client (my spider) and the server when handling the request. That service uses its anti-bot mechanism to check whether an incoming request is sent by a real client or by a bot and then lets it pass or blocks it. Therefore, it takes a lot of time and effort to overcome it, and letting the spider use a browser to send requests to the web page is the best choice at this time; this way, that middleman lets our requests into the server smoothly.
Running a browser to download multiple pages at a time consumes a lot of CPU, and the download speed is much lower compared to Scrapy's own requests. However, it is worth doing, since most of the crawled data I have now comes from this page. I will go into more detail about how I combine a headless browser with Scrapy in the Problems and Solutions section.
A simple way to use a browser is Selenium, a tool supporting the automation of web browsers.
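A minimal sketch of this idea, assuming Chrome and chromedriver are installed and using a placeholder URL, is shown below: the browser renders the page, and the resulting HTML is wrapped in an HtmlResponse so that Scrapy-style selectors can be applied to it as usual:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from scrapy.http import HtmlResponse

# Start Chrome without a visible window
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

url = 'https://example.com/listing'
driver.get(url)              # the browser executes the page's JavaScript
html = driver.page_source    # fully rendered HTML
driver.quit()

# Hand the rendered HTML back to Scrapy-style parsing
response = HtmlResponse(url=url, body=html, encoding='utf-8')
titles = response.css('h1::text').getall()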