GRADUATION THESIS
DEVELOPMENT OF A WEB APPLICATION TO MONITOR
STATISTICAL DATA OF REAL ESTATE
PROPERTIES
DEPARTMENT OF SOFTWARE ENGINEERING
INSTRUCTOR: Assoc. Prof. Quan Thanh Tho
REVIEWER: Assoc. Prof. Bui Hoai Thang
STUDENT: Pham Minh Tuan 1752595
Ho Chi Minh City, July 2021
FACULTY: Computer Science & Engineering          GRADUATION THESIS ASSIGNMENT
DEPARTMENT: Computer Science
Note: Students must attach this sheet to the first page of the thesis report.

FULL NAME: Phạm Minh Tuấn          STUDENT ID: 1752595

1. Thesis title:
Development of a Web application to monitor statistical data of real estate properties

2. Tasks (required content and initial data):
✔ Investigate Selenium WebDriver to collect data from anti-crawling websites
✔ Investigate and build a price-predicting model using Linear Regression based on collected real estate data
✔ Crawl data every day and update new data into the database
✔ Investigate Python libraries to implement the model and train it with the collected data

3. Date of thesis assignment: 03/09/2015
4. Date of completion: 20/12/2015
5. Supervisor: Assoc. Prof. Dr. Quản Thành Thơ          Supervised part:

SECTION FOR THE FACULTY AND DEPARTMENT:
Reviewer (preliminary grading):
Date: ..../..../....

THESIS DEFENSE EVALUATION SHEET
(For the supervisor/reviewer)

1. Student's full name: Phạm Minh Tuấn
2. Topic: Development of a Web application to monitor statistical data of real estate properties
3. Supervisor/reviewer: Assoc. Prof. Dr. Quản Thành Thơ
4. Overview of the report:
6. Main strengths of the thesis:
The student has successfully developed a system as required, which consists of the following features: (i) crawling data from real estate websites; (ii) normalizing and preprocessing the data; and (iii) using AI techniques for data forecasting.
7. Main shortcomings of the thesis:
The student should have conducted more analysis of the business requirements of the project.
8. Recommendation: Approved for defense / Needs additional work before defense / Not approved for defense
9. Three questions the student must answer before the committee:
10. Overall assessment (in words: excellent, good, average):          Grade: 7.6/10

Signature (full name)
Assoc. Prof. Dr. Quản Thành Thơ
August 6, 2021

THESIS DEFENSE EVALUATION SHEET
(For the supervisor/reviewer)

1. Student's full name: Phạm Minh Tuấn
2. Topic: Development of a web application to monitor statistical data of real estate properties
3. Supervisor/reviewer: Bùi Hoài Thắng
4. Overview of the report:
6. Main strengths of the thesis:
- Showed an understanding of crawling data, especially real estate data, from some popular real estate websites
- Showed an understanding of some techniques for predicting prices based on the collected real estate prices
- Designed and implemented a web-based application to show statistics of the collected data, and performed some experiments on predicting prices
7. Main shortcomings of the thesis:
8. Recommendation: Approved for defense [X] / Needs additional work before defense / Not approved for defense
9. Three questions the student must answer before the committee:
a.
b.
c.
10. Overall assessment (in words: excellent, good, average): Good          Grade: 6.5/10

Signature (full name)
Bùi Hoài Thắng
I guarantee that the thesis's content is written and illustrated on my own. All the content that I read from the documentation of the technologies used in my web application is listed on the References page, and I promise that I do not copy or take the content of others' theses without permission.
First, I want to say thanks to my instructor, Assoc. Prof. Quan Thanh Tho, for guiding me on my topic as well as giving me the knowledge to implement the web application from the first step.

Besides, I am extremely happy because of my family's support during the time I did the thesis and throughout three years of my study at Ho Chi Minh City University of Technology. This work could not have been finished well if I had not received so much encouragement from my parents and my little brother.
The real estate market plays an important role in developing Vietnam's economy. This industry has started growing fast in recent years to respond to the needs of customers, and some real estate companies want to collect as much data in this area as possible to analyze the specific areas that attract the most investors as well as sellers. However, the data about real estate is enormous and spread over lots of websites, so it is hard for them to gather all the information at once. This thesis introduces a web application that collects information from real estate websites so that real estate agents can give a precise analysis of real property resources.
Contents

List of Figures

1.1 Introduction
1.2 Objectives and Scope

2 Background Knowledge
2.1 Frontend
2.1.1 Single-page Application (SPA)
2.1.2 JSX
2.1.3 Components and Props
2.1.4 States
2.1.5 Events
2.1.6 Code Splitting
2.1.7 Fragment
2.2 Backend
2.2.1 Model Layer
2.2.2 View Layer
2.2.3 Template Layer
2.3 AJAX Request - Response
2.4 PostgreSQL
2.5 Scrapy
2.5.1 Basic concept
2.5.2 Scrapy's Architecture
2.5.3 Working with Dynamically-loaded Content
2.5.4 Selenium
2.6 Linear Regression
2.6.1 Linear Model
2.6.2 Cost Function
2.7 Polynomial Features
2.8 Evaluation Metrics
2.8.1 Mean Squared Error (MSE)
2.8.2 R-Squared Score (R2)
2.8.3 Cross Validation Score
2.9 Underfitting and Overfitting
2.10 Regularization
2.10.1 Ridge Regression
2.10.2 Lasso Regression
2.11 Feature Engineering
2.11.1 Handling missing data
2.11.2 Outliers removal
2.11.3 Log Transformation
2.12 Supported Python Libraries
2.12.1 Numpy
2.12.2 Pandas
2.12.3 Matplotlib
2.12.4 Scikit-Learn

3 System Implementation
3.1 Use-case Diagram
3.2 Architecture Diagram
3.3 Database
3.4 Workflow
3.5 Crawling Data
3.5.1 Handling duplicated data
3.5.2 Formatting Item's name
3.5.3 Handling web pages with Selenium
3.6 Data Modeling
3.6.1 Data Preparation
3.6.2 Training Models
3.6.3 Model Evaluation Result
3.7 Web Application
3.7.1 Register page
3.7.2 Login page
3.7.3 Dashboard page
3.7.4 Data page
3.7.5 Price prediction page
3.7.6 Admin page

4 Summary
4.1 Achievement
4.2 Future Development
4.2.1 Thesis limitation
4.2.2 Further development
List of Figures

2.1 Single-page Application
2.2 Scrapy's architecture
2.3 How Selenium WebDriver works
2.4 Non-linear dataset
2.5 Five-fold cross-validation
2.6 Underfitting & Overfitting
2.7 Boxplot components
2.8 Matplotlib Boxplot
2.9 Skewed Data
2.10 Data before using Log Transform
2.11 Data after using Log Transform
3.1 Use-case diagram
3.2 Architecture diagram
3.3 CRED System Database Schema
3.4 Scrapy collected data sample
3.5 Cu Chi Selling Land dataset
3.6 Dataset in Ba Thien, Nhuan Duc, Cu Chi
3.7 Original dataset in Nguyen Huu Canh, 22, Binh Thanh
3.8 Dataset after using Log Transformation
3.9 Dataset Histogram before & after Log Transformation
3.10 Dataset after removing duplicates
3.11 Boxplot of area values
3.12 Divide dataset into equal parts
3.13 Outliers detection after applying on each part
3.14 Polynomial Regression degree chosen by train set's RMSE
3.15 Polynomial Regression degree chosen by validation set's RMSE
3.16 Register page
3.17 Login page
3.18 Dashboard page
3.19 Dashboard page
3.20 Data page
3.21 Price prediction page
3.22 Admin page - Manage users
3.23 Admin page - Retrain models
In this chapter, I am going to introduce my thesis topic, then I will present the objectives and the scope of my thesis for the web application.

1.1 Introduction
1.2 Objectives and Scope
1.1 Introduction
Crawling data is a common feature that appears in many web applications whose main function is collecting data. However, a web app that collects and monitors data in the real estate field for ordinary real estate agents is rare. Therefore, agents need software to help them synthesize data from other real estate websites and then start analyzing the collected data.

In the thesis phase, I will implement a web application that collects information from different real estate websites and displays it on my web page. Clients can register accounts, log in to the web app, and view the data. The web app also provides filter functionality to let users filter the data for custom viewing, and it helps users predict the price of a specific real estate property based on elements like area, street, ward, and district. Moreover, the web app gives users a general view of the real estate data through the charts displayed on the Dashboard page.
1.2 Objectives and Scope
The objectives of the thesis are:

- Implement a web application to show the collected data.
- Build a custom crawler to get the data from three main real estate websites: batdongsan.com.vn, homedy.com, and propzy.vn.
- Provide some real estate data statistics based on the collected data.
- Build machine learning models to predict the price of real estate properties based on the collected data.
For the topic scope:

- The web app is limited to a few users only (fewer than ten) and it works locally.
- The collected real estate posts are limited to Ho Chi Minh City only, and the post type is selling.
In this chapter, I am going to present the knowledge I researched to implement the web application, including some definitions of the technologies I will use in both the Frontend and Backend parts.
2.1 Frontend
2.2 Backend
2.3 AJAX Request - Response
2.4 PostgreSQL
2.5 Scrapy
2.6 Linear Regression
2.7 Polynomial Features
2.8 Evaluation Metrics
2.9 Underfitting and Overfitting
2.10 Regularization
2.11 Feature Engineering
2.12 Supported Python Libraries
2.1 Frontend
2.1.1 Single-page Application (SPA)
A single-page application is a web application that can interact with the client dynamically using only a single web page at a time. Whenever a user changes the content of the web page, the page is rewritten with the new content instead of loading a new page.

Each page of the web application usually has a JavaScript layer at the header or bottom to communicate with the web services on the server side. The content of the web page is loaded from the server side in response to events that the user triggers on the current web page, such as clicking a link or a button.
Figure 2.1: Single-page Application
Nowadays, with the development of the technologies used to build web applications, especially in the frontend area, there are three popular frameworks that most developers use: React JS, Angular JS, and Vue JS. I will state the comparison between them and why I chose React JS to implement the frontend of my web application.

Angular JS
Angular JS is a full-fledged MVC framework which provides a set of predefined libraries and functionality for building a web application. It is developed and maintained by Google, was first released in 2010, and is based on TypeScript rather than JavaScript itself.
Angular is unique with its built-in two-way data binding feature, unlike React, which uses one-way data binding. Both React JS and Angular JS use components to build the web application, which means that a component (or a model) rendered in the view can hold different data depending on the web page the user is seeing. In Angular, when the data in the view changes it also leads to a change of the data in the model, while in React it does not. This makes it more convenient for developers to build a web application; however, in my opinion, it would be harder to manage the data, as well as to debug, when the web application becomes larger.
React JS
React JS is actually a JavaScript library, not a framework like the other two, and it is developed and maintained by Facebook. It was released in March 2013 and is described as "A JavaScript library for building user interfaces".

Since React JS is mostly used to create interactive UIs in the client view, it is not really difficult to learn at first. However, once I got used to some of the concepts it provides, it became easier to maintain the content of the web page. Moreover, it comes with a big ecosystem of libraries to help developers create websites quickly with the help of available components. Developers have freedom in choosing suitable libraries to use in building the web application, because React JS only works at the client view, unlike Angular JS, which is built as an MVC framework, so to develop a full web application developers have to strictly follow the template it provides.
Vue JS
Vue JS is a frontend framework developed to make flexible web applications. It was created by the ex-Google employee Evan You in 2014. It shares many similarities with the two above: it uses components to build the web application, and it also has two-way data binding like Angular.

Vue is versatile and helps with multiple tasks; it comes with flexibility in designing the web application architecture. Vue's syntax is simple, so people just getting started with a JavaScript background can still learn it, and it supports TypeScript as well.
Why React?

The reason I chose React JS to build the frontend of the web application is its simplicity and the freedom it gives in building the web app. React JS lets developers freely build a web page using the libraries of their choice, unlike Angular. Besides, Angular requires developers to get used to TypeScript, and it is also hard for a beginner with frontend frameworks to start with, since Angular is considered the most complex of the three frontend web frameworks listed above. As for Vue JS, although it is simple and easy to approach at first, it is not as stable as the other two and its support community is small.
To conclude, each framework has its pros and cons depending on how programmers decide to implement their web application. For me, I found that React JS is a good framework to start with for building the web application.
2.1.2 JSX

Firstly, I want to introduce JSX before presenting some main concepts that are used in building the view of the web application in React JS.
const element = <h4>I am JSX!</h4>
The line above is neither HTML nor a string; it is called JSX, and this syntax is used in React JS to render elements. It is like a combination of JavaScript and XML (Extensible Markup Language). It is similar to a tag in HTML, but it can be customized. By using JSX, the developer can manipulate the data inside an element by embedding it inside the HTML tag.
const element = <h4>I am JSX!</h4>
const hello = <h1>Hello, {element}!</h1>
By default, when React DOM renders an element, it escapes any value embedded inside JSX to prevent XSS (cross-site scripting) attacks. Each element defined in JSX can then be rendered on the web page by React DOM:
const element = <h4>I am JSX!</h4>
const hello = <h1>Hello, {element}!</h1>
ReactDOM.render(
element,
document.getElementById('root')
);
Here, element is rendered inside a div tag with id="root".
2.1.3 Components and Props

Components in React JS let developers split a web page into independent, reusable pieces, each of which works in isolation from the others. A web page can contain lots of components; whenever a user changes the content of the web page, the current page is rewritten with a new component, or the data inside the current component is updated, without reloading the whole page.

The way components work is similar to how we use functions or methods in most programming languages. When we define a function we declare its parameters, which are like props in a component. Then we can call it inside ReactDOM.render to render that component on the web page we want to show.

There are two ways to define a component in React JS: as a JavaScript function or as an ES6 JavaScript class. A defined component can then be used like an HTML tag, with props passed as attributes:

const helloComponent = <HelloReact name="tuanminh" />
Since the UI is designed dynamically and its content can change over time, React JS comes with another concept to be used inside a component: state.
2.1.4 States
State is similar to props, but it is private to the component and it is changeable. State is introduced in React JS when we define a component as a JavaScript class. However, since v16.8, React has introduced a new concept called Hooks that lets developers use state inside a function without writing a class. During the time I read the React JS documentation, I only learned to define components as classes, because that is the basic way of building and customizing a component. Hooks are a newer concept, and most libraries that support React JS have switched to Hooks to build their components. When I start building the full web application in the next phase of the thesis, I will try to use them to build the components of the web application.
To define state in a React component, we first create a constructor and then create the state with its initial value:

class HelloReact extends React.Component {
  constructor(props) {
    super(props);
    this.state = {
      name: 'tuanminh',
    };
  }
}
One important concept related to using state in React is lifecycle methods. When loading content with many components into the web page, it is necessary to free up the resources taken by previous components when they are destroyed, or, after new content has loaded, we may want to run a function that operates the functionality of the current web page, such as getting data from the server. That is why React comes with componentDidMount() and componentWillUnmount() to control when a component is mounted or unmounted.
// This will change the state name of Component
this.setState({ name: 'tuanminh' })
2.1.5 Events

React lets components respond to user events such as clicks by passing a handler function to the element's event attribute:

<button onClick={handleClick}>Click me</button>

However, when we declare a function inside a JavaScript class, we need to use bind() to bind that function to the class when we use the this keyword. This is a JavaScript characteristic: the default binding of this in JavaScript is the global context, which is window, not the current class in which we declare our function.
class ReactComponent extends React.Component {
  constructor(props) {
    super(props);
    this.handleClick = this.handleClick.bind(this);
  }
}

In case you want to use a parameter in handleClick(), you can use a JavaScript arrow function or JavaScript bind() instead:
<button onClick={() => this.handleClick(param)}>Click me</button>
2.1.6 Code Splitting

Code splitting lets the application load only the code needed for the current view instead of one large bundle; components can be kept in separate files and imported where they are used:

import { Component } from 'react';
import NavButton from './NavButton';

class Navbar extends Component {
  // ...
}
2.1.7 Fragment

A component class in React JS allows developers to return only one element, which could be a div tag or a component imported from a library. However, with the help of Fragment, a component can return multiple elements.
2.2 Backend

Since React only provides the client view, we need to build a backend to communicate with the database and the crawler and get the data for the web app. React JS can easily integrate with most backend frameworks nowadays, as long as the backend provides an API for React JS to render the content in the view. Nowadays it is not hard to find a backend framework to create a full-stack web app. Yet we can also build the backend part from scratch: we can start by building the architecture, for example the common MVC architecture used in much software, not just web applications, then design the database and decide how to manipulate the data in it, and finally handle the routing and control of the view that will be rendered to the client.
These parts are now handled by most current backend web frameworks. Besides providing a way to manage the architecture more efficiently, with a framework we do not need to care much about the web service (localhost) that runs the server side or the security of our web application. It also lets developers deploy their web apps quickly, since frameworks already provide reusable pieces of code that most web applications need, for example cookie or session management for users when they log in to the web app; the framework handles that job immediately for you, yet we can customize that part to make it work independently.

One disadvantage of using a framework appears when we want to build a unique function based on an existing one it provides; the framework may not have the functionality or the API to let us customize it on our own. So we would need to rewrite that function entirely, and that piece of work is not simple if we have not looked at the way it is implemented. Eventually, there is still a way to change it, by overriding that function with our own function based on the one the framework provides.
The backend framework I chose to implement my web application is Django. It is recommended as one of the best frameworks on the Internet, since Django lets developers use Python to work with the server side of the web. Python is a high-level programming language, yet friendly to most programmers due to its straightforward syntax and support of various libraries and tools. Besides, Django comes with Django REST Framework, which provides RESTful APIs that work well with React JS.
Django follows the Model-View-Controller (MVC) architecture, in which the Model manages the database, the View is the client's view, and the Controller acts as the middleman between Model and View. When a client sends a request from the View to the server, the Controller handles the request and communicates with the Model to retrieve, update, or add new data to the database.
Django provides developers with three separate layers:

- The Model layer, which plays the role of the Model in the MVC architecture. This layer provides classes and modules for structuring and manipulating the data of the web application.
- The View layer, which is the Controller. This layer lets developers define request-response functions or classes that can be used to access data through the Model layer.
- The Template layer holds the templates, HTML files that are used to display the view to the client of the web application. However, in my web app React JS already handles this role, so I will not present much about this layer.
2.2.1 Model Layer

A model is defined as a Python class in models.py; for example:

from django.db import models

class RealEstateData(models.Model):
    url = models.CharField(max_length=100)
    post_title = models.CharField(max_length=50)

url and post_title are the two fields of the table named myapp_realestatedata.
To use models, we first need to add the name of the app to INSTALLED_APPS = [...] in settings.py. Then, when we define a model in models.py, we can save that model into the database as a table with two CLI commands, makemigrations and migrate (run as python manage.py makemigrations and python manage.py migrate). The first command is used when we first define a model or change some characteristics of the model, like adding attributes, changing the type of attributes, or deleting attributes. After that, we use migrate to apply those changes to the data inside the database.
Making queries
The models provided by Django give developers a database-abstraction API that lets you manipulate the data inside the database directly without writing any SQL queries. I will state some common methods:

- To create an object, we simply call the model's name with its available attributes, like instantiating a class in Python; then object.save() saves the object into the database:
data = RealEstateData(url='www.realestate.com', post_title='mua_ban_nha_dat')
data.save()
- To retrieve all objects, simply call the method objects.all().
- To retrieve only part of the objects, we can use filter(); the inputs of filter are attribute values.
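Putting these together, a minimal sketch of the query API against the RealEstateData model defined above could look like the following; the filter value is only an illustrative placeholder, and the import assumes the app is named myapp as in the earlier examples:

from myapp.models import RealEstateData

# Fetch every row of the table
all_posts = RealEstateData.objects.all()

# Fetch only the rows whose post_title matches a given value
selling_posts = RealEstateData.objects.filter(post_title='mua_ban_nha_dat')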
2.2.2 View Layer

In the View layer, incoming requests are routed to request-response functions (views) through URL patterns declared in urls.py:

from django.urls import path
from myapp import views
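A minimal sketch of how these imports are typically completed is shown below; the view name index and the URL path data/ are hypothetical placeholders rather than the project's real routes:

# urls.py -- minimal example; the view name and path are hypothetical
from django.urls import path
from myapp import views

urlpatterns = [
    # Map requests for /data/ to a view function defined in myapp/views.py
    path('data/', views.index, name='index'),
]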
2.3 AJAX Request - Response

AJAX stands for Asynchronous JavaScript And XML. It is the use of XMLHttpRequest to communicate with the backend; it allows the frontend to send and receive data in different formats like text, JSON, and HTML, and to update the data in the frontend view without having to refresh the current page.
React JS provides the interactive view for the client to browse the web page; however, it is just a view, so whenever the user triggers an event that makes a component's data change, the view should communicate with the server side to load the suitable data into the current component. To communicate with the backend, the changed component in React sends an AJAX request to the backend; the backend then responds with data to the requesting component, which updates the data it displays. This update does not refresh the whole page of the web application, so less data is loaded at a time, which improves the user's experience of the web app.
There are two common ways for React JS to integrate with the backend: using the fetch() API provided by the browser, or using the JavaScript library axios.

In my web app, I decided to use axios to handle AJAX requests and responses rather than fetch(), because axios is a well-known library for handling AJAX in most web applications developed with JavaScript, and it provides trusted security to protect the communication between the frontend and the backend. I would not recommend using jQuery's AJAX in a web application developed with React, since React and jQuery operate differently and will conflict if you put them in one place.
2.4 PostgreSQL
I chose PostgreSQL to set up the database of my web application. PostgreSQL is a powerful, open-source object-relational database system that uses and extends the SQL language, combined with many features that safely store and scale the most complicated data workloads. PostgreSQL is famous for its proven architecture, reliability, data integrity, robust feature set, extensibility, and the dedication of the open-source community that stands behind the software to consistently deliver performant and innovative solutions.

PostgreSQL comes with many features aimed at helping developers build applications and administrators protect data integrity and build fault-tolerant environments. Moreover, it helps developers manage the data no matter how big or small the dataset is. PostgreSQL is a free and open-source database, yet it is highly extensible. It is available on most operating systems and is used in many web applications as well as mobile and analytics applications.
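In a Django project, pointing the application at PostgreSQL is done through the DATABASES setting in settings.py. The sketch below is a minimal example; the database name, user, and password are placeholders, and the psycopg2 driver is assumed to be installed:

# settings.py -- minimal PostgreSQL configuration (credentials are placeholders)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'real_estate_db',
        'USER': 'postgres',
        'PASSWORD': 'password',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}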
2.5 Scrapy
The main function of the web application is to collect data from different real estate websites and display it in my web app at once, so I need to build a crawler to collect that information. Currently, I am researching Scrapy, a Python application framework for crawling websites.

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archiving. Although Scrapy was originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler. The main component of Scrapy is the spider.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
The code above is a sample spider that crawls the author and text of quotes on the sample website 'http://quotes.toscrape.com/tag/humor/'. To run the spider we use this command:

scrapy runspider quotes_spider.py -o quotes.jl
After the command finishes executing, it creates a file named quotes.jl, which contains the list of quotes in JSON Lines format:

quotes.jl
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
When we ran the command scrapy runspider quotes_spider.py, Scrapy looked for a spider definition inside the file and ran it through its crawler engine.
The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS selector, yield a Python dict with the extracted quote text and author, look for a link to the next page, and schedule another request using the same parse method as callback. Furthermore, I will go through some basic concepts and tools provided by Scrapy to perform crawling of real estate data.
2.5.1 Basic concept
Spiders
Spiders are classes defined to instruct Scrapy to perform a crawl on a certain website or list of websites and to process the crawled data into structured items. They are the place where you can customize how your data is crawled and control the number of websites involved.

To start writing your custom spider you need to inherit from the default Scrapy spider, class scrapy.spiders.Spider. This is the simplest spider that comes bundled with Scrapy; it provides some necessary methods and attributes for crawling:
- name: a string attribute used to define the name of the spider. It should be a unique name.
- allowed_domains: an attribute containing the list of valid domains defined by the user to control the requested links. If a requested link has a domain that does not belong to this list, it will not be allowed.
- start_urls: an attribute used to store the list of URLs where the spider will begin to crawl.
- start_requests: this method returns an iterable with the first Requests to crawl. It is called by Scrapy when the spider starts crawling. If the user uses start_urls, each URL in that list is passed to the start_requests method automatically.
- parse(response): the default callback method, called after Scrapy finishes downloading a response. Inside the parse method you can extract the data into smaller items and process them.

There are other attributes and methods, but since they are not necessary here I will skip them; you can refer to Scrapy's documentation. Besides the default Scrapy spider, there are other generic spiders, defined depending on the needs of the scraped site. By using generic spiders, users can easily customize the spider for their usage purpose:
- CrawlSpider: the most commonly used spider for crawling websites. This spider comes with specific rules to control how websites are crawled (a short sketch follows this list).
- XMLFeedSpider: a spider designed for parsing XML feeds.
- CSVFeedSpider: similar to XMLFeedSpider, except that it handles CSV files.
- SitemapSpider: allows users to crawl a site by discovering its URLs using Sitemaps.
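As an illustration of the rule mechanism mentioned above, a minimal CrawlSpider sketch is given below; the domain, URL patterns, and extracted fields are hypothetical and only show the shape of such a spider, not the actual crawler used in this thesis:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ListingSpider(CrawlSpider):
    name = 'listing'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/ban-nha-dat']

    # Follow pagination links, and send every detail page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'/page-\d+'), follow=True),
        Rule(LinkExtractor(allow=r'/tin-dang/'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),
        }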
Selectors
After Scrapy finishes downloading a response, it calls a callback function, the default being parse(response), or a custom one. At this point you have an HTML source and need to extract your data from it; to achieve this, Scrapy provides a mechanism called selectors. Selectors select parts of the HTML source using expressions such as CSS or XPath: CSS is a way to select components in HTML files, while XPath is used in XML documents. To query the response using XPath or CSS, we use response.xpath(<selector>) and response.css(<selector>), passing the component selector as a string. For example:
<div class="images">
  <a href="image1.png" alt="this is image">My image</a>
</div>

The CSS selector for the link is response.css('div.images a'), and the XPath will be:

response.xpath('//div[@class="images"]/a')
To get used to these kinds of selectors, you can refer to CSS selector and XPath usage guides on the Internet or other trusted resources for more details; Scrapy only uses them as a way to manipulate data once we already have the HTML source.

To extract textual data, we use the get() and getall() methods: get() returns a single result only, while getall() returns a list of results.
response.xpath('//div[@class="images"]/a').get()
response.xpath('//div[@class="images"]/a').getall()
In case you want to select an element attribute, for example the href attribute inside the <a> tag, Scrapy provides the attrib property of Selector to look up the attributes of an HTML element:

response.xpath('//div[@class="images"]/a').attrib["href"]
Selectors are often used inside the callback function parse(response); if you define a custom parse function, you can pass it in the callback attribute of Scrapy's request, which I will discuss in the Request & Response section.
Items
The term item in Scrapy represents a piece of structured information. When I scrape a real estate page, I try to get as much data as possible, like post type, address, email, phone, etc. However, that data is unstructured and may be hard to retrieve for later use, and items help me access this information conveniently. Scrapy supports four main types of items via itemadapter: dictionaries, Item objects, dataclass objects, and attrs objects.

- Dictionaries: similar to a Python dictionary.
- Item objects: Scrapy provides an Item class with a dict-like API (the user can access an Item like a dictionary). Besides, this kind of item provides the Field class to define field names, so the user can raise an error (like KeyError) when there is an exception to handle or to stop crawling the current page. This is the type of item I chose to use in Scrapy, since it is the easiest way to handle items for accessing and storing. By defining field names in an Item class, we can visually control which fields we want to scrape.
- Dataclass objects: dataclass() allows defining item classes with field names, so that item exporters can export all fields by default even if the first scraped object does not have values for all of them. Besides, dataclass() also lets each field declare its own type, like str, int, or float.
Working with Item Objects
To start using Item objects, I need to modify the items.py file created with the Scrapy project and create a class for the Item object:
import scrapy

class CrawlerItem(scrapy.Item):
    url = scrapy.Field()
    content = scrapy.Field()
    post_type = scrapy.Field()
Since an Item is a Python class, declaring an item is the same as declaring a class. However, modifying a field value of an item is done through dictionary-style access; for example, to change the value of post_type in an item:
item = CrawlerItem()
item['post_type'] = something
Besides the default item types provided by Scrapy, there is another item-related type, ItemAdapter. It is mainly used in the Item Pipeline or Spider Middleware; ItemAdapter is a wrapper class to interact with data container objects, providing a common interface to extract and set data without having to take the object's type into account.
Scrapy Shell
The Scrapy shell is an interactive shell for users to debug Scrapy code without running the spider. It is used to test XPath or CSS expressions to see whether they work successfully; the Scrapy shell also runs against the website I want to scrape, so I can know whether that page is scraped without errors or is intercepted. The Scrapy shell is similar to the Python shell if you have already worked with that. It also works well with IPython, a powerful interactive Python interpreter with syntax highlighting and many other modern features; if the user has IPython installed, the Scrapy shell will use it as the default shell.
Launch the Shell
To open the shell, we use the shell command like this:
scrapy shell <url>
Work with Scrapy Shell
The Scrapy shell provides some additional functions and objects to be used directly in the shell:
- shelp: print a help message with the list of available objects and functions.
- fetch(url[, redirect=True]): fetch a new response from the given URL and update all related objects.
- fetch(request): fetch a new response from the given request and update all related objects.
- view(response): open the given response in the browser; if the response is an HTML file, it is rendered from text into an HTML page.
- crawler: the current Crawler object.
- spider: the Spider used to handle the current URL.
- request: a Request object of the last fetched URL.
- response: the Response returned for the last fetched URL.
- settings: the current Scrapy settings.
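For example, a quick session against the sample site used earlier might look like this (outputs omitted; the selectors are only illustrative):

scrapy shell 'http://quotes.toscrape.com/tag/humor/'
>>> response.css('span.text::text').get()        # extract the first quote on the page
>>> fetch('http://quotes.toscrape.com/page/2/')  # download another page and update response
>>> view(response)                               # open the fetched response in the browser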
Item Pipeline
After an item is scraped in the spider, it is then sent to the Item Pipeline. This is where the item is manipulated, for example changing the format of the item or deciding whether to keep or drop it. To write an Item Pipeline, we need to create a class inside pipelines.py, in the same directory as the items.py file. The naming convention of the class is <Name of class>Pipeline; we can follow an existing class already implemented inside the file.
Work with Item Pipeline
The Item Pipeline provides some methods to process items; it is recommended to use ItemAdapter to work with items. To use ItemAdapter, you need to import it into the file:
from itemadapter import ItemAdapter
I will list some useful methods of the Item Pipeline below:

- process_item(self, item, spider): this method is called for every item pipeline component, and it must return an item object at the end. Inside this method, you can access the item by creating an ItemAdapter object:

adapter = ItemAdapter(item)
# Then you can access the item as usual:
adapter['post_type'] = something
To drop an item, I use the DropItem exception provided by scrapy.exceptions; you just need to import it like ItemAdapter:

from scrapy.exceptions import DropItem
# To drop an item, raise DropItem:
raise DropItem("drop item because post_type is missing: %s" % adapter['post_type'])
- open_spider(self, spider): this method is called when the spider is opened.
- close_spider(self, spider): this method is called when the spider is closed.
Activate Item Pipeline in the Scrapy Settings

To allow an Item Pipeline component to process items, you need to activate it in the Scrapy settings. The value assigned to each pipeline represents the order in which they run.
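Activation happens through the ITEM_PIPELINES setting in settings.py; a minimal sketch is shown below, where the module path and class name are hypothetical, and the integer (between 0 and 1000) decides the running order, lower values running first:

# settings.py -- enable a single pipeline; lower numbers run earlier
ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}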
Request & Response
A Scrapy Request is used to send a request to a given URL in the spider. If you define your crawled URLs in the start_urls list, Scrapy calls the default request for every URL in the list. However, in some cases you want to rewrite the request with additional arguments or a custom callback function instead of the default self.parse, so you need to make the request yourself to get the page source. The Response is the object returned when a request is executed; both Request and Response can be used by instantiating their objects. The Response is commonly used in middleware, for example to return a custom Response when a request is executed successfully.
Passing additional data to callback functions

Users may want to pass data to their callback function, which is called when the request is executed successfully. You can achieve this using the cb_kwargs attribute; the passed value should be a dictionary:
def parse(self, response):
    url = 'something'
    yield scrapy.Request(url=url, callback=self.parse_response,
                         cb_kwargs=dict(url=url))

def parse_response(self, response, url):
    yield {'url': url}  # the url passed from parse via cb_kwargs
2.5.2 Scrapy’s Architecture
Below is the picture illustrating the Data Flow in Scrapy:
Figure 2.2: Scrapy’s architecture
These flows are controlled by the execution engine and go in the order of the numbers shown in the picture:
1. The Spider sends a request to the Engine.
2. The Engine transfers it to the Scheduler for scheduling the request and continues listening for the next request to crawl.
3. The Scheduler returns the next request to crawl to the Engine.
4. The Engine now passes the request to the Downloader Middleware.
5. After the Downloader Middleware finishes downloading the requested item, it sends a Response to the Engine.
6. The Engine then sends the Response back to the Spider for item processing.
7. The Spider processes the Response and returns the scraped items and a new request to the Engine.
8. The Engine sends the scraped items to the Item Pipeline and schedules the new request.
9. The process repeats from step 1.
2.5.3 Working with Dynamically-loaded Content
Another practice with Scrapy's selectors that I want to demonstrate is working with dynamically loaded content (content created by JavaScript). The data can sit inside a script tag, which stores the JavaScript code or files used by the page, or it may only be retrieved when the page is loaded in a web browser, which Scrapy's downloader cannot handle. Scrapy's documentation presents a section discussing these cases: parsing JavaScript code and using a headless browser.

- Parsing JavaScript code is used when your data is inside the JavaScript code, usually inside a script tag.
- Using a headless browser is used when the web page requires running some JavaScript code before the HTML page is fully loaded; since Scrapy's spider only downloads the raw HTML source once, without running JavaScript, that source may not contain the data I need to capture.
Parsing JavaScript code
Since the desired data is hardcoded inside a script tag, to retrieve it I first need to get the JavaScript code:

- If the code is inside a file, I can read it using response.text.
- If the code is within a script tag, I can use Scrapy's selectors to extract the whole text of that tag, then try to retrieve the desired data inside.
Now, I want to get only the specific data, not the whole bunch of JS code. There are many ways demonstrated by Scrapy, but I prefer using a Python library, js2xml, to handle this problem. I use this library to convert the JS code into an XML document; in this way the text looks more like an HTML file in structure, so I can use Scrapy selectors to handle it efficiently. For example, in my thesis work I want to get the data about the location of a real estate post, its longitude and latitude, and this data is hardcoded on the page. By inspecting the elements of the page, I can locate the position of the script tag that contains that data.
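A minimal sketch of this conversion step is shown below; the XPath used to locate the script tag is an assumption for illustration, since the exact page structure is site-specific:

import js2xml
import lxml.etree
from scrapy.selector import Selector

# Grab the JavaScript source of the script tag that holds the coordinates
js_code = response.xpath('//script[contains(., "latitude")]/text()').get()

# Parse the JS code and serialize the resulting tree to XML text,
# so the usual XPath selectors can be applied to it
xml_text = lxml.etree.tostring(js2xml.parse(js_code), encoding='unicode')
selector = Selector(text=xml_text)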
Now I can use selectors to retrieve the data normally:
latitude = selector.xpath('//property[@name="latitude"]/number').attrib["value"]
item['latitude'] = float(latitude)
longitude = selector.xpath('//property[@name="longitude"]/number').attrib["value"]
item['longitude'] = float(longitude)
Splash - A JavaScript rendering service
Before discussing how to use a headless browser, I want to walk through Splash, a JavaScript rendering service written in Python 3. Splash is also supported by Scrapy, along with a library, scrapy-splash, for seamless integration between them. Splash executes scripts written in Lua to render custom JavaScript in the page context, and it works well on many web pages using JavaScript. The advantage of using Splash is that it reduces the CPU resources used compared to running a full browser to capture a specific page, since it is a lightweight web browser with an HTTP API.
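Assuming scrapy-splash has been installed and configured in the project settings as described in its documentation, and a Splash instance is running (by default on http://localhost:8050), a spider can ask Splash to render a page before parsing it; a minimal sketch with a placeholder URL is:

from scrapy_splash import SplashRequest

# Inside a Spider subclass: ask Splash to render the page, waiting 2 seconds
# for its JavaScript to finish, before calling self.parse on the result.
def start_requests(self):
    yield SplashRequest('https://example.com/listing', self.parse, args={'wait': 2})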
Splash runs on the Docker service, so you need to have Docker installed before using Splash. At this stage of my thesis, I have already worked a little with Splash; however, my problem could not be handled by this service, so I am going to use a real browser as my best solution. I will explain the details of this problem in the Problems and Solutions section.
Using a headless browser
The main idea of this solution is to send an actual request from a browser to the web page, download the HTML source, and then return it to Scrapy. Some websites have JavaScript code embedded inside that only executes when the page loads, while Scrapy captures the raw HTML source only and does not run JavaScript. This results in some significant data missing, or Scrapy's spider may be prevented from downloading the page because the site considers this behavior to be a bot. Besides, trying to get past this protection is difficult, since the website does not want its information to be scraped; it uses an external service that stands between the client (my spider) and the server when handling the request. That service uses its anti-bot mechanism to check whether an incoming request is sent by a real client or by a bot and then lets it pass or blocks it. Therefore, it takes a lot of time and effort to overcome it, and letting the spider use a browser to send requests to the web page is the best choice at this time; this way, that middleman lets our requests into the server smoothly.
Running a browser to download multiple pages at a time consumes a lot of CPU, and the download speed is much lower compared to Scrapy's own requests. However, it is worth doing, since most of the crawled data I have now comes from this page. I will go into more detail about how I combine a headless browser with Scrapy in the Problems and Solutions section.
A simple way to use a browser is Selenium, a tool supporting the automation of web browsers.
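A minimal sketch of this idea, assuming Chrome and chromedriver are installed and using a placeholder URL, is shown below: the browser renders the page, and the resulting HTML is wrapped in an HtmlResponse so that Scrapy-style selectors can be applied to it as usual:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from scrapy.http import HtmlResponse

# Start Chrome without a visible window
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

url = 'https://example.com/listing'
driver.get(url)              # the browser executes the page's JavaScript
html = driver.page_source    # fully rendered HTML
driver.quit()

# Hand the rendered HTML back to Scrapy-style parsing
response = HtmlResponse(url=url, body=html, encoding='utf-8')
titles = response.css('h1::text').getall()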