Unit 14 Business Intelligence Assignment 1


Higher Nationals in Computing

Unit 14: Business Intelligence

Assessor name: Nguyen Xuan Sam

Assignment due: 4/3/2023

Assignment submitted: 4/3/2023


ASSIGNMENT 1 FRONT SHEET

Qualification BTEC Level 5 HND Diploma in Computing

Unit number and title Unit 14: Business Intelligence




 Summative Feedback:   Resubmission Feedback:

Signature & Date:


ASSIGNMENT 1 BRIEF

Qualification BTEC Level 5 HND Diploma in Computing

Unit number and title Unit 14: Business Intelligence

Assignment title Assignment 1: Discover business process and BI technologies

Academic Year 2023

Unit Tutor Nguyen Xuan Sam

Submission Format:

Format: The submission is an individual written report that shows how you have managed the project. It should be written in a concise, formal business style using single spacing and font size 12. You are required to make use of headings, paragraphs and subsections as appropriate, and all work must be supported with research and referenced using the Harvard referencing system. Please also provide a bibliography using the Harvard referencing system.

Submission: Students must submit the assignment by the due date and in the way requested by the Tutors. The submission will be a soft copy in PDF format posted to the corresponding course at http://cms.greenwich.edu.vn/

Note: The assignment must be your own work, and not copied by or from another student or from books, etc. If you use ideas, quotes or data (such as diagrams) from books, journals or other sources, you must reference your sources using the Harvard style. Make sure that you know how to reference properly and that you understand the guidelines on plagiarism. If you do not, you will fail.

Assignment Brief and Guidance:

Your company has been working in [Assumed Domain] for 2 years. For a new, young company, competition in the market is very high. Therefore, the Board of Directors has decided to apply Business Intelligence to improve the company's business processes by making better decisions.

The Board of Directors assigns a small group, including you, in the Research & Development Department to study business intelligence and apply it in the company in the coming years.

You need to research the business processes and decision support processes in the company and identify the types of data (unstructured, semi-structured or structured) generated by these processes, with examples. You also need to research the current software used in the business processes or decision support processes and evaluate its use (benefits and drawbacks).

Next, you need to understand the types of support for decision-making at different levels (operational, tactical and strategic) within the company and study which business intelligence features can help with those types of support. Study the information systems or technologies (of BI) that can be used in this case, and compare and contrast them to conclude which should be used.

Your group needs to present the research results to the board in a 30-minute presentation.

Learning Outcomes and Assessment Criteria

LO1 Discuss business processes and the mechanisms used to support business decision-making

P1 Examine, using examples, the terms ‘Business Process’ and ‘Supporting Processes’
M1 Differentiate between unstructured and semi-structured data within an organisation
D1 Critically evaluate the project management process and appropriate research methodologies applied

LO2 Compare the tools and technologies associated with business intelligence functionality

P2 Compare the types of support available for business decision-making at varying levels within an organisation
M2 Justify, with specific examples, the key features of business intelligence functionality
D2 Compare and contrast a range of information systems and technologies that can be used to support organisations at operational, tactical and strategic levels


Table of Contents

1 Introduction
1.1 Overview of problems
1.2 Motivations
1.3 Objectives
2 Related works and dataset
2.1 Related works
2.2 Dataset
2.3 Summary
3 Proposed model
3.1 Correlation
3.2 Linear regression
3.3 Multiple regression
4 Simulating scenarios and Results
4.1 Package installation
4.2 Correlation
4.3 Scenarios and analysis
5 Conclusions and future works
5.1 Conclusions
5.2 Future works


Figure 1: Factors that affect house prices

Nowadays, data scientists have built many machine learning projects for price prediction. With machine learning, we can predict a value for new data from features that we already have. One of the most widely used models for predictive analysis is regression. Its purpose is to predict future results, and it has been applied in many fields such as economics, business, banking, healthcare, e-commerce, entertainment, and sports. This technique is therefore popular for building price prediction models from a set of features.

1.2 Motivations

Real estate is the least transparent sector of our economy. Real estate prices fluctuate daily, and prices are sometimes inflated rather than based on estimates. When people decide to buy a home, they look for one that is affordable and meets all their requirements. With machine learning, we can predict house prices and decide whether a house is worth buying or selling for a higher price. In this report, we forecast home prices in King County, USA. Features such as the size, location, and square footage of the home can be key factors in determining the price.

1.3 Objectives

In this work, there are several important questions that I focus on:

- How does the size of the house affect the house price?
- How does the size of the housing area affect the house price?
- How does the area of the house campus affect the area of the house?
- How does the area of the house affect the number of bathrooms?
- Multiple regression on all features

To answer the questions in the first chapter, I will show the dataset. There are several steps to get information from raw data. These steps are shown in Figure 1 below, namely data collection

In the first chapter, I introduced my work and outlined the goals of the project. The rest of this work presents my dataset, methods, and results, as well as a demo of the application.

2 Related works and dataset

2.1 Related works

In the study, the authors used several algorithms: Multiple Linear Regression, Ridge Regression, LASSO Regression, Elastic Net Regression, Ada Boosting Regression, and Gradient Boosting. The purpose of the study was to compare these methods by their model error. The results show that multiple regression has a fairly low error, which suggests that it is a suitable model for predicting housing prices.

In a further study, the authors divide the characteristics affecting housing prices into three categories: structural conditions, concepts, and locations. Structural (physical) features are characteristics of the house that can be seen with the human eye, such as the size of the house, the number of bedrooms, the presence of a kitchen and garage, the presence of a garden, the size of the plot and other structures, and the age of the house. Conceptual features, on the other hand, are concepts offered by developers to attract buyers, such as a minimalist home, a healthy and environmentally friendly home, or an elite environment. The research found that these characteristics are significantly correlated with real estate prices.


In summary, many studies predict house prices using different machine learning methods or models. In my project, I will use linear regression and multiple regression for model building and forecasting. I will use all the features in this dataset to build a good model.

2.2 Dataset

2.2.1 Data collection

I got the data from Kaggle. The dataset contains house sale prices for 2014-2015. The raw data set contains over 21,000 entries and 21 columns. In this dataset, the price column is the dependent variable, and the rest of the columns, except ID and Date, are the independent variables. This is the beginning of the data set. In Figure 3, the dependent variable, price, is continuous, and the other variables are the independent variables.
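The loading step described above can be sketched as follows. This is a minimal illustration, not the notebook used for the report: the rows are made-up stand-ins embedded in the script, and the file name `kc_house_data.csv` mentioned in the comment is an assumption about how the Kaggle file is saved locally.

```python
import io

import pandas as pd

# Made-up stand-in rows; the real data set has over 21,000 rows and 21 columns.
# With the real file one would call pd.read_csv("kc_house_data.csv") instead
# (the file name is an assumption).
SAMPLE_CSV = """id,date,price,bedrooms,bathrooms,sqft_living
1,20141013T000000,300000,3,1.0,1200
2,20141209T000000,450000,3,2.25,2000
3,20150225T000000,250000,2,1.0,900
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# price is the dependent variable; every column except id and date
# is treated as an independent variable.
target = "price"
features = [c for c in df.columns if c not in ("id", "date", target)]

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # the beginning of the data set
print(features)
```

Separating `target` from `features` at load time keeps the later regression steps independent of the column order in the file.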

2.2.2 Dataset description

The dataset includes:

Id: the unique identifier of each house
Date: the date when the house was sold
Price: the sale price of the house (USD)
Bedrooms: the number of bedrooms
Bathrooms: the number of bathrooms
Sqft_living: the square footage of the living area of the house
Sqft_lot: the square footage of the lot
Floors: the number of floors
Waterfront: whether the house has a waterfront view
View: the view of the house
Condition: the overall condition of the house on a scale of 1-5
Grade: the overall grade of the housing unit on a scale of 1-13
Sqft_above: the living area of the home, excluding the basement
Sqft_basement: the square footage of the basement
Yr_built: the year the house was built
Yr_renovated: the year the house was renovated
Zipcode: the zip code of the house
Lat: latitude coordinate
Long: longitude coordinate
Sqft_living15: the interior living area of the 15 nearest neighbors' homes
Sqft_lot15: the lot area of the 15 nearest neighbors

2.2.3 Data cleaning

Data cleaning is one of the most important steps before exploration, analysis, and modeling in machine learning. Its purpose is to deal with abnormal data such as missing data, outliers, unwanted data, or inconsistent data. There are many ways to clean data, for example deleting data, replacing data, or changing the data type of a value. Before cleaning, first examine the raw dataset to see what to do next:
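One possible way to examine a raw dataset with pandas is sketched below. The rows are invented for illustration (one missing value and one duplicate row are planted on purpose), and the choice of the median as a replacement value is an assumption, not necessarily what was done in this project.

```python
import numpy as np
import pandas as pd

# Invented rows: one missing price and one fully duplicated row.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [300000.0, 450000.0, 450000.0, np.nan],
    "sqft_living": [1200, 2000, 2000, 1500],
})

raw.info()                                    # column types and non-null counts
missing_per_column = raw.isnull().sum()       # missing values per column
duplicate_rows = int(raw.duplicated().sum())  # fully repeated rows

# Two of the cleaning options named above: deleting data and replacing data.
clean = raw.drop_duplicates().copy()
clean["price"] = clean["price"].fillna(clean["price"].median())

print(missing_per_column)
print(duplicate_rows)
print(clean)
```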

2.2.4 Data processing

For data processing, there are several things to do with the raw dataset:

a. Change the datatype of sqft_basement from int to float

b. Change the datatype of the date column and create more columns from it

I change the date column from the object datatype to a date datatype

c. Change the datatype of yr_renovated

In the raw dataset, this column has the datatype int, so I change it to float
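Steps a to c can be sketched in pandas as follows. The sample rows are invented, the date string pattern is an assumption about the raw file, and the new column names `year_sold` and `month_sold` are hypothetical.

```python
import pandas as pd

# Invented rows; the date strings follow the pattern assumed for the raw file.
df = pd.DataFrame({
    "date": ["20141013T000000", "20150225T000000"],
    "sqft_basement": [0, 400],
    "yr_renovated": [0, 1991],
})

# a. sqft_basement: int -> float
df["sqft_basement"] = df["sqft_basement"].astype(float)

# b. date: object -> datetime, plus extra columns derived from it
df["date"] = pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")
df["year_sold"] = df["date"].dt.year    # hypothetical column name
df["month_sold"] = df["date"].dt.month  # hypothetical column name

# c. yr_renovated: int -> float
df["yr_renovated"] = df["yr_renovated"].astype(float)

print(df.dtypes)
```

Passing an explicit `format` avoids relying on automatic date inference, which can be slower and less predictable on large files.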

2.3 Summary

In this chapter, I cleaned my raw dataset into a better dataset that is easy to explore and analyze. I believe this is the most important step before doing any prediction or modeling in machine learning. In the next chapter, I begin to build and explain my model, and some visualizations will be shown for better analysis.

3 Proposed model

3.1 Correlation

Basically, correlation measures the strength and direction of the relationship between two variables (Hauke and Kossowski, 2011). The correlation coefficient formula follows (David Groebner, 2017).
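In its standard sample form, assuming the usual textbook notation where $x_i$ and $y_i$ are the paired observations, $\bar{x}$ and $\bar{y}$ are their sample means, and $n$ is the number of pairs, the coefficient can be written as:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value of $r$ lies between -1 and 1: values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.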

The above function is called the Pearson product moment correlation. To see whether two variables are correlated, we can look at a scatter plot, as shown below:
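Alongside the scatter plot, the coefficient can also be checked numerically. The sketch below computes the Pearson coefficient directly from its definition; the function name `pearson_r` and the sample values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Invented values for illustration only.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200000, 260000, 330000, 380000, 450000]

r = pearson_r(sqft, price)
print(round(r, 4))
```

The result should agree with `np.corrcoef(sqft, price)[0, 1]`, which is a convenient cross-check.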


3.4 R-squared and Adjusted R-squared

The coefficient of determination R² or its adjusted version, adjusted R², which indicates how much of the change in the response is explained by the model, may be the most frequently used statistic in regression to assess the goodness of fit of a model (Akossou and Palm, 2013).
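Both statistics can be computed from observed and predicted values. The sketch below uses the standard definitions, R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors; the function names and the sample values are illustrative assumptions.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Share of the variation in y_true that the predictions explain."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return float(1.0 - ss_res / ss_tot)


def adjusted_r_squared(r2, n_samples, n_predictors):
    """R-squared penalised for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Invented values for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

r2 = r_squared(y_true, y_pred)
adj = adjusted_r_squared(r2, n_samples=4, n_predictors=1)
print(r2, adj)
```

Adjusted R² is never larger than R² when at least one predictor is used, which is why it is the safer figure to compare models with different numbers of features.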


Step 1: Install the basic packages for this work

Three packages need to be installed for data discovery, analysis, and the application: Pandas, NumPy, and Streamlit. We can install them using pip or Anaconda:

Step 2: Install packages for data visualization

I use two packages for visualization: seaborn and matplotlib

Use pip:

Use Anaconda:


The second package is Statsmodels

Step 4: Install the package for maps

I use the Folium package to display maps:

Using pip:

Using Anaconda:

Folium version: 0.14.0

After installing all of the required packages for this work, I import all of them in Jupyter Notebook, except the Streamlit package:
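Before the imports, a quick availability check can confirm that the installation steps succeeded. The sketch below uses only the Python standard library; the helper name `missing_packages` is hypothetical, and the package list is taken from the steps above.

```python
from importlib.util import find_spec

# Import names of the packages mentioned in the installation steps.
REQUIRED = ["pandas", "numpy", "streamlit", "seaborn", "matplotlib",
            "statsmodels", "folium"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
    print("One way to install them: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```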

4.2 Correlation

I will explore the correlations in the dataset and use a heatmap to show them:


From the heatmap, I collect the pairs with a high correlation, that is, a correlation score above 0.5:

- price and sqft_above: 0.605567

- price and sqft_living: 0.702035

- bathrooms and sqft_living: 0.754665

- sqft_above and sqft_living: 0.876597
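The selection of pairs above the 0.5 threshold can be sketched as follows. The data frame is a small invented example, so its scores differ from the values listed above, and the helper name `high_pairs` is hypothetical.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000, 3500],
    "price": [200000, 260000, 330000, 380000, 450000, 520000],
    "yr_built": [1990, 1950, 2005, 1950, 2005, 1990],
})

corr = df.corr()  # Pearson correlation matrix (the pandas default)
# With seaborn installed, sns.heatmap(corr, annot=True) would draw the heatmap.


def high_pairs(corr, threshold=0.5):
    """List each pair of columns whose correlation is above the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs


pairs = high_pairs(corr)
for a, b, score in pairs:
    print(f"- {a} and {b}: {score:.6f}")
```

Scanning only the upper triangle avoids reporting both (a, b) and (b, a) and skips the diagonal, where every column correlates perfectly with itself.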

4.3 Scenarios and analysis

Scenario 1: How does the size of the housing area affect the house price?


The formula of this model:


Scenario 2: How does the size of the house affect the house price?

With only sqft_above as the predictor, R-squared = 0.367, which means that it explains 36.7% of the variation in house prices. The figure below shows how strong the relationship between sqft_above and house prices is.

For sqft_above from 0 to about 5,000, house prices stay around the average, although they rise more often than they fall. For sqft_above above 6,000, house prices rise steadily and rarely fall. House prices peak near 8,000,000 when sqft_above exceeds 8,300. We can see that sqft_above alone has a limited effect on the price of the house.
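A single-predictor fit of this kind can be sketched with NumPy alone. This is not the code used for the report, which lists Statsmodels among its packages; it is a plain least-squares line on invented values, shown to make the link between the fitted formula and R-squared concrete.

```python
import numpy as np

# Invented values; with the real data, x would be df["sqft_above"]
# and y would be df["price"].
x = np.array([800, 1200, 1600, 2000, 2400, 3000], dtype=float)
y = np.array([210000, 250000, 340000, 360000, 470000, 520000], dtype=float)

# Least-squares line y = intercept + slope * x
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_res = ((y - y_hat) ** 2).sum()     # variation left unexplained by the line
ss_tot = ((y - y.mean()) ** 2).sum()  # total variation in y
r2 = 1.0 - ss_res / ss_tot

print(f"price = {intercept:.2f} + {slope:.2f} * sqft_above")
print(f"R-squared = {r2:.3f}")
```

For a model with one predictor and an intercept, R-squared equals the square of the Pearson correlation between the two variables, which gives a quick consistency check against the heatmap scores.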


Scenario 3: How does the area of the house affect the number of bathrooms?

With only sqft_living as the predictor, R-squared = 0.570, which means that it explains 57% of the variation in the number of bathrooms. The figure below shows how strong the relationship between sqft_living and the number of bathrooms is.

For sqft_living from 0 to about 6,000, the number of bathrooms generally ranges from 0 to 6, although in some cases within this range the number of bathrooms is above average or is even 0. For sqft_living above 8,000, the number of bathrooms is 4 or more and rarely falls below 4. Overall, sqft_living has a strong influence on the number of bathrooms in the house.

Scenario 4: How does the area of the house campus affect the area of the house?

With only sqft_living as the predictor, R-squared = 0.582, which means that it explains 58.2% of the variation in sqft_above. The figure below shows how strong the relationship between sqft_living and sqft_above is.

For sqft_living from 0 to about 8,000, sqft_above increases strongly on average. However, when sqft_living exceeds 8,000, sqft_above almost stops increasing and even decreases slightly. It can be seen that sqft_living from 0 to about 8,000 has a strong influence on sqft_above. The formula of this model:

5 Conclusions and future works

Title	Business Intelligence
Author	Nguyen Xuan Nam
Supervisor	Nguyen Xuan Sam
Institution	Greenwich University
Major	Computing
Type	Assignment
Year of publication	2023
City	Hanoi
