Higher Nationals in Computing
Unit 14: Business Intelligence
Assessor name: Nguyen Xuan Sam
Assignment due: 4/3/2023 Assignment submitted: 4/3/2023
ASSIGNMENT 1 FRONT SHEET
Qualification BTEC Level 5 HND Diploma in Computing
Unit number and title Unit 14: Business Intelligence
Summative Feedback: Resubmission Feedback:
Signature & Date:
ASSIGNMENT 1 BRIEF
Qualification BTEC Level 5 HND Diploma in Computing
Unit number and title Unit 14: Business Intelligence
Assignment title Assignment 1: Discover business process and BI technologies
Academic Year 2023
Unit Tutor Nguyen Xuan Sam
Submission Format:
Format: The submission is in the form of an individual written report that shows how you have managed the project. This should be written in a concise, formal business style using single spacing and font size 12. You are required to make use of headings, paragraphs and subsections as appropriate, and all work must be supported with research and referenced using the Harvard referencing system. Please also provide a bibliography using the Harvard referencing system.
Submission: Students must submit the assignment by the due date and in the way requested by the Tutors. The submission will be a soft copy in PDF format posted to the corresponding course on http://cms.greenwich.edu.vn/
Note: The assignment must be your own work, and not copied by or from another student or from books etc. If you use ideas, quotes or data (such as diagrams) from books, journals or other sources, you must reference your sources using the Harvard style. Make sure that you know how to reference properly, and that you understand the guidelines on plagiarism. If you do not, you will definitely fail.
Assignment Brief and Guidance:
Your company has been working in [Assumed Domain] for 2 years. For a new, young company, competition in the market is very high. Therefore, the Board of Directors has decided to apply Business Intelligence to improve the company's business processes by making better decisions.
The Board of Directors assigns a small group in the Research & Development Department, including you, to study business intelligence and apply it in the company in the coming years.
You need to research the business processes and decision support processes in the company and identify the types of data (unstructured, semi-structured or structured) generated by these processes, with examples. You also need to research the software currently used in the business processes or decision support processes and evaluate its use (benefits and drawbacks).
Next, you need to understand the types of support for decision-making at different levels (operational, tactical and strategic) within the company and study which business intelligence features can help with those types of support. Study the information systems or technologies (of BI) that can be used in this case, and compare and contrast them to conclude which should be used.
Your group needs to present the research results to the board in a 30-minute presentation.
Learning Outcomes and Assessment Criteria
LO1 Discuss business processes and the mechanisms used to support business decision-making
P1 Examine, using examples, the terms ‘Business Process’ and ‘Supporting Processes’
M1 Differentiate between unstructured and semi-structured data within an organisation
D1 Critically evaluate the project management process and appropriate research methodologies applied
LO2 Compare the tools and technologies associated with business intelligence functionality
P2 Compare the types of support available for business decision-making at varying levels within an organisation
M2 Justify, with specific examples, the key features of business intelligence functionality
D2 Compare and contrast a range of information systems and technologies that can be used to support organisations at operational, tactical and strategic levels
Table of Contents
1 Introduction
1.1 Overview of problems
1.2 Motivations
1.3 Objectives
2 Related works and dataset
2.1 Related works
2.2 Dataset
2.3 Summary
3 Proposed model
3.1 Correlation
3.2 Linear regression
3.3 Multiple regression
4 Simulating scenarios and Results
4.1 Package installation
4.2 Correlation
4.3 Scenarios and analysis
5 Conclusions and future works
5.1 Conclusions
5.2 Future works
Figure 1: Factors that affect house prices
Nowadays, data scientists have built many machine learning projects for price prediction. With machine learning, we can predict a value for new data based on features that we already have. One of the most widely used models for predictive analysis is regression. Its purpose is to predict future results, and it has been applied in many fields such as economics, business, banking, healthcare, e-commerce, entertainment and sports. Therefore, this technique is commonly used to build price prediction models based on a set of features.
1.2 Motivations
Real estate is one of the least transparent sectors of our economy. Real estate prices fluctuate daily, and sometimes prices are inflated rather than based on valuations. When people decide to buy a home, they look for one that is affordable and meets all their requirements. With machine learning, we can predict house prices and decide whether a house is worth buying or selling for a higher price. In this report, I forecast home prices in King County, USA. Features such as the size, location and square footage of the home can be key factors in determining the price.
1.3 Objectives
In this work, I focus on several important questions:
- How does the size of the house affect the house price?
- How does the size of the housing area affect the house price?
- How does the area of the house campus affect the area of the house?
- How does the area of the house affect the number of bathrooms?
- Multiple regression of all features
To answer the questions in the first chapter, I will present the dataset. There are several steps to get information from raw data. These steps, starting with data collection, are shown in Figure 1 below.
In the first chapter, I introduced my work and outlined the goals of the project. The rest of this report presents my dataset, methods, and results, as well as a demo of the application.
2 Related works and dataset
2.1 Related works
In one study, the authors used algorithms such as Multiple Linear Regression, Ridge Regression, LASSO Regression, Elastic Net Regression, Ada Boosting Regression, and Gradient Boosting. The purpose of the study was to compare these methods by the model error of each. The results show that multiple regression has a fairly low error, which suggests that it is a suitable model for predicting housing prices.
In a further study, the authors divide the characteristics affecting housing prices into three categories: structural conditions, concepts, and location. Structural, or physical, features are those characteristics of the house that can be seen with the human eye, such as the size of the house, the number of bedrooms, the presence of a kitchen and garage, the presence of a garden, the size of the plot and other structures, and the age of the house. On the other hand, conceptual features are concepts provided by developers to attract buyers, such as a minimalist home, a healthy and environmentally friendly home, or an elite environment. The study found that these characteristics are significantly correlated with real estate prices.
In summary, many studies predict house prices using different machine learning methods or models. In my project, I use linear regression and multiple regression for model building and forecasting. I take advantage of all the features in this dataset to build a good model.
2.2 Dataset
2.2.1 Data collection
I obtained the data from Kaggle. The dataset contains house sale prices for 2014-2015. The raw dataset contains over 21,000 entries and 21 columns. In this dataset, the price column is the dependent variable, and the remaining columns, except ID and Date, are the independent variables. The first rows of the dataset are shown below. As Figure 3 shows, price is a continuous dependent variable, and the other variables are the independent variables.
2.2.2 Dataset description
The dataset includes:
Id: the unique identifier of each house
Date: the date when the house was sold
Price: the sale price of the house (in US dollars)
Bedrooms: the number of bedrooms
Bathrooms: the number of bathrooms
Sqft_living: the square footage of the living area of the house
Sqft_lot: the square footage of the lot
Floors: the number of floors
Waterfront: whether the house has a waterfront view
View: whether the house has a view
Condition: the overall condition of the house on a scale of 1-5
Grade: the overall grade of the housing unit on a scale of 1-13
Sqft_above: the living area of the home in square feet, excluding the basement
Sqft_basement: the square footage of the basement
Yr_built: the year the house was built
Yr_renovated: the year the house was renovated
Zipcode: the zip code of the house
Lat: latitude coordinate
Long: longitude coordinate
Sqft_living15: the interior living area of the 15 closest neighbors' houses
Sqft_lot15: the lot area of the 15 closest neighbors' land lots
2.2.3 Data cleaning
Data cleaning is one of the most important steps before discovery, analysis, and modeling in machine learning. The purpose of data cleaning is to deal with abnormal data such as missing data, outliers, unwanted data, or inconsistent data. There are many ways to clean data, for example deleting data, replacing data, or changing the data type of a value. Before cleaning, I first examine the raw dataset to see what to do next:
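A first look at the raw data could be sketched as follows. This is an illustration only: it assumes pandas, the small inline table stands in for the Kaggle file (its values are made up), and `summarize_raw` is a hypothetical helper name.

```python
import pandas as pd


def summarize_raw(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: data type, missing values, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })


# Made-up stand-in for the raw dataset, with one duplicate row and one missing price.
raw = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "price": [221900.0, 538000.0, 538000.0, None],
    "bedrooms": [3, 3, 3, 4],
})

summary = summarize_raw(raw)
n_duplicates = int(raw.duplicated().sum())  # rows identical to an earlier row

print(summary)
print("duplicate rows:", n_duplicates)
```

The summary makes it easy to spot columns with missing values or an unexpected data type before deciding how to clean them.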
2.2.4 Data processing
For data processing, I make several changes to the raw dataset:
a. Change the datatype of sqft_basement from int to float
b. Change the datatype of the date column and create more columns from it
I change the date column from the object datatype to the date datatype
c. Change the datatype of yr_renovated
In the raw dataset, this column has the int datatype, so I change it to float
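The three changes above can be sketched with pandas as follows. The sample rows, the `YYYYMMDDTHHMMSS` layout of the date strings, and the derived `year` and `month` columns are assumptions made for illustration; the text does not say which extra columns were created.

```python
import pandas as pd

# Made-up rows; the date layout "YYYYMMDDTHHMMSS" is an assumption.
raw = pd.DataFrame({
    "date": ["20141013T000000", "20150225T000000"],
    "sqft_basement": [0, 400],
    "yr_renovated": [0, 1991],
})


def process(df: pd.DataFrame) -> pd.DataFrame:
    """Apply steps a, b and c to a copy of the raw data."""
    out = df.copy()
    # a. integer -> float
    out["sqft_basement"] = out["sqft_basement"].astype(float)
    # b. text (object) -> datetime, plus derived columns (hypothetical choice)
    out["date"] = pd.to_datetime(out["date"], format="%Y%m%dT%H%M%S")
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    # c. integer -> float
    out["yr_renovated"] = out["yr_renovated"].astype(float)
    return out


processed = process(raw)
print(processed.dtypes)
```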
2.3 Summary
In this chapter, I cleaned my raw dataset into a better dataset that is easy to explore and analyze.
I believe this is the most important step before doing any prediction or modeling in machine learning. In the next chapter, I build and explain my model, with some visualizations for better analysis.
3 Proposed model
3.1 Correlation
Basically, correlation measures the strength and direction of the relationship between two variables (Hauke and Kossowski, 2011). The formula for the correlation coefficient follows (David Groebner, 2017).
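In its standard textbook form, stated here as an assumption about the formula the text refers to, the sample correlation coefficient for n paired observations (x_i, y_i) with means x̄ and ȳ is:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value of r lies between -1 and 1; values near 0 indicate little or no linear relationship.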
The above function is called the Pearson product-moment correlation. To see whether two variables are correlated, we can look at a scatter plot, as shown below:
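Alongside a scatter plot, the coefficient can be computed directly. The sketch below assumes NumPy and uses made-up numbers; `pearson_r` is a hypothetical helper that follows the standard definition of the Pearson coefficient.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up example values, not taken from the dataset.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200000, 260000, 330000, 390000, 480000]

r = pearson_r(sqft, price)
print(round(r, 4))
```

The result agrees with `np.corrcoef(sqft, price)[0, 1]`, which is a convenient way to cross-check the manual calculation.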
3.4 R-squared and Adjusted R-squared
The coefficient of determination R² and the adjusted R², which indicate how much of the change in the response is explained by the model, may be the most frequently used statistics in regression to assess the goodness of fit of a model (Akossou and Palm, 2013).
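Both statistics can be computed from the observed and fitted values. The sketch below uses their standard definitions and assumes NumPy; the function names and sample numbers are hypothetical.

```python
import numpy as np


def r_squared(y_true, y_pred) -> float:
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)


def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjust R-squared for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Made-up observed and fitted values for illustration.
y = [3.0, 5.0, 7.0, 10.0]
y_hat = [3.2, 4.9, 7.4, 9.5]

r2 = r_squared(y, y_hat)
adj = adjusted_r_squared(r2, n=len(y), p=1)
print(round(r2, 4), round(adj, 4))
```

Because the adjustment penalizes each additional predictor, the adjusted value is the more useful one when comparing a simple regression with a multiple regression.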
Step 1: Install the basic packages for this work
Three packages need to be installed for data discovery, analysis, and the application:
Pandas, NumPy and Streamlit. We can install them using pip or Anaconda:
Step 2: Install packages for data visualization
I use two packages for visualization: seaborn and matplotlib
Use pip:
Use Anaconda:
The second package is Statsmodels
Step 4: Install the package for maps
I use the Folium package to display maps:
Using pip:
Using Anaconda:
Folium version: 0.14.0
After installing all of the required packages for this work, I import all of them in Jupyter
Notebook, except the Streamlit package:
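Before importing, it can help to confirm that each package is available in the notebook's environment. The check below is an added sketch, not part of the original notebook; it uses only the standard library, and `REQUIRED` and `missing_packages` are hypothetical names.

```python
from importlib.util import find_spec

# Import names of the packages named in this section (Streamlit is left out,
# since it is not imported in the notebook).
REQUIRED = ["pandas", "numpy", "seaborn", "matplotlib", "statsmodels", "folium"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```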
4.2 Correlation
I explore the correlations in the dataset and use a heatmap to visualize them:
From the heatmap, I select some highly correlated pairs, that is, pairs with a correlation coefficient above 0.5:
- price and sqft_above: 0.605567
- price and sqft_living: 0.702035
- bathrooms and sqft_living: 0.754665
- sqft_above and sqft_living: 0.876597
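The same selection can be made programmatically from the correlation matrix. The sketch below assumes pandas and NumPy and runs on synthetic data, so its output does not reproduce the values listed above; `high_correlation_pairs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.5):
    """Return (column_a, column_b, r) for each distinct pair with r above threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if r > threshold:
                pairs.append((a, b, float(r)))
    return sorted(pairs, key=lambda item: item[2])


# Synthetic stand-in data: price depends on sqft_living, noise depends on nothing.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 5000, size=200)
demo = pd.DataFrame({
    "sqft_living": sqft_living,
    "price": 250 * sqft_living + rng.normal(0, 100000, size=200),
    "noise": rng.normal(size=200),
})

pairs = high_correlation_pairs(demo)
for a, b, r in pairs:
    print(f"- {a} and {b}: {r:.6f}")
```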
4.3 Scenarios and analysis
Scenario 1: How does the size of the housing area affect the house price?
The formula of this model:
Scenario 2: How does the size of the house affect the house price?
With only sqft_above as the predictor, the R-squared of 0.367 means that the model explains 36.7% of the variation in actual house prices. The figure below shows how strong the relationship between sqft_above and house prices is.
For sqft_above between 0 and about 5,000, house prices stay around the average, although they rise more often than they fall. For sqft_above above 6,000, house prices rise steadily and rarely fall. House prices peak near 8,000,000 when sqft_above exceeds 8,300. We can see that sqft_above alone has only a limited effect on the price of the house.
The formula of this model:
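A fitted formula of this kind, price = intercept + slope × sqft_above, can be obtained as in the sketch below. It is an illustration on synthetic data: it uses `numpy.polyfit` rather than the Statsmodels package mentioned in the installation steps, and the coefficients it prints are not the ones from this report.

```python
import numpy as np


def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, r2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
    residuals = y - (intercept + slope * x)
    r2 = 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return float(intercept), float(slope), float(r2)


# Synthetic stand-in data (the true slope and noise level are arbitrary choices).
rng = np.random.default_rng(1)
sqft_above = rng.uniform(500, 6000, size=300)
price = 100000 + 200 * sqft_above + rng.normal(0, 250000, size=300)

intercept, slope, r2 = fit_simple_regression(sqft_above, price)
r = float(np.corrcoef(sqft_above, price)[0, 1])

print(f"price = {intercept:.0f} + {slope:.1f} * sqft_above, R-squared = {r2:.3f}")
```

For a model with a single predictor, R-squared equals the square of the correlation coefficient. This is consistent with the figures in this section: the correlation between price and sqft_above is 0.605567, and 0.605567² ≈ 0.367.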
Scenario 3: How does the living area of the house affect the number of bathrooms?
With only sqft_living as the predictor, the R-squared of 0.570 means that the model explains 57% of the variation in the number of bathrooms. The figure below shows how strong the relationship between sqft_living and the number of bathrooms is.
For sqft_living between 0 and about 6,000, the number of bathrooms generally ranges from 0 to 6, although within this range some houses have more bathrooms than average and some have none. For sqft_living above 8,000, houses have 4 or more bathrooms and rarely fewer. Overall, we can see that sqft_living has a strong influence on the number of bathrooms in the house.
The formula of this model:
Scenario 4: How does the area of the house campus affect the area of the house?
With only sqft_living as the predictor, the R-squared of 0.582 means that the model explains 58.2% of the variation in sqft_above. The figure below shows how strong the relationship between sqft_living and sqft_above is.
For sqft_living between 0 and about 8,000, sqft_above increases strongly on average. However, when sqft_living exceeds 8,000, sqft_above hardly increases and even decreases slightly. It can be seen that sqft_living between 0 and about 8,000 has a strong influence on sqft_above.
The formula of this model:
5 Conclusions and future works