
Higher Nationals in Computing, Unit 14: Business Intelligence, Assignment 1


DOCUMENT INFORMATION

Basic information

Title Business Intelligence
Author Nguyễn Lê Quang Tuấn Anh
Supervisor Nguyen Xuan Sam
School Btec
Major Computing
Type Assignment
Year of publication 2023
Format
Pages 52
File size 2.01 MB

Structure

  • 1.1 Overview of problems
  • 1.2 Motivations
  • 1.3 Objectives
  • 1.4 Summary
  • 2.1 Related works
  • 2.2 Dataset
    • 2.2.1 Data collection
    • 2.2.2 Description dataset
  • 2.3 Summary
  • 3.1 Correlation
  • 3.2 Linear regression
  • 3.3 Multiple regression
  • 3.4 R-squares and Adjusted R-squares
  • 3.5 Model accuracy
  • 3.6 Summary
  • 4.1 Package installation
  • 4.2 Performance of scenarios
  • 5.1 Conclusions
  • 5.2 Future works

Content

ASSIGNMENT 1 FRONT SHEET Qualification BTEC Level 5 HND Diploma in Computing Unit number and title Unit 14: Business Intelligence Submission date March 15, 2023 Date Received 1st submiss

Overview of problems

House prices have changed significantly in recent years as a result of economic growth and rising living standards. A number of factors contribute to the instability of housing values, as shown by the example in the figure below.

Figure 1: Factors that affect house prices

Scientists have already applied machine learning to a large number of data projects, and one of the most often used methods is Random Forest. Random forest is a common supervised machine learning approach for classification and regression problems (Sruthi ER, 2023). As we know, the goal of such a model is to forecast future results in a variety of areas, including economics, business, and sport (Rachel, 2021). As a result, this approach is often used to develop models that use certain features to predict

Motivations

The foundations of high levels of transparency in the real estate sector include strictly enforced laws and regulations, high-quality, easily accessible market information and performance benchmarks, clear and fair practices, and high professional standards. To fulfill this role and operate efficiently, the real estate sector needs to be highly transparent. These foundations enable governments to operate efficiently, bringing long-term benefits to local communities and the environment, while helping businesses and investors to make decisions with confidence (Jeremy, 2018).

When people decide to purchase a home, they search for one that fits all of their specifications and is affordable. With the aid of machine learning, we can estimate home prices and determine whether a particular home is better suited for purchase or for sale at a higher price. In this work, I make housing price predictions for King County, Washington (WA). Predictive algorithms for home prices in regions like King County are complicated and tough to use, because real estate sale prices may be affected by a number of independent factors. Some characteristics, such as size, location, and housing area, can influence the price significantly.

Objectives

There are a few key goals in this work that I am concentrating on:

▪ What impact does the number of bathrooms (bathrooms) have on the price of a home?

▪ What effect does the grade (grade) around the house have on the price?

▪ How does the price of a house change depending on the square footage of the home minus the basement (sqft_above)?

▪ How does the average size of indoor living space of the 15 nearest homes (sqft_living15) affect home prices?

I'll present the dataset in order to address the issues raised in the first chapter. Extracting information from raw data involves several steps. Figure 2 below illustrates these stages, starting with data collection.

Figure 2 Summary of the methodology

Summary

I described my work and laid out the project's goals in the first chapter. The remaining components of this work are a dataset introduction, my approach and findings, and an application demo.

Related works

The researchers (Madhuri et al., 2019) compared a variety of techniques, including multiple linear regression, ridge regression, LASSO regression, elastic net regression, and gradient boosting. The authors of that study examined these methods and gauged how much error each model introduces. The findings show that multiple regression is one of the most effective models for forecasting home prices, since it has a relatively low error statistic.

The author of another study (Rahadi et al., 2015) groups the elements that influence home pricing into three categories: physical state, concept, and location. A home's physical qualities are those that are visible to the naked eye, such as its size, number of bedrooms, the presence of a kitchen and garage, the presence of a garden, the size of the lot and adjacent structures, and the age of the house. Conceptual characteristics, on the other hand, are ideas that developers use to attract purchasers, such as a minimalist home, a healthy and eco-friendly atmosphere, or an upscale location. A house's price is greatly influenced by its location, because the location affects the current land price (Xiao-zhu and Ling-wei, 2013). The location also determines how convenient it is to reach family-friendly entertainment such as malls, gourmet tours, or places with breathtaking scenery, and how close public amenities like schools, campuses, hospitals, and health centers are (Kisilevich et al., 2013). Research has shown that these characteristics have a significant impact on home prices.

In conclusion, a lot of research has been done on how to predict home values using various machine learning techniques or models. For my project, I'll develop models and make predictions using both linear regression and multiple regression. The project covers King County, Washington, United States. I'll examine every feature in this dataset and decide which ones help build a strong model.

Dataset

Data collection

I obtained the data from Kaggle (Lemsalu, 2017). The dataset includes King County, Washington, home values from May 2014 to May 2015. There are 21 columns and more than 21,000 entries in the raw dataset. The price column is the dependent variable, and all other columns, aside from id and date, are independent features. The head of the raw dataset is shown here.

In Figure 3, price is the continuous dependent variable of this study, and the other factors are the independent variables.

Description dataset

• Id: the house's individual identification number

• Date: the date when the house was sold

• Bathrooms: the number of bathrooms

• Sqft_living: The home's square footage

• Sqft_lot: The lot's square footage

• Waterfront: whether the house has a waterfront view

• View: whether the house has a view

• Condition: the home's overall condition on a scale of 1 to 5

• Grade: the dwelling unit's overall grade on a scale of 1 to 13

• Sqft_above: living area of the home, excluding the basement

• Sqft_basement: the basement's dwelling area in square feet

• Yr_built: the year the house was built

• Yr_renovated: the year the house was renovated

• Zipcode: the home's zip code

• Sqft_living15: the interior living space of the homes of the 15 nearest neighbors, in square feet

• Sqft_lot15: the land lots of the 15 nearest neighbors, in square feet

Summary

In this chapter, I reviewed several studies that address the same problem with different approaches, so that the most suitable one can be chosen. I also stated which column of the data is the dependent variable and which are independent, and how many rows and columns the data has in total.

In addition, I defined the name of each column in the raw dataset.

Correlation

In essence, correlation measures the strength and direction of the linear relationship between two variables (Hauke and Kossowski, 2011). The correlation coefficient is computed with the formula given in (David Groebner, 2017).
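For reference, the standard sample form of this coefficient is written out here. It is assumed to match the cited formula; the symbols x̄ and ȳ (the sample means of x and y) and the index i are notation chosen for this sketch.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$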

The Pearson product moment correlation is the name of the function described above. The pattern of a scatter plot, like the illustration in Figure 5 below, shows whether the two variables are correlated:

The correlation coefficient, r, can be positive or negative and ranges from -1.0 (a perfect negative correlation) to +1.0 (a perfect positive correlation). There is no linear correlation between the x and y variables if r = 0. The correlation is perfect if the data points of the scatter plot all fall along a straight line. As a result, the stronger the linear relationship between the two variables, the further the correlation deviates from 0.0. The sign of the correlation coefficient shows the direction of the relationship (David Groebner, 2017).

Figure 6 Correlation between Two Variables
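The behavior of r described in this section can be checked with a few lines of NumPy. This is a minimal sketch: the function name pearson_r and the toy numbers are made up for illustration and are not taken from the King County data.

```python
import numpy as np


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Toy values: points on a rising line and on a falling line.
sqft = [1000, 1500, 2000, 2500, 3000]
price_up = [200, 300, 400, 500, 600]
price_down = [600, 500, 400, 300, 200]

r_up = pearson_r(sqft, price_up)      # perfect positive correlation
r_down = pearson_r(sqft, price_down)  # perfect negative correlation
print(r_up, r_down)
```

The same value is returned by np.corrcoef(x, y)[0, 1], which is the usual shortcut in practice.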

Linear regression

The basic equation of simple linear regression (David Groebner, 2017) describes the relationship between x, the independent variable, and y, the dependent variable (the outcome).
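In its usual textbook form, which is assumed here to be the one cited, the equation reads as follows. The symbols β₀ (intercept), β₁ (slope), and ε (random error) are notation chosen for this sketch.

$$
y = \beta_0 + \beta_1 x + \varepsilon
$$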

Multiple regression

In this project, I use multiple regression to forecast the house price from several features, such as bathrooms, grade, sqft_above, and sqft_living15. The equation for multiple regression is given in (David Groebner, 2017).
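The usual textbook form of the multiple regression equation, assumed here to match the cited one, extends the single-variable case to k independent variables. The symbols β₀, ..., β_k and ε are notation chosen for this sketch.

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon
$$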

R-squares and Adjusted R-squares

The coefficient of determination R^2 and the adjusted R^2 are probably the most frequently used statistics in regression to assess how well a model fits the data. These statistics indicate how much of the variation in the response is explained by the model (Akossou and Palm, 2013).

Figure 9 R-squares and Adjusted R-squares 1

The multiple coefficient of determination, R-squared, statistically measures how well the regression line represents the actual data points. In other words, it shows how closely the data match the regression model. R-squared values normally range from 0 to 1, that is, from 0% to 100%. A negative R-squared value indicates that the model performs poorly (Chicco et al., 2021). As an illustration, if the R-squared value is equal to 0.8, then the independent variables account for 80% of the variation in the target variable. The higher the R-squared score, the better the model fits.

Adjusted R-squared measures the percentage of variance explained by only those independent variables that have a substantial influence on the dependent variable. Adjusted R-squared rises only when a newly added independent variable has an impact on the dependent variable.

Figure 10 R-squares and Adjusted R-squares 2

When I add an independent variable to the model, R-squared rises or stays the same even if the variable is unimportant; R-squared never falls. Adjusted R-squared, however, drops when an irrelevant independent variable is included in the model. Adjusted R-squared is therefore always equal to or less than R-squared.
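The two statistics can be compared on synthetic data. The sketch below assumes the usual definitions, R² = 1 − SS_res / SS_tot and adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1) for n rows and k features. The helper name fit_r2 and the random data are illustrative only and have no connection to the house data.

```python
import numpy as np


def fit_r2(X, y):
    """Fit least squares with an intercept; return (R-squared, adjusted R-squared)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    residuals = y - A @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2


rng = np.random.default_rng(0)
n = 200
useful = rng.normal(size=n)          # feature that really drives y
noise_feature = rng.normal(size=n)   # feature unrelated to y
y = 3.0 * useful + rng.normal(size=n)

r2_one, adj_one = fit_r2(useful.reshape(-1, 1), y)
r2_two, adj_two = fit_r2(np.column_stack([useful, noise_feature]), y)
print(f"one feature : R2={r2_one:.4f}  adjusted={adj_one:.4f}")
print(f"two features: R2={r2_two:.4f}  adjusted={adj_two:.4f}")
```

Adding the unrelated column cannot lower R-squared, while adjusted R-squared is penalized for the extra column and may go down.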

Model accuracy

Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Square Error (RMSE) are the metrics I use to assess the model's accuracy.

MAE (Mean Absolute Error) is the average absolute error between actual and predicted values. The row-level error computation, known as absolute error or L1 loss, gives the non-negative difference between the prediction and the actual value. Examining the MAE, which is the average of these errors, allows us to judge the model's performance across the entire dataset.

MSE (Mean Squared Error) is the average squared difference between actual and predicted values. The row-level error computation, known as squared error or L2 loss, squares the difference between the prediction and the actual value. By looking at the MSE, which is the average of these errors, we can assess the model's performance across the entire dataset.

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far the data points are from the regression line, and RMSE measures how spread out these residuals are.

For these error metrics, the best value is 0 and the worst value is +∞ (Chicco et al., 2021). Therefore, the smaller the value, the better the model.
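The three metrics follow directly from their definitions. The sketch below uses NumPy only; the function name regression_errors and the four price pairs are made up for illustration.

```python
import numpy as np


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted            # row-level residuals
    mae = float(np.abs(errors).mean())     # average absolute error (L1)
    mse = float((errors ** 2).mean())      # average squared error (L2)
    rmse = float(np.sqrt(mse))             # back in the units of the target
    return mae, mse, rmse


# Hypothetical actual and predicted prices.
actual = [300_000, 450_000, 520_000, 610_000]
predicted = [310_000, 430_000, 520_000, 650_000]

mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, rmse)
```

RMSE is never smaller than MAE, because squaring gives large errors more weight. Scikit-learn exposes the same quantities as mean_absolute_error and mean_squared_error in sklearn.metrics.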

Summary

I've presented the formulas and some related theory in this chapter. In the next chapter, I present my work, back it up with evidence, and answer a few questions that support it.

Package installation

Step 1: Install basic packages for this work

For data exploration, analysis, and application, three packages must be installed: Pandas, Numpy, and Streamlit. They can be installed with pip or Anaconda:

Figure 14 Step 1: Install basic packages for this work

Step 2: Install packages for data visualization

Seaborn and Matplotlib are the two packages I'll use for visualization

Figure 15 Step 2: Install packages for data visualization

Step 3: Install packages for modeling

For modeling, I use the Scikit-learn package, which needs:

With pip, install Scikit-learn

Figure 16 Step 3: Install packages for modeling 1

Figure 17 Step 3: Install packages for modeling 2

After installing all of the required packages for this work, I will import all of them in Jupyter Notebook:

NumPy is a Python package that provides a multidimensional array object, several derived objects (such as masked arrays and matrices), and a selection of routines for fast operations on arrays (Numpy, 2023). I import NumPy as np.

Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. Pandas helps us answer questions such as: Is there a correlation between two or more columns? What is the mean? What is the maximum value? What is the minimum value? (W3schools, 2023) I import Pandas as pd.

Matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function modifies a figure in some way, such as by creating a figure, creating a plotting area within a figure, plotting some lines in a plotting area, or labeling the plot (Matplotlib, 2023). I import matplotlib.pyplot as plt.

Seaborn is a Python data visualization library based on matplotlib. It offers a high-level interface for creating attractive and informative statistical graphics (Seaborn, 2023). I import seaborn as sns.

The Folium library gives Python access to the mapping capabilities of the Leaflet JavaScript framework through a Python API. You can use it to produce interactive geographic visualizations that you can publish as websites (Martin, 2023).
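Before importing everything in the notebook, it can help to confirm that the packages are actually installed. The following is a minimal sketch using only the standard library. The list REQUIRED is an assumption assembled from the packages named in this chapter, plus statsmodels, which is assumed to be the library behind the sm alias used later. Note that Scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names (not pip names) of the packages this work relies on.
REQUIRED = [
    "numpy", "pandas", "streamlit", "matplotlib",
    "seaborn", "sklearn", "statsmodels", "folium",
]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install before running the notebook:", ", ".join(missing))
else:
    print("All packages found.")
```

Once nothing is reported missing, the conventional aliases described above (np, pd, plt, sns) can be imported at the top of the notebook.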

Performance of scenarios

After importing the required libraries and packages, we need to load the data so that we can start analyzing it. In Figure 19, I read a comma-separated values (csv) file into a DataFrame using the pd.read_csv function, which also supports iterating over the file or reading it in chunks.

Once the data is loaded, I use the [data.head()] command to see what it contains and [data.shape] to determine how many rows and columns it has. I use [data.info()] to check each column's data type, how many non-null values each column holds, and how much memory the data uses.

Figure 20 Statements to describe data information
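The loading and inspection steps can be sketched as follows. To keep the example self-contained, it reads three made-up rows from an in-memory string instead of the Kaggle file; the values in SAMPLE_CSV are hypothetical, and only the column names come from the dataset description. With the real file, a path would be passed to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical rows that mimic a few columns of the dataset.
SAMPLE_CSV = """id,date,price,bathrooms,sqft_living,grade
1,20140502T000000,250000.0,1.0,1200,7
2,20141115T000000,540000.0,2.25,2600,8
3,20150310T000000,180000.0,1.0,800,6
"""

data = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(data.head())   # first rows of the DataFrame
print(data.shape)    # (number of rows, number of columns)
data.info()          # dtypes, non-null counts, memory usage
```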

Once I understood the information in the data, I used the [data.corr()] command to find the pairwise correlation of all the columns in the data frame. I also use [sns.heatmap()], since it takes a 2D dataset that can be coerced into an ndarray and plots it as a rectangular, color-coded matrix. This makes it quicker and simpler to locate the column pairs of the data frame that are correlated.

With the heatmap, I was able to see the correlation between the column pairs in the dataframe. As can be seen, the correlation ranges from -1 to 1. The correlation between a pair of columns is high if the color is vivid and the number is close to one, and low otherwise. I want to identify the columns that are highly correlated with price, because these columns help explain why properties are so expensive. I therefore choose columns that have a correlation with price of at least 0.5:
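The selection rule described above, keeping columns whose correlation with price is at least 0.5, can be sketched with pandas alone. The small DataFrame below is made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical values; the real data has more than 21,000 rows.
data = pd.DataFrame({
    "price": [200, 310, 390, 520, 600, 710],
    "sqft_living15": [1000, 1400, 2100, 2400, 3100, 3400],
    "grade": [5, 6, 7, 8, 9, 10],
    "condition": [3, 5, 1, 4, 2, 3],
})

corr = data.corr()                         # pairwise Pearson correlations
price_corr = corr["price"].drop("price")   # correlation of each column with price
selected = price_corr[price_corr >= 0.5].sort_values(ascending=False)
print(selected)
```

Passing corr to sns.heatmap(corr, annot=True) draws the same matrix as a color-coded grid with the values printed in each cell.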

After getting the correlated pairs from the heatmap, I set the variable y to the price column of the data and f to sqft_living15.

I then need a statistical model. By default it fits a line passing through the origin, i.e. it does not fit an intercept, so I add a constant term to the sqft_living15 data before fitting.

I call sm.OLS with two array-like objects, y and x3, as input. y is generally a Pandas DataFrame or a NumPy array.
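The effect of the constant column can be shown without any statistics package. The following is a NumPy-only sketch of the least-squares fit that a call such as sm.OLS(y, x3).fit() performs, assuming sm refers to statsmodels.api; the function name fit_line and the toy numbers are hypothetical.

```python
import numpy as np


def fit_line(x, y, add_constant=True):
    """Least-squares fit of y on x, with or without an intercept column."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if add_constant:
        # Column of ones first, like the "const" column added before fitting.
        X = np.column_stack([np.ones_like(x), x])
    else:
        X = x.reshape(-1, 1)  # no intercept: the line is forced through the origin
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


x = np.array([1.0, 2.0, 3.0, 4.0])
y = 50.0 + 10.0 * x  # points that lie exactly on y = 50 + 10x

with_const = fit_line(x, y)                          # [intercept, slope]
through_origin = fit_line(x, y, add_constant=False)  # [slope]
print(with_const, through_origin)
```

With the constant, the fit recovers the intercept 50 and the slope 10. Without it, the single slope has to absorb the intercept and the fitted line no longer matches the data.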

I use the scatter function to draw the scatter plot, with sqft_living15 on the x-axis and price on the y-axis.

I was then able to draw the same scatter plot for the other correlated column pairs I found from the heatmap.

Figure 25 Price versus Number of bathrooms

Figure 27 Price versus Square Feet of the houses excluding basement

Figure 28 Price versus Square Feet of 15 closest neighbors’ houses

b) Scenarios

What effect does the grade (grade) around the house have on the price?

Figure 29 OLS Regression Result between grade and price

With grade alone, R-squared = 0.445, which means that grade explains 44.5% of the variation in actual house prices. The figure below shows how strong the relationship between grade and house prices is.

Figure 30 Model visualization of grade and price

The formula of this model: y = 2.085e+05*x + (-1.056e+06)
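As a worked example, the fitted line can be evaluated for a given grade. The coefficients below are copied from the formula above; the function name predict_price and the input grade of 8 are chosen only for illustration.

```python
SLOPE = 2.085e+05       # price change per one-point increase in grade
INTERCEPT = -1.056e+06  # value of the line at grade 0


def predict_price(grade):
    """Price predicted by the fitted line y = SLOPE * x + INTERCEPT."""
    return SLOPE * grade + INTERCEPT


example = predict_price(8)  # 2.085e5 * 8 - 1.056e6 = 612,000
print(example)
```

Because the intercept is negative, the line gives negative prices for grades below about 5 (1.056e6 / 2.085e5 ≈ 5.06), so this straight-line model is only meaningful for higher grades.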

Conclusions

Grade alone explains 44.5% of the variation in house prices, which is a high share for a single predictive feature. From this, it can be seen that the slope of the ground around the house affects its price, because I don't think anyone wants their surroundings to be wet all the time. Standing water around the house has harmful effects, such as moss that makes the yard slippery, more mosquitoes, and a worse appearance.

Future works

If I had more time, I would study more skills, such as data processing and data cleansing, so that I can evaluate data better. Also, knowing the zip code will help draw more conclusions about how the seasons (spring, summer, autumn, and winter) affect home pricing. From there, more data can be compared to get the best project results.

Sruthi E R, (2023) Understand Random Forest Algorithms With Examples (Updated 2023). [online] https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/ [March 4, 2023]

Rachel M, (2021) What is Random Forest. [online] https://careerfoundry.com/en/blog/data-analytics/what-is-random-forest/#:~:text=Random%20Forest%20is%20used%20for,to%20name%20just%20a%20few! [March 4, 2023]

Jeremy K, (2023) Why a transparent property market helps cities succeed. [online] https://www.weforum.org/agenda/2018/10/transparent-real-estate-property-market-success-cities/ [March 4, 2023]

Numpy, (2023) What is Numpy?. [online] https://numpy.org/doc/stable/user/whatisnumpy.html [March 4, 2023]

W3schools, (2023) Pandas Introduction. [online] https://www.w3schools.com/python/pandas/pandas_intro.asp#:~:text=What%20is%20Pandas%3F,by%20Wes%20McKinney%20in%202008 [March 4, 2023]

Matplotlib, (2023) Pyplot tutorial. [online] https://matplotlib.org/stable/tutorials/introductory/pyplot.html [March 4, 2023]

Seaborn, (2023) seaborn: statistical data visualization. [online] https://seaborn.pydata.org/ [March 4, 2023]

Martin B, (2023) Python Folium: Create Web Maps From Your Data. [online] https://realpython.com/python-folium-web-maps-from-data/ [March 4, 2023]

Madhuri, C. R., Anuradha, G. & Pujitha, M. V. 2019. House price prediction using regression techniques: a comparative study. 2019 International conference on smart structures and systems (ICSSS), IEEE, 1-5.

Rahadi, R. A., Wiryono, S. K., Koerindartoto, D. P. & Syamwil, I. B. 2015. Factors influencing the price of housing in Indonesia. International Journal of Housing Markets and Analysis.

Xiao-zhu, D. & Ling-wei, K. 2013. The land prices and housing prices: Empirical research based on panel data of 11 provinces and municipalities in Eastern China. 2013 International Conference on Management Science and Engineering 20th Annual Conference Proceedings, IEEE, 2118-2123.

Kisilevich, S., Keim, D. & Rokach, L. 2013. A GIS-based decision support system for hotel room rate estimation and temporal price prediction: The hotel brokers' context. Decision Support Systems, 54, 1119-1133.

Hauke, J. & Kossowski, T. 2011. Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data. Quaestiones geographicae, 30, 87.

David Groebner, P. S., Phillip Fry. 2017. Business Statistics: A Decision-Making Approach, Pearson.

Akossou, A. & Palm, R. 2013. Impact of data structure on the estimators R-square and adjusted R-square in linear regression. Int J Math Comput, 20, 84-93.

Chicco, D., Warrens, M. J. & Jurman, G. 2021. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Computer Science, 7, e623.

The relationship between the area of grade and house prices

I take the variable f, which represents the independent value, and the variable y, which represents the dependent value of price, as I described in Chapter 2. x3 = sm.add_constant(x2) adds a first column "const" with the value 1.0 to the dataframe. Figure 26 is then made from that.

Next, we use the sm.OLS class and its initialization function, OLS(y, X), to regress the response on the predictor. Two array-like objects, X and y, are the inputs for this procedure.

Date posted: 08/05/2024, 14:39