Report on the introduction process to business analysis


REPORT ON THE INTRODUCTION PROCESS

TO BUSINESS ANALYSIS

Instructor: Dr. PHAM THAI KY TRUNG
Authors: PHAN HOANG PHUC – Student ID: 520H0278

TRUONG TUAN AN – Student ID: 520H0446

Class: 20H50304 – 20H50302

Cohort: 24

HO CHI MINH CITY, 2021


Dear Mr. Pham Thai Ky Trung,

We would like to express our deep gratitude to Mr. Pham Thai Ky Trung for devoting his time and knowledge to guiding and supporting us while we researched and wrote this report. His dedicated teaching gave us important advice for our learning and development.

He is not only a professional instructor but also a source of inspiration and motivation that helped us overcome difficulties in the research process. His comments, advice and solutions helped us refine and complete this report.

We sincerely thank Mr. Pham Thai Ky Trung for his dedication and valuable contributions. His help and guidance enabled us to present this report accurately and reliably.


THE PROJECT WAS COMPLETED AT TON DUC THANG UNIVERSITY

I hereby declare that this project is my own work, carried out under the guidance of Mr. Pham Thai Ky Trung. The research content and results in this report are honest and have not been published in any form before. The data in the tables used for analysis, comments and evaluation were collected by the author from different sources and are clearly stated in the reference section.

In addition, the project uses a number of comments, assessments and data from other authors and organizations, all with citations and source notes.

If any fraud is discovered, I will take full responsibility for the content of my project. Ton Duc Thang University is not involved in any copyright violations caused by me during the implementation process (if any).

Ho Chi Minh City, 20 October 2023

Author (signature and full name) Phúc Phan Hoàng Phúc


INSTRUCTOR VERIFICATION AND EVALUATION SECTION

Confirmation from the instructor

_ _ _ _ _ _ _

Ho Chi Minh City, day, month, year (signature and full name)

Evaluation by the grading instructor

_ _ _ _ _ _ _

Ho Chi Minh City, day, month, year (signature and full name)


This exercise evaluates the progress we have made through studying in class and at home. We focus on four algorithms: Linear Regression, k-Nearest Neighbor, Conditional Decision Trees, and Gaussian Naive Bayes (GaussianNB). In this subject, these algorithms are used for classification and prediction in business.


TABLE OF CONTENTS

THANKS

THE PROJECT WAS COMPLETED AT TON DUC THANG UNIVERSITY

INSTRUCTOR VERIFICATION AND EVALUATION SECTION

SUMMARY

DETERMINING THE PROBLEM

DATA USED FOR ANALYSIS

DESCRIPTION OF THE ALGORITHM

I. Description of Linear Regression

Input data, training data, test data and source code of Linear Regression

II. Description of k-Nearest Neighbor (KNN)

Input data, training data, test data and source code of k-Nearest Neighbor (KNN)

III. Description of Conditional Decision Trees

Input data, training data, test data and source code of Conditional Decision Trees

IV. Description of Gaussian Naive Bayes (GaussianNB)

Input data, training data, test data and source code of Gaussian Naive Bayes (GaussianNB)

REFERENCES


DETERMINING THE PROBLEM

Description of the business problem:

Business Problem: Optimizing SUV Sales - Combining Social Networks, Age and Salary

In the car business, combining SUV sales through social networks with predictions of car buying decisions based on age and salary gives businesses an opportunity not only to create attractive products but also to better understand customers' needs and wants.

1. Selling SUVs - A Point of Pride in the High-End Automobile Segment:

SUVs, with their luxurious and versatile design, are a popular choice in the high-end car segment. The business problem revolves around developing a sales strategy that showcases the outstanding features of SUVs and highlights the luxury and comfort of the product.

2. Using Social Networks - Brand Building and Customer Interaction:

Social networks are powerful tools for connecting and interacting with customers. Through platforms like Facebook and Instagram, businesses can share images, videos and user reviews, creating curiosity and a desire to test the product.

3. Predicting Car Buying Decisions Based on Age:

Categorizing customers by age helps identify trends and priorities. Younger buyers may look for technology features, while older buyers may be more interested in convenience and safety. Marketing strategies can be optimized to reflect the needs of each audience.

4. Salary Evaluation - Customizing Financial Packages:

Information about salaries helps businesses propose flexible financial packages. It also helps determine the appropriate price segment to match customers' ability to pay.

Conclusion:

By optimizing SUV sales through social networks and predicting car buying decisions from age and salary, businesses have the opportunity not only to drive sales but also to build long-term relationships with customers. A flexible combination of products, advertising and financial incentives can lead to a better shopping experience for a diverse audience in the automotive market.


DATA USED FOR ANALYSIS

The dataset for the problem of selling SUVs through social networks and predicting car buying decisions from age and salary is a table with the following fields:

1. User ID: A unique identifier for each customer, used to track and manage customer records.

2. Gender: The customer's gender, which can help optimize marketing strategies based on differences between men and women in car purchasing decisions.

3. Age: The customer's age, used to group customers by age and better understand the buying tendency of each group.

4. Estimated Salary: The customer's annual income or salary, which helps businesses propose financial packages and incentives suited to the customer's ability to pay.

5. Purchase Decision (Purchased): The customer's purchase decision, which the models predict from the Age and Estimated Salary fields. It is encoded as 0 (did not buy) or 1 (bought).

Example of dataset structure:


This dataset provides basic information about customers, helping to classify them and better understand the target audience. We provide all of this data in a CSV file called "Social_Network_Ads.csv".
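As a rough illustration of the table described above, the sketch below reads data with these five fields into pandas and separates the two predictors from the target. The column name `EstimatedSalary`, the helper `load_ads` and the four sample rows are assumptions made up for this example; they are not taken from the report's file. In practice the path "Social_Network_Ads.csv" would be passed to the helper instead of the in-memory sample.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are NOT taken from the report's file.
SAMPLE_CSV = """User ID,Gender,Age,EstimatedSalary,Purchased
1001,Male,25,30000,0
1002,Female,47,90000,1
1003,Female,33,52000,0
1004,Male,52,110000,1
"""


def load_ads(source):
    """Read the ads table and split it into features X and target y."""
    df = pd.read_csv(source)
    X = df[["Age", "EstimatedSalary"]]  # the two predictors named in the report
    y = df["Purchased"]                 # 0 = did not buy, 1 = bought
    return df, X, y


# Hypothetical usage: replace the StringIO object with the real file path.
df, X, y = load_ads(io.StringIO(SAMPLE_CSV))
print(df.head())
print(X.shape, y.tolist())
```

User ID and Gender are loaded but left out of `X` here, because the report predicts the purchase decision from age and salary only.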

DESCRIPTION OF THE ALGORITHM

This section describes the four algorithms used in this report:

I. Description of Linear Regression:

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal of linear regression is to find the line (or hyperplane in the high-dimensional case) that minimizes the total squared deviation between the predicted values and the actual values.

Detailed description:

1. Dependent Variable (Y): This is the variable we want to predict. It is called the dependent variable because its value depends on the values of the independent variables.

2. Independent Variable (X): This is the variable used to predict the value of the dependent variable. Linear regression can be performed with one independent variable (simple linear regression) or with multiple independent variables (multiple linear regression).

3. Linear model: In the simple case, the linear regression model can be represented by the formula Y = mx + b, where:

• Y is the dependent variable

• x is the independent variable

• m is the slope of the line

• b is the intercept of the line (the value of Y when x = 0)

4. Model Optimization: The goal is to adjust the coefficients m and b so that the line's predictions are as close as possible to the actual values. This is usually done by minimizing the sum of squared errors (the least squares method).

Linear regression is widely used to predict and analyze relationships between variables in fields such as economics, medicine and data science.
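The least squares step for the simple model Y = mx + b can be sketched in a few lines of plain Python. The function name `fit_line` and the toy points are assumptions chosen for illustration. The slope is the covariance of x and Y divided by the variance of x, and the intercept follows from the two means.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing the sum of squared errors of y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Toy points that lie exactly on y = 2x + 1 (invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
print(m, b)  # 2.0 1.0
```

With noisy data the points no longer lie on one line, and the same formulas return the line with the smallest total squared error.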


Input data, training data, test data and source code of Linear Regression:

1. For the input data, we use the CSV file we provide, as described in the DATA USED FOR ANALYSIS section. The same data is used for all models in this report.

Code for the input data:

2. In our code, `X4_train` and `y4_train` are the variables that hold the training data for the linear regression model:

▪ `X4_train`: This is the feature matrix containing the input features used to train the linear regression model. Each row of this matrix corresponds to a training sample, and each column corresponds to a feature. The dimensions of `X4_train` should be `(number of samples, number of features)`.

▪ `y4_train`: This is the target variable, or labels, associated with the training samples in `X4_train`. It contains the correct output (dependent variable) value for each training sample. The length of `y4_train` should be equal to the number of samples in the training data.

The training data must therefore be prepared and assigned to `X4_train` and `y4_train` before the model is fitted. Code for the training data:

3. In the code y4_pred = LR.predict(X4_test), X4_test is the test data on which the trained linear regression (LR) model makes predictions. As with the training data:

▪ X4_test: This is the feature matrix containing the input features for the test data. Each row of this matrix corresponds to a test sample, and each column corresponds to a feature. The size of X4_test should be (number of test samples, number of features).

Here is our code for the test data:
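A minimal sketch of the training and test steps described above, assuming scikit-learn and NumPy are available. The names `X4_train`, `y4_train`, `X4_test`, `y4_pred` and `LR` follow the text; the synthetic data, the 25% test split and the 0.5 threshold are assumptions made for this example and are not the report's actual settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Age / Estimated Salary columns (not the report's data).
rng = np.random.default_rng(0)
age = rng.integers(18, 60, size=200)
salary = rng.integers(15_000, 150_000, size=200)
X4 = np.column_stack([age, salary])
# Synthetic 0/1 target: older or better-paid customers are marked as buyers.
y4 = ((age > 40) | (salary > 100_000)).astype(int)

# Hold out 25% of the rows as test data (the ratio is an assumption).
X4_train, X4_test, y4_train, y4_test = train_test_split(
    X4, y4, test_size=0.25, random_state=0
)

LR = LinearRegression()
LR.fit(X4_train, y4_train)     # learn the coefficients from the training data
y4_pred = LR.predict(X4_test)  # continuous scores, not 0/1 labels

# Linear regression outputs real numbers, so threshold at 0.5 to get a class.
y4_label = (y4_pred >= 0.5).astype(int)
print(X4_train.shape, X4_test.shape)
print("accuracy:", (y4_label == y4_test).mean())
```

Because the target is 0 or 1, the raw predictions are scores rather than classes; thresholding them is one simple way to compare linear regression with the classifiers in the following sections.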

II. Description of k-Nearest Neighbor (KNN):

k-Nearest Neighbors (KNN) is a simple and versatile algorithm used for many kinds of classification and prediction problems. KNN is based on the principle that data points that are close to each other tend to belong to the same class or have similar values. The algorithm does not build an explicit prediction model; instead, it uses the training data directly and makes predictions based on similarity to the closest points.

Detailed description of k-Nearest Neighbor (KNN):

1. Choose the value of k: Before running KNN, you need to choose a value of k, the number of nearest neighbors that the algorithm will consider when making a prediction.

2. Measuring Similarity: KNN uses a distance measure (usually Euclidean distance) to measure the similarity between data points. The smaller the distance, the more similar the data points are.

3. Determine the k nearest neighbors: Based on the distance measurement, select the k data points closest to the point to be predicted.

4. Classification or Prediction: For a classification problem, use a majority vote over the classes of the k neighbors to determine the class of the point to be predicted. For a regression problem, use the mean or median of the values of the k neighbors.

5. Evaluation: Evaluate the model using metrics such as accuracy, F1 score, or other metrics appropriate to the specific problem.
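Steps 1 to 4 above can be sketched in plain Python for the classification case. The function name `knn_predict` and the six sample points are assumptions invented for illustration. The sketch uses Euclidean distance and a majority vote.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented (age, salary in thousands) points; 1 = bought, 0 = did not buy.
points = [(22, 25), (25, 32), (30, 40), (45, 90), (50, 110), (55, 95)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (48, 100), k=3))  # 1
print(knn_predict(points, labels, (24, 30), k=3))   # 0
```

There is no training step in this sketch: all the work happens at prediction time, which is why KNN becomes slow as the training set grows.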

• Advantages and disadvantages of KNN:

o Advantages:

▪ Easy to understand and implement

▪ Works with many types of data

o Disadvantages:

▪ Slow when the amount of data is large

▪ Sensitive to noisy and high-dimensional data

▪ The value of k must be chosen appropriately to avoid overfitting or underfitting

Input data, training data, test data and source code of k-Nearest Neighbor (KNN):

1. For the input data, we use the CSV file we provide, as described in the DATA USED FOR ANALYSIS section. The same data is used for all models in this report.

Code for the input data:

2. In our code, X_train and y_train are the variables that hold the training data for the k-nearest neighbors (kNN) classifier:

▪ X_train: This is the feature matrix containing the input features used to train the kNN classifier. Each row of this matrix corresponds to a training sample, and each column corresponds to a feature. The dimensions of X_train should be (number of samples, number of features).

▪ y_train: This is the target variable, or labels, associated with the training samples in X_train. It contains the correct class label for each training sample. The length of y_train should be equal to the number of samples in the training data.

The training data must therefore be prepared and assigned to X_train and y_train before the classifier is fitted. Code for the training data:

3. In the code y_pred = kNN.predict(X_test), X_test is the test data on which the trained k-nearest neighbors (kNN) classifier makes predictions. As with the training data:

▪ X_test: This is the feature matrix containing the input features for the test data. Each row of this matrix corresponds to a test sample, and each column corresponds to a feature. The dimensions of X_test should be (number of test samples, number of features).

Here is our code for the test data:
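A minimal sketch of how these variables fit together, assuming scikit-learn and NumPy are available. The names `X_train`, `y_train`, `X_test`, `y_pred` and `kNN` follow the text; the synthetic data, the 25% test split, the feature scaling and `n_neighbors=5` are assumptions made for this example and are not the report's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Age / Estimated Salary columns (not the report's data).
rng = np.random.default_rng(1)
age = rng.integers(18, 60, size=200)
salary = rng.integers(15_000, 150_000, size=200)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 40) | (salary > 100_000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale both features so that salary does not dominate the Euclidean distance.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test = scaler.transform(X_test)        # reuse the same scaling for test data

kNN = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an assumed value
kNN.fit(X_train, y_train)
y_pred = kNN.predict(X_test)
print("accuracy:", kNN.score(X_test, y_test))
```

Scaling matters for KNN in particular: without it, a salary difference of a few thousand would outweigh any difference in age when distances are computed.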

III. Description of Conditional Decision Trees:

Conditional Decision Trees are a form of decision tree in which each decision is made based on a specific condition. At each node in the tree, a condition is tested and the data is split according to the result of that condition.

Detailed description of Conditional Decision Trees:

1. Divide data based on conditions: At each node in the tree, a condition is defined on one or more attributes of the data. The data is then split into branches according to whether each sample satisfies the condition or not.

2. Decision at Root Node: The root node of the tree represents the entire dataset. Based on the condition at the root node, the data is divided into child branches.

3. Building the tree:

o Branch Node: Each branch node of the tree represents a condition. If a data sample satisfies the condition, it follows the "true" branch; otherwise, it follows the "false" branch.

o Leaf Node: When a data sample arrives at a leaf node, it is assigned a predicted value or class.

4. Recursive splitting: The process of splitting the data and building the tree is repeated recursively for each branch. The conditions are determined from the properties of the data.

5. Stopping Criteria: The recursive process stops when certain stopping criteria are reached, such as a maximum depth of the tree, a minimum number of samples per leaf node, or no further reduction in classification error.
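The splitting and stopping steps above can be illustrated with scikit-learn's `DecisionTreeClassifier`. Note that this is a standard CART-style tree used here as a stand-in; it is not necessarily the exact "conditional" tree variant the report refers to. The synthetic data and the values `max_depth=3` and `min_samples_leaf=5` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the Age / Estimated Salary columns (not the report's data).
rng = np.random.default_rng(2)
age = rng.integers(18, 60, size=200)
salary = rng.integers(15_000, 150_000, size=200)
X = np.column_stack([age, salary])
y = ((age > 40) | (salary > 100_000)).astype(int)

# Stopping criteria: limit the depth and require at least 5 samples per leaf.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X, y)

# Print the learned conditions: one threshold test per branch node,
# and a predicted class at each leaf.
print(export_text(tree, feature_names=["Age", "EstimatedSalary"]))
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

The printed rules show the structure described in steps 2 and 3: the first line is the root condition, indented lines are child branches, and lines ending in a class are leaf nodes.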

• Advantages and disadvantages of Conditional Decision Trees:

o Advantages:

Title	Report on the Introduction Process to Business Analysis
Authors	Phan Hoang Phuc, Truong Tuan An
Instructor	Dr. Pham Thai Ky Trung
University	Ton Duc Thang University
Major	Information Technology
Type	Report
Year of publication	2021
City	Ho Chi Minh City
