REPORT ON THE INTRODUCTION PROCESS
TO BUSINESS ANALYSIS
Instructor: Dr. PHAM THAI KI TRUNG
Authors: PHAN HOANG PHUC – Student ID: 520H0278
TRUONG TUAN AN – Student ID: 520H0446
Class: 20H50304 – 20H50302
Cohort: 24
HO CHI MINH CITY, 2021
THANKS
Dear teacher Pham Thai Ky Trung,
We would like to express our deep gratitude to Mr. Pham Thai Ky Trung for devoting his time and knowledge to teaching, guiding and supporting us while we researched and wrote this report. His dedication and teaching gave us important advice for our learning and development.
He is not only a professional instructor but also a source of inspiration and motivation that helped us overcome difficulties during the research. His comments, advice and suggested solutions helped us refine and complete this report.
We sincerely thank Mr. Pham Thai Ky Trung for his dedication and valuable contributions. His help and guidance enabled us to present the report accurately and reliably.
THE PROJECT WAS COMPLETED AT TON DUC THANG UNIVERSITY
I hereby declare that this project is my own work, carried out under the guidance of teacher Pham Thai Ky Trung. The research content and results in this report are honest and have not been published in any form before. The data in the tables used for analysis, comments and evaluation were collected by the author from different sources and are clearly stated in the reference section.
In addition, the project uses a number of comments, assessments and data from other authors and organizations, all with citations and source notes.
If any fraud is discovered, I will take full responsibility for the content of my project. Ton Duc Thang University is not involved in any copyright violations caused by me during the implementation process (if any).
Ho Chi Minh City, October 20, 2023
Author (signature and full name) Phúc Phan Hoàng Phúc
INSTRUCTOR VERIFICATION AND EVALUATION SECTION
Confirmation from the instructor
_ _ _ _ _ _ _
Ho Chi Minh City, [day] [month] [year] (signature and full name)
Evaluation by the grading instructor
_ _ _ _ _ _ _
Ho Chi Minh City, [day] [month] [year] (signature and full name)
SUMMARY
This exercise evaluates the progress we have made through studying in class and at home. We focus mainly on four algorithms: Linear Regression, k-Nearest Neighbor, Conditional Decision Trees, and Gaussian Naive Bayes (GaussianNB). In this subject, these algorithms are used for classification and prediction in business.
TABLE OF CONTENTS
Contents
THANKS
THE PROJECT WAS COMPLETED AT TON DUC THANG UNIVERSITY
INSTRUCTOR VERIFICATION AND EVALUATION SECTION
SUMMARY
DETERMINING THE PROBLEM
DATA USED FOR ANALYSIS
DESCRIPTION OF THE ALGORITHM
I. Description of Linear Regression
Input data, training data, test data and source code of Linear Regression
II. Description of k-Nearest Neighbor (KNN)
Input data, training data, test data and source code of k-Nearest Neighbor (KNN)
III. Description of Conditional Decision Trees
Input data, training data, test data and source code of Conditional Decision Trees
IV. Description of Gaussian Naive Bayes (GaussianNB)
Input data, training data, test data and source code of Gaussian Naive Bayes (GaussianNB)
REFERENCES
DETERMINING THE PROBLEM
Description of the business problem:
Business Problem: Optimizing SUV Sales - Combining Social Networks, Age and Salary
In the car business, combining SUV sales through social networks with predictions of buying decisions based on age and salary gives businesses an opportunity not only to create attractive products but also to better understand customers' needs and wants.
1. Selling SUVs - Pride of the High-End Automobile Segment:
SUVs, with their luxurious and versatile design, are a popular choice in the high-end car segment. The business problem revolves around developing a sales strategy that showcases the outstanding features of SUVs and highlights the luxury and comfort of the product.
2. Using Social Networks - Brand Building and Customer Interaction:
Social networks are powerful tools for connecting and interacting with customers. Through platforms like Facebook and Instagram, businesses can share images, videos and user reviews, sparking curiosity and a desire to try the product.
3. Predicting Car Buying Decisions Based on Age:
Categorizing customers by age helps identify trends and priorities. Younger buyers may look for technology features, while older buyers may be more interested in convenience and safety. Marketing strategies can be optimized to reflect the needs of each audience.
4. Salary Evaluation - Customizing Financial Packages:
Information about salaries helps businesses propose flexible financial packages. It also helps determine the appropriate price segment to match customers' ability to pay.
Conclusion:
By optimizing SUV sales through social networks and predicting car buying decisions from age and salary, businesses can not only drive sales but also build long-term relationships with customers. A flexible combination of products, advertising and financial incentives will lead to the best shopping experience for a diverse audience in the automotive market.
DATA USED FOR ANALYSIS
The dataset for this problem (selling SUVs through social networks and predicting car buying decisions from age and salary) is a table with the following fields:
1. User ID: A unique identifier for each customer, used to track and manage personal information.
2. Gender: The customer's gender, which can help optimize marketing strategies based on differences between men and women in car purchasing decisions.
3. Age: The customer's age, used to group customers by age and better understand the buying tendency of each group.
4. Estimated Salary: The customer's annual income or salary, which helps businesses propose financial packages and incentives suited to the customer's ability to pay.
5. Purchase Decision (Purchased): The purchase decision, which is the value to be predicted from the Age and Estimated Salary data. It is coded as 0 or 1, where 0 means the customer did not buy and 1 means the customer bought.
Example of dataset structure:
This dataset provides basic information about customers, helping to classify and better understand the target audience. We provide all of this data in a CSV file (which can be opened in Excel) called "Social_Network_Ads.csv".
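As an illustration of the field list above, the following minimal sketch parses a few rows in that layout using only Python's standard library. The sample rows are invented for illustration and are not taken from Social_Network_Ads.csv. The header spelling `EstimatedSalary` and the helper name `load_rows` are assumptions.

```python
import csv
import io

# Hypothetical sample rows in the layout described above; the values are
# invented for illustration and are NOT taken from Social_Network_Ads.csv.
# The exact header spelling ("EstimatedSalary") is an assumption.
SAMPLE_CSV = """User ID,Gender,Age,EstimatedSalary,Purchased
1001,Male,19,19000,0
1002,Female,35,20000,0
1003,Female,47,25000,1
1004,Male,45,26000,1
"""

def load_rows(text):
    """Parse CSV text into a list of dicts with numeric fields converted."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "user_id": record["User ID"],
            "gender": record["Gender"],
            "age": int(record["Age"]),
            "salary": float(record["EstimatedSalary"]),
            "purchased": int(record["Purchased"]),
        })
    return rows

rows = load_rows(SAMPLE_CSV)
# Features (Age, Estimated Salary) and target (Purchased), as in the field list.
X = [[r["age"], r["salary"]] for r in rows]
y = [r["purchased"] for r in rows]
```

To read the real file instead, the same `csv.DictReader` loop could be applied to a file opened with `open("Social_Network_Ads.csv", newline="")`, provided the headers match.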
DESCRIPTION OF THE ALGORITHM
This section describes the four algorithms used in this report:
I. Description of Linear Regression:
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal of linear regression is to find the line (or hyperplane in the higher-dimensional case) such that the total deviation between the predicted values and the actual values is minimal.
Detailed description of Linear Regression:
1. Dependent Variable (Y): This is the variable we want to predict. It is called the dependent variable because its value depends on the value of the independent variable.
2. Independent Variable (X): This is the variable used to predict the value of the dependent variable. Linear regression can be performed with one independent variable (simple linear regression) or multiple independent variables (multiple linear regression).
3. Linear model: In the simple case, the Linear Regression model can be represented by the formula Y = mx + b, where:
• Y is the dependent variable
• x is the independent variable
• m is the slope of the line
• b is the intercept of the line
4. Model Optimization: The goal is to adjust the coefficients m and b so that the line's predictions are as close as possible to the actual values. This is usually done by minimizing the sum of squared errors (the least squares method).
Linear Regression is widely used to predict and analyze relationships between variables in many fields, such as economics, medicine and data science.
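The least squares step for the simple model Y = mx + b can be sketched with the standard closed-form solution. This is an illustrative from-scratch version, not the code used in this report. The function names `fit_line` and `predict` and the sample points are assumptions.

```python
def fit_line(xs, ys):
    """Fit Y = m*x + b by ordinary least squares (closed form, one feature)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = sxy / sxx
    b = y_mean - m * x_mean  # the fitted line passes through the mean point
    return m, b

def predict(m, b, x):
    """Evaluate the fitted line at x."""
    return m * x + b

# Invented points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
```

With more than one independent variable, the same idea is applied to all coefficients at once, which is what library implementations of multiple linear regression do.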
Input data, training data, test data and source code of Linear Regression:
1. For the input data, we use the data file described in the DATA USED FOR ANALYSIS section. The same data is used for all models in this report.
Code for the input data:
2. In our code, `X4_train` and `y4_train` are the variables that hold the training data for the linear regression model:
`X4_train`: This is the feature matrix containing the input features used to train the linear regression model. Each row corresponds to a training sample, and each column corresponds to a feature. The dimensions of `X4_train` should be `(number of samples, number of features)`.
`y4_train`: This is the target variable (the labels) associated with the training samples in `X4_train`. It contains the correct output, or dependent variable value, for each training sample. The length of `y4_train` should equal the number of samples in the training data.
The training data must therefore be prepared and assigned to `X4_train` and `y4_train` before the model is fitted.
Code for the training data:
3. In the statement y4_pred = LR.predict(X4_test), X4_test is the test data on which the trained linear regression model (LR) makes predictions. As with the training data:
X4_test: This is the feature matrix containing the input features for the test data. Each row corresponds to a test sample, and each column corresponds to a feature. The dimensions of X4_test should be (number of test samples, number of features).
Code for the test data:
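The following is a minimal sketch of how training and test sets with the shapes described above could be produced. It is not the code used in this report. The helper `split_train_test`, the 25% test share, the fixed seed and the sample rows are all assumptions made for illustration.

```python
import random

def split_train_test(X, y, test_ratio=0.25, seed=0):
    """Shuffle the sample indices and split features and labels together."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of samples")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_test = int(round(len(X) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Invented (Age, Estimated Salary) rows with a 0/1 Purchased label.
X = [[19, 19000], [35, 20000], [26, 43000], [27, 57000],
     [47, 25000], [45, 26000], [46, 28000], [48, 29000]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

X4_train, X4_test, y4_train, y4_test = split_train_test(X, y)
```

The model is fitted only on `X4_train` and `y4_train`. `X4_test` is kept aside so that the predictions can be compared with `y4_test` on samples the model has not seen.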
II. Description of k-Nearest Neighbor (KNN):
k-Nearest Neighbors (KNN) is a simple and versatile machine learning algorithm used for many kinds of classification and prediction problems. KNN is based on the principle that data points that are close to each other tend to belong to the same class or to have similar target values. The algorithm does not build an explicit prediction model. Instead, it uses the training data directly and makes predictions based on similarity to the closest points.
Detailed description of k-Nearest Neighbor (KNN):
1. Choose the value of k: Before running KNN, you need to choose a value of k, the number of nearest neighbors that the algorithm will consider when making predictions.
2. Measuring Similarity: KNN uses a distance metric (usually Euclidean distance) to measure the similarity between data points. The closer two data points are, the more similar they are.
3. Determine the k nearest neighbors: Based on the distance measurement, select the k data points closest to the point to be predicted.
4. Classification or Prediction: For a classification problem, use a majority vote among the classes of the k neighbors to determine the class of the point to be predicted. For a regression problem, use the mean or median of the target values of the k neighbors.
5. Model Evaluation: Evaluate the model using metrics such as accuracy, F1 score, or other metrics appropriate to the specific problem.
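Steps 1 to 4 above can be sketched from scratch as follows. This is an illustrative version using only the standard library, not the classifier used in this report. The function name `knn_predict`, the default k = 3 and the sample points are assumptions. Features on very different scales, such as Age and Estimated Salary, are usually rescaled before distances are computed. That step is omitted here.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(X_train):
        raise ValueError("k must be between 1 and the number of training samples")
    # Step 2: Euclidean distance from the query to every training sample.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(X_train, y_train)
    )
    # Step 3: keep the labels of the k closest samples.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 4: majority vote.
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented 2-D points forming two well-separated groups.
X_train = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y_train = [0, 0, 0, 1, 1, 1]

label_near_first_group = knn_predict(X_train, y_train, [2, 2], k=3)
label_near_second_group = knn_predict(X_train, y_train, [6, 5], k=3)
```

With two classes, choosing an odd k avoids tied votes.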
• Advantages and disadvantages of KNN:
o Advantages:
- Easy to understand and implement
- Works with many types of data
o Disadvantages:
- Slow when the amount of data is large
- Sensitive to noisy data and to high-dimensional data
- An appropriate value of k must be chosen to avoid overfitting or underfitting
Input data, training data, test data and source code of k-Nearest Neighbor (KNN):
1. For the input data, we use the data file described in the DATA USED FOR ANALYSIS section. The same data is used for all models in this report.
Code for the input data:
2. In our code, X_train and y_train are the variables that hold the training data for the k-nearest neighbors (kNN) classifier:
X_train: This is the feature matrix containing the input features used to train the kNN classifier. Each row corresponds to a training sample, and each column corresponds to a feature. The dimensions of X_train should be (number of samples, number of features).
y_train: This is the target variable (the labels) associated with the training samples in X_train. It contains the correct output, or class label, for each training sample. The length of y_train should equal the number of samples in the training data.
The training data must therefore be prepared and assigned to X_train and y_train before the classifier is fitted.
Code for the training data:
3. In the statement y_pred = kNN.predict(X_test), X_test is the test data on which the trained k-nearest neighbors classifier (kNN) makes predictions. As with the training data:
X_test: This is the feature matrix containing the input features for the test data. Each row corresponds to a test sample, and each column corresponds to a feature. The dimensions of X_test should be (number of test samples, number of features).
Code for the test data:
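Once predictions such as y_pred are available, they can be compared with the true test labels using the metrics mentioned in the evaluation step. The sketch below computes accuracy and the F1 score from scratch. The function names and the sample label lists are assumptions made for illustration, not results from this report.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented labels: 1 = purchased, 0 = not purchased.
y_test = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
```

Accuracy treats both classes equally, while the F1 score focuses on how well the positive class (here, buyers) is identified.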
III. Description of Conditional Decision Trees:
Conditional Decision Trees are a form of decision tree in which each decision is made based on a specific condition. At each node in the tree, a condition is tested and the data is split according to the result of that condition.
Detailed description of Conditional Decision Trees:
1. Divide data based on conditions: At each node in the tree, a condition is defined on one or more attributes of the data. The data is then split into branches according to whether each sample satisfies the condition or not.
2. Decision at Root Node: The root node of the tree represents the entire dataset. Based on the condition at the root node, the data is divided into child branches.
3. Building the tree:
o Internal Node: Each internal node of the tree represents a condition. If a data sample satisfies the condition, it follows the "true" branch. Otherwise, it follows the "false" branch.
o Leaf Node: When a data sample arrives at a leaf node, it is assigned a predicted value or class.
4. Further data splitting: The process of splitting the data and building the tree is repeated recursively for each branch. The conditions are determined from the attributes of the data.
5. Stopping Criteria: The recursive process stops when certain stopping criteria are met, such as a maximum tree depth, a minimum number of samples per leaf node, or a sufficiently small classification error.
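The recursive procedure above can be sketched as a small threshold-based tree. This is an illustrative version, not the implementation used in this report. It chooses splits by Gini impurity, which is an assumption because the report does not state the splitting criterion. The function names, the default stopping values and the sample rows are also assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the node is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    """Return (feature_index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)  # a split must improve on the unsplit node
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    """Recursively split until a stopping criterion is met."""
    majority = Counter(y).most_common(1)[0][0]
    # Stopping criteria: depth limit, too few samples, or a pure node.
    if depth >= max_depth or len(y) < min_samples or gini(y) == 0.0:
        return {"leaf": majority}
    split = best_split(X, y)
    if split is None:
        return {"leaf": majority}
    f, threshold = split
    left = [i for i, row in enumerate(X) if row[f] <= threshold]
    right = [i for i, row in enumerate(X) if row[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           depth + 1, max_depth, min_samples),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            depth + 1, max_depth, min_samples),
    }

def predict(tree, row):
    """Follow the conditions from the root down to a leaf."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Invented (Age, Estimated Salary) rows with a 0/1 Purchased label.
X = [[19, 19000], [35, 20000], [26, 43000],
     [47, 25000], [45, 26000], [48, 29000]]
y = [0, 0, 0, 1, 1, 1]
tree = build_tree(X, y)
```

On this invented data a single condition on the first feature (Age) separates the two classes, so the tree stops after one split because both child nodes are pure.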
• Advantages and disadvantages of Conditional Decision Trees:
o Advantages: