VIETNAM GENERAL CONFEDERATION OF LABOR
TON DUC THANG UNIVERSITY
FACULTY OF INFORMATION TECHNOLOGY
THE MIDDLE-TERM ESSAY
INTRODUCTION TO MACHINE LEARNING
MACHINE LEARNING’S PROBLEMS
Instructor: Mr. Le Anh Cuong
Students: Le Quang Duy - 520H0529
Tran Quoc Huy - 520H0647
Class: 20H50204
Course: 24
HO CHI MINH CITY, 2022
MIDDLE-TERM ESSAY COMPLETED AT TON DUC THANG UNIVERSITY
I hereby declare that this is my own report, prepared under the guidance of Mr. Le Anh Cuong. The research contents and results in this topic are honest and have not been published in any form before. The data in the tables used for analysis, comments, and evaluation were collected by the author himself from different sources, as clearly stated in the reference section.
In addition, the report also uses a number of comments and assessments, as well as data from other authors, agencies, and organizations, with citations and source annotations.
If any fraud is found, I take full responsibility for the content of my report. Ton Duc Thang University is not involved in any copyright violations caused by me during the implementation process (if any).
Ho Chi Minh City, 16 October 2022
Author
Le Quang Duy
TEACHER’S CONFIRMATION AND ASSESSMENT SECTION
Confirmation section of the instructors
Ho Chi Minh City, day month year
(sign and write full name)
Evaluation section of the lecturer who marks the report
Ho Chi Minh City, day month year
(sign and write full name)
SUMMARY
In this report, we will discuss basic methods for machine learning.
In chapter 2, we will practice solving a classification problem with 3 different models (Naive Bayes, k-Nearest Neighbors, and Decision Tree) and compare these models based on the metrics: accuracy, precision, recall, f1-score for each class, and the weighted average of the f1-score over all the data.
In chapter 3, we will discuss, work on, and visualize the Feature Selection problem and the way “correlation” is used for it.
In chapter 4, we will present the theory, the code implementation, and an illustration of 2 optimization algorithms (Stochastic Gradient Descent and the Adam optimization algorithm).
TABLE OF CONTENTS
ACKNOWLEDGEMENT
MIDDLE-TERM ESSAY COMPLETED AT TON DUC THANG UNIVERSITY
TEACHER’S CONFIRMATION AND ASSESSMENT SECTION
2.2.1 Naive Bayes model:
2.2.2 k-Nearest Neighbors model:
2.2.3 Decision Tree model:
2.3 Comparing:
2.3.1 Reporting from Naive Bayes Model:
2.3.2 Reporting from k-Nearest Neighbors Model:
2.3.3 Reporting from Decision Tree Model:
3.1 What is correlation? [1]
3.2 How it works to help?[1]
3.3 Solving linear regression’s problem:
LIST OF ABBREVIATIONS
LIST OF DIAGRAMS, CHARTS, AND TABLES
CHAPTER 1: INTRODUCTION
In this report, we divide the work into 3 problems, presented across 4 chapters.
- In chapter 1, we will introduce the outline of the report.
- In chapter 2, we will present 3 models: Naive Bayes Classification, k-Nearest Neighbors, and Decision Tree. For each model, we do a common preparation before training and testing. We split the data into 2 sets: training (75%) and testing (25%), and compare the models on the metrics: accuracy, precision, recall, f1-score, and the weighted average of the f1-score.
- In chapter 3, we will answer 2 questions: what it is and how it works; that is, we will present the theory of “correlation” in feature selection and solve the Boston house-pricing regression problem.
- In chapter 4, we will present the theory of the Adam and Stochastic Gradient Descent algorithms and show our code for each algorithm.
CHAPTER 2: PROBLEM 1
2.1 Common preparation for the 3 models:
- In this chapter, we solve the problem with 3 models: Naive Bayes, k-Nearest Neighbors, and Decision Tree.
- We used the “iris” data set to illustrate the 3 models.
- First of all, we prepare to collect the data by reading the file “iris.data”:
import pandas as pd
from google.colab import files
file = files.upload()

Output: Saving iris.data to iris.data (4551 bytes, last modified: 3/10/2022)
from sklearn.model_selection import train_test_split
# Split into 2 random sets: 75% training set, 25% test set
Description: We took 149 rows, used the first 4 columns as features, and split them into 75% for training and 25% for testing, stored in the variables x_train, x_test, y_train, y_test.
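For reference, the read-and-split step could look like the following minimal sketch (reading iris.data without a header row and the pandas indexing are our assumptions; the variable names match the description above):

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the raw CSV; iris.data has no header row
df = pd.read_csv("iris.data", header=None)

# The first 4 columns are the features, the 5th column is the class label
x = df.iloc[:149, 0:4]
y = df.iloc[:149, 4]

# 75% training / 25% testing split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)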
2.2 Execute models:
2.2.1 Naive Bayes model:
- Training time: less than 1 second to train the data.
from sklearn.naive_bayes import MultinomialNB

NB = MultinomialNB()
# Training
NB.fit(x_train, y_train)

Output: MultinomialNB()
- Predicting time: less than 1 second to predict.

# Predict results for x_test
y_predict = NB.predict(x_test)

Conclusion: We found only 3 misclassified samples after running this model.
2.2.2 k-Nearest Neighbors model:
- Training time: less than 1 second to train the data.
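The training and prediction steps are analogous to the previous model; a minimal sketch, assuming scikit-learn’s KNeighborsClassifier with 5 neighbors (the neighbor count is our assumption):

from sklearn.neighbors import KNeighborsClassifier

# Training
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)

# Predict results for x_test
y_predict = knn.predict(x_test)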
Conclusion: We found only 3 misclassified samples after running this model.
2.2.3 Decision Tree model:
- Training time: less than 1 second to train the data.
from sklearn.tree import DecisionTreeClassifier

dTree = DecisionTreeClassifier(max_depth=2)
dTree.fit(x_train, y_train)
Predicted labels (sample): Iris-versicolor, Iris-versicolor, Iris-setosa, Iris-setosa, Iris-setosa, Iris-setosa, Iris-virginica, Iris-versicolor, Iris-versicolor, Iris-virginica, ...
Conclusion: Weighted f1-score of the data: 92%
2.3.2 Reporting from k-Nearest Neighbors Model:
2.3.3 Reporting from Decision Tree Model:
[Classification report table for the Decision Tree model: per-class precision, recall, and f1-score]
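The per-class precision, recall, and f1-score values and the weighted averages reported in this section can be produced with scikit-learn’s classification_report; a minimal sketch using the variable names from this chapter:

from sklearn.metrics import classification_report

# y_test holds the true labels, y_predict the predictions of the model being evaluated
print(classification_report(y_test, y_predict))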
3.2 How it works to help? [1]
Highly correlated features are more linearly dependent and hence affect the dependent variable almost equally. We can thus exclude one of the two features when there is a substantial correlation between them.
For example, we used the Boston house-pricing data set, available in the scikit-learn library, for the analysis:
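A minimal sketch of the loading step, assuming the load_boston helper that scikit-learn still shipped when this report was written (it was removed in scikit-learn 1.2):

import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["MEDV"] = boston.target  # median house value, used as the target column
df.head()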
After loading data, we have:
Columns: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, MEDV
X = df.drop("MEDV", axis=1)
y = df["MEDV"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=9, test_size=0.3)
We used a “heatmap” to visualize the data:
[Figure: correlation heatmap of the features CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT]
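A minimal sketch of how such a heatmap can be drawn with seaborn (the figure size and color map are our choices):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 10))
# Correlation matrix of the training features, annotated with the coefficients
sns.heatmap(X_train.corr(), annot=True, cmap="coolwarm")
plt.show()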
As we can see, the number in each square is the degree to which two features correlate, so we can reject one of each highly correlated pair. In this instance, the “TAX” column with the “RAD” row reaches 0.91, i.e. a correlation of up to 91%, so we can remove one of them from the data set. Commonly used thresholds range from 70% to 90%. In this situation, we used 70% as the threshold to reject unnecessary attributes.
Our correlation function:
col_corr = set()
corr_matrix = dataset.corr()
It returns a set of the names of the columns to reject, which we then remove to prepare our data set:
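A complete sketch of such a function, following the approach described in [1]; the helper name correlation and the loop details are our reconstruction:

def correlation(dataset, threshold):
    col_corr = set()              # names of the columns to drop
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # If two features correlate above the threshold, mark one of them for removal
            if abs(corr_matrix.iloc[i, j]) > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr

# Reject the highly correlated attributes (70% threshold) from both sets
corr_features = correlation(X_train, 0.7)
X_train = X_train.drop(list(corr_features), axis=1)
X_test = X_test.drop(list(corr_features), axis=1)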
3.3 Solving linear regression’s problem:
Finally, we solve this problem with linear regression:
from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
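The fitting step itself, as a minimal sketch (the variable name lr matches its use below; the RMSE check is our addition):

# Fit the linear regression model on the reduced training set
lr = LinearRegression()
lr.fit(X_train, y_train)

# Root mean squared error on the training set, as a quick sanity check
print("RMSE train score", sqrt(mean_squared_error(y_train, lr.predict(X_train))))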
y_pred = lr.predict(X_test)

preds_df = pd.DataFrame(dict(observed=y_test, predicted=y_pred))
preds_df.head()
Checking MAE test score:
print("MAE test score", mean_absolute_error(y_test, y_pred))

Output: MAE test score 3.609904060381819
For the Stochastic Gradient Descent algorithm, consider a simple example of predicting a height from a weight. As we can see, we have to fit a straight line, and we have a formula to predict a height:
Predicted Height = Intercept + Slope × Weight (1)
In this instance, we can see 3 clusters of points, and we randomly choose intercept = 0 and slope = 1.
Sum of squared residuals = (Observed Height - Predicted Height)² (2)
Substituting (1) into (2), we have:
Sum of squared residuals = (Observed Height - (Intercept + Slope × Weight))²
We have to calculate the derivative of the sum of squared residuals with respect to the intercept and the slope:
d(Sum of squared residuals)/d(Intercept) = -2 × (Observed Height - (Intercept + Slope × Weight))
d(Sum of squared residuals)/d(Slope) = -2 × Weight × (Observed Height - (Intercept + Slope × Weight))
We can randomly pick 1 sample to calculate the derivatives:
With the sample Weight = 3 and Height = 3.3:
d(Sum of squared residuals)/d(Intercept) = -2 × (3.3 - (0 + 1 × 3)) = -0.6
d(Sum of squared residuals)/d(Slope) = -2 × 3 × (3.3 - (0 + 1 × 3)) = -1.8
We can easily calculate the step size to improve the line:
Step size (intercept) = d(Sum of squared residuals)/d(Intercept) × learning rate
Step size (slope) = d(Sum of squared residuals)/d(Slope) × learning rate
We start with a relatively large learning rate and make it smaller with each step.
In this example, we chose 0.01 for the learning rate:
Step size (intercept) = d(Sum of squared residuals)/d(Intercept) × learning rate = -0.6 × 0.01 = -0.006
Step size (slope) = d(Sum of squared residuals)/d(Slope) × learning rate = -1.8 × 0.01 = -0.018
New intercept = Old intercept - Step size (intercept) = 0 - (-0.006) = 0.006
New slope = Old slope - Step size (slope) = 1 - (-0.018) = 1.018
We now have a new line:
Repeating the update gives a new line after each step:
And we can stop at intercept = 0.85 and slope = 0.68 in this instance:
[Figure: the final fitted line of Height against Weight]
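The whole update loop for this toy example fits in a few lines; a minimal sketch, assuming the single sample Weight = 3, Height = 3.3 picked above, a fixed learning rate of 0.01, and 100 iterations:

weight, height = 3.0, 3.3      # the randomly picked sample from above
intercept, slope = 0.0, 1.0    # initial guesses
learning_rate = 0.01

for _ in range(100):
    predicted = intercept + slope * weight
    # Derivatives of the squared residual with respect to intercept and slope
    d_intercept = -2 * (height - predicted)
    d_slope = -2 * weight * (height - predicted)
    # The first iteration gives step sizes -0.006 and -0.018, as computed above
    intercept -= learning_rate * d_intercept
    slope -= learning_rate * d_slope

print(intercept, slope)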
Trang 27self.learning rate = learning rate
def fit(self, X, y):
rgen = np.random.RandomState(self.random_state)
self.coef_ = rgen.normal(loc=0.@, scale=0.01, size=1 + X.shape[1])
for _ in range(self.n_ iterations):
for xi, expected_value in zip(X, y):
predicted_value = self.predict(xi)
self.coef_[1:] += self.learning_rate * (expected_value - predicted_value) * xi self.coef_[0] += self learning rate * (expected_value - predicted value) * 1 def activation(self, X):
return np.dot(x, self.coef_[1:]) + self.coef [@]
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
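A minimal sketch of how the class above could be run on this data set; standardizing the features, the hyperparameter values, and the 0.5 threshold on the linear output are our assumptions:

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3)

# Standardize the features so one learning rate works for all of them
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train, X_test = (X_train - mean) / std, (X_test - mean) / std

model = SGD(learning_rate=0.001, n_iterations=50, random_state=1)
model.fit(X_train, y_train)

# Threshold the linear output at 0.5 to turn it into class predictions (labels are 0/1)
accuracy = ((model.predict(X_test) >= 0.5) == y_test).mean()
print("Accuracy:", accuracy)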
Adam is an algorithm for optimizing stochastic objective functions, based on first-order gradients and adaptive estimates of lower-order moments. It is a very efficient method when only first-order gradients are required, and it has low memory requirements. The method is also suitable for non-stationary objectives and for problems with noisy or sparse gradients.
Pseudocode for the Adam algorithm:
Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
m0 ← 0 (Initialize 1st moment vector)
v0 ← 0 (Initialize 2nd moment vector)
t ← 0 (Initialize timestep)
while θt not converged do:
    t ← t + 1
    gt ← ∇θ ft(θt-1) (Get gradients w.r.t. the stochastic objective at timestep t)
    mt ← β1 · mt-1 + (1 - β1) · gt (Update biased first moment estimate)
    vt ← β2 · vt-1 + (1 - β2) · gt² (Update biased second raw moment estimate)
    m̂t ← mt / (1 - β1^t) (Compute bias-corrected first moment estimate)
    v̂t ← vt / (1 - β2^t) (Compute bias-corrected second raw moment estimate)
    θt ← θt-1 - α · m̂t / (√v̂t + ε) (Update parameters)
end while
return θt (Resulting parameters)
4.2.2 Show code:
# Test objective function
return math.log(1 + (abs(x)) ** (2 + math.sin(x)))

# Numerical gradient helper
def general_grad(theta, function):

# Adam update inside the optimization loop
vt = beta2 * vt + (1 - beta2) * (np.power(gt, 2))        # biased second raw moment estimate
m_up = np.true_divide(mt, (1 - (beta1 ** (i + 1))))      # bias-corrected first moment
v_up = np.true_divide(vt, (1 - (beta2 ** (i + 1))))      # bias-corrected second moment
theta_new = theta[-1] - np.true_divide(alpha * m_up, (np.sqrt(v_up) + np.ones(theta[-1].shape[0]) * eps))
theta.append(theta_new)
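Since only fragments of the listing appear above, here is a complete minimal sketch of the Adam loop on the same test function. The gradient helper and the hyperparameter defaults (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8) are assumptions, not the exact original code:

import math

def f(x):
    # Test function from the fragments above
    return math.log(1 + (abs(x)) ** (2 + math.sin(x)))

def numerical_grad(x, func, h=1e-6):
    # Central-difference approximation of the derivative
    return (func(x + h) - func(x - h)) / (2 * h)

def adam(func, x0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_iterations=1000):
    theta = x0
    mt, vt = 0.0, 0.0
    for t in range(1, n_iterations + 1):
        gt = numerical_grad(theta, func)
        mt = beta1 * mt + (1 - beta1) * gt          # biased first moment estimate
        vt = beta2 * vt + (1 - beta2) * gt ** 2     # biased second raw moment estimate
        m_hat = mt / (1 - beta1 ** t)               # bias corrections
        v_hat = vt / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Usage: start from x0 = 2.0 and minimize f
print(adam(f, 2.0))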
REFERENCES
[1] https://www.kaggle.com/code/bbloggsbott/feature-selection-correlation-and-p-value/notebook