The Middle-Term Essay: Introduction to Machine Learning (Machine Learning's Problems)


VIETNAM GENERAL CONFEDERATION OF LABOR
TON DUC THANG UNIVERSITY
FACULTY OF INFORMATION TECHNOLOGY

THE MIDDLE-TERM ESSAY
INTRODUCTION TO MACHINE LEARNING
MACHINE LEARNING'S PROBLEMS

Instructor: Mr. Le Anh Cuong
Students: Le Quang Duy - 520H0529
          Tran Quoc Huy - 520H0647
Class: 20H50204
Course:

HO CHI MINH CITY, 2022

ACKNOWLEDGEMENT

Sincere gratitude to Mr. Le Anh Cuong and to my partner Tran Quoc Huy for their help during the machine learning semester. Mr. Cuong's practical lectures, paired with theory, helped me master the principles of machine learning, and he taught with great enthusiasm; Tran Quoc Huy supported me throughout the work. Please accept my heartfelt gratitude once more.

MIDDLE-TERM ESSAY COMPLETED AT TON DUC THANG UNIVERSITY

I hereby declare that this is my own report, completed under the guidance of Mr. Le Anh Cuong. The research contents and results of this topic are honest and have not been published in any form before. The data in the tables used for analysis, comments, and evaluation were collected by the author himself from different sources, as clearly stated in the reference section. In addition, the report also uses a number of comments, assessments, and data from other authors, agencies, and organizations, with citations and source annotations. If any fraud is discovered, I take full responsibility for the content of my report. Ton Duc Thang University is not involved in any copyright violations caused by me during the implementation process (if any).

Ho Chi Minh City, 16 October 2022
Author
Le Quang Duy

TEACHER'S CONFIRMATION AND ASSESSMENT SECTION

Confirmation section of the instructors
_
Ho Chi Minh City, day month year
(sign and write full name)

Evaluation section of the lecturer marking the report
_
Ho Chi Minh City, day month year
(sign and write full name)

SUMMARY

In this report, we discuss basic methods of machine learning. In Chapter 2, we solve a classification problem with three different models (Naive Bayes, k-Nearest Neighbors, and Decision Tree) and compare them on the metrics accuracy, precision, recall, and f1-score for each class, as well as the weighted average f1-score over all the data. In Chapter 3, we discuss, implement, and visualize the feature selection problem and the way correlation works within it. In Chapter 4, we present the theory, the code implementation, and illustrations for two optimization algorithms: Stochastic Gradient Descent and the Adam optimization algorithm.

TABLE OF CONTENTS

ACKNOWLEDGEMENT
MIDDLE-TERM ESSAY COMPLETED AT TON DUC THANG UNIVERSITY
TEACHER'S CONFIRMATION AND ASSESSMENT SECTION
SUMMARY
LIST OF ABBREVIATIONS
LIST OF DIAGRAMS, CHARTS, AND TABLES
CHAPTER 1: INTRODUCTION
CHAPTER 2: PROBLEM
  2.1 Common preparation for the models
  2.2 Executing the models
    2.2.1 Naive Bayes model
    2.2.2 k-Nearest Neighbors model
    2.2.3 Decision Tree model
  2.3 Comparison
    2.3.1 Report from the Naive Bayes model
    2.3.2 Report from the k-Nearest Neighbors model
    2.3.3 Report from the Decision Tree model
CHAPTER 3: PROBLEM
  3.1 What is correlation?
  3.2 How does correlation help?
  3.3 Solving the linear regression problem
CHAPTER 4: PROBLEM
  4.1 Stochastic Gradient Descent
    4.1.1 Theory
    4.1.2 Show code
  4.2 Adam Optimization Algorithm
    4.2.1 Theory
    4.2.2 Show code
REFERENCES

LIST OF ABBREVIATIONS

LIST OF DIAGRAMS, CHARTS, AND TABLES

CHAPTER 1: INTRODUCTION

In this report, we divide the work into chapters, one per problem.

In Chapter 1, we introduce the outline of the report.

In Chapter 2, we present three models: Naive Bayes classification, k-Nearest Neighbors, and Decision Tree. For each model, we perform a common preparation step before training and testing. We then split the data into two sets, training (75%) and testing (25%), and compare the models on the following metrics: accuracy, precision, recall, f1-score, and the weighted average f1-score.

In Chapter 3, we answer two questions, what correlation is and how it works; that is, we present the theory of correlation in feature selection and solve the Boston house-pricing regression problem.

In Chapter 4, we present the theory of the Stochastic Gradient Descent and Adam optimization algorithms and show our code for each.

2.3.3 Report from the Decision Tree model

Conclusion: the weighted f1-score over the data is 87%.

CHAPTER 3: PROBLEM

To solve this problem, we have to answer two kinds of questions:
_ What is correlation?
_ How does correlation help in feature selection?

3.1 What is correlation? [1]

The statistical concept of correlation is frequently used to describe how nearly linear the relationship between two variables is. For instance, two linearly dependent variables such as x and y will have a larger correlation than two variables that depend on each other nonlinearly, such as u and v with u = v^2.

3.2 How does correlation help? [1]

Highly correlated features are more linearly dependent and hence affect the dependent variable almost equally. We can therefore exclude one of two features whenever there is a substantial correlation between them.

As an example, we analyzed the Boston house-pricing dataset available in the scikit-learn library. After loading the data, we divided it into two sets: training (70%) and testing (30%). We then used a heatmap to visualize the correlation matrix. The number in each square shows how strongly the two attributes correlate, so when it is high we can reject one of them. In this instance, the "tax" column against the "rad" row reaches 0.91, a correlation of 91%, so we can remove one of the two from the dataset. Thresholds between 70% and 90% are commonly used; here we chose 70% as the threshold for rejecting unnecessary attributes. Our correlation function returns the set of names of the rejected attributes, which we then drop from the dataset. After the rejection, we lose three attributes and keep only 10 columns (13 columns before).

3.3 Solving the linear regression problem

Finally, we solve the problem with linear regression and predict values on the test set.
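The contrast drawn in section 3.1 can be checked numerically. The sketch below is an illustrative addition (not part of the original report): it uses NumPy's `corrcoef` to compare the Pearson correlation of a noisy linear pair x, y against the nonlinear pair u = v^2 from the text. The sample size, noise level, and coefficient 3.0 are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Linearly dependent pair: y is a noisy linear function of x.
x = rng.uniform(-1.0, 1.0, n)
y = 3.0 * x + rng.normal(0.0, 0.1, n)

# Nonlinearly dependent pair from the text: u = v^2.
v = rng.uniform(-1.0, 1.0, n)
u = v ** 2

r_xy = np.corrcoef(x, y)[0, 1]  # close to 1: strong linear relationship
r_uv = np.corrcoef(u, v)[0, 1]  # close to 0, despite full dependence
print(f"corr(x, y) = {r_xy:.3f}")
print(f"corr(u, v) = {r_uv:.3f}")
```

Because v is symmetric around zero, the covariance between v and v^2 vanishes, so Pearson correlation misses the dependence entirely; this is exactly why correlation measures linear association only.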

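The chapter-3 procedure (reject one feature from each pair whose correlation exceeds the 0.7 threshold, then fit a linear regression on a 70/30 split) can be sketched as below. Note that recent scikit-learn releases (1.2+) no longer ship the Boston housing dataset, so this sketch uses a small synthetic DataFrame in its place; the helper name `correlated_features`, the column names, and the data-generating coefficients are all illustrative assumptions, not the report's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def correlated_features(df: pd.DataFrame, threshold: float = 0.7) -> set:
    """Return names of columns to reject: for every pair whose absolute
    correlation exceeds the threshold, keep the column seen first and
    mark the later one for removal."""
    corr = df.corr().abs()
    rejected = set()
    for i in range(len(corr.columns)):
        for j in range(i):
            if corr.iloc[i, j] > threshold:
                rejected.add(corr.columns[i])
    return rejected

# Synthetic stand-in for the housing features; "tax" is built to be
# strongly correlated with "rad", mirroring the 0.91 pair in the text.
rng = np.random.default_rng(42)
n = 500
rad = rng.uniform(1, 24, n)
X = pd.DataFrame({
    "rad": rad,
    "tax": 20 * rad + rng.normal(0, 5, n),
    "rm": rng.normal(6, 0.7, n),
    "lstat": rng.uniform(2, 35, n),
})
y = 5 * X["rm"] - 0.5 * X["lstat"] + rng.normal(0, 1, n)

to_drop = correlated_features(X, threshold=0.7)
print("rejected:", to_drop)  # "tax" is flagged (correlated with "rad")

# Drop the rejected columns, split 70/30 as in the text, and fit.
X_kept = X.drop(columns=list(to_drop))
X_train, X_test, y_train, y_test = train_test_split(
    X_kept, y, train_size=0.7, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("test R^2:", round(model.score(X_test, y_test), 3))
```

Keeping the first column of each correlated pair is an arbitrary but common tie-break; since the two features carry nearly the same linear information, either choice leaves the regression essentially unchanged.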