VIETNAM – KOREA UNIVERSITY OF INFORMATION AND COMMUNICATION TECHNOLOGY
FACULTY OF COMPUTER SCIENCE
MACHINE LEARNING
HEART DISEASE PREDICTION
Student: NGUYEN PHUOC THINH
Supervising lecturer: D.Sc NGUYEN SI THIN
Class: 20SE2
Da Nang, June 2023

ACKNOWLEDGEMENT
I would like to express my sincere gratitude to the university and the Faculty of Computer Science for their dedicated guidance, feedback, and support in helping me complete my research project for the "Machine Learning" course. I am grateful to D.Sc Nguyen Si Thin for his enthusiastic guidance throughout the preparation of this research report. I worked hard on this project, but I understand that it may still have shortcomings. I hope to receive valuable feedback from my professors and the evaluation committee so that I can further improve my research work. I sincerely thank you!
Da Nang, June 2023

TABLE OF CONTENTS
Chapter 1 Theoretical basis
1.1 Overview of Machine Learning
1.1.1 Introduction to Machine Learning
1.1.2 Importance of Machine Learning
1.1.3 Types of Machine Learning
1.1.4 Applications of Machine Learning
1.2 Some Machine Learning Algorithms
1.2.1 Logistic Regression
1.2.2 K-Nearest Neighbors
1.2.3 Support Vector Machines
1.2.4 Random Forests
1.3 Some model evaluation parameters in machine learning
1.3.1 Accuracy
1.3.2 Time train
1.3.3 Time test
1.4 Overview of tools used to build a demo website
1.4.1 Python Django Framework
1.4.2 SQLite
Chapter 2 Build predictive data models
2.1 Data analysis
2.1.1 Data description
2.1.2 Data preprocessing
2.1.3 Qualitative analysis of data
2.1.4 Quantitative analysis of data
2.1.5 Relationship between attributes
2.2 Preparation of data sets
2.2.1 Data separation: Features - Target
2.2.2 Data separation: Train - Test
2.3 Using machine learning algorithms for data prediction
2.3.1 Using Logistic regression
2.3.2 Using K-nearest neighbors
2.3.3 Using Support vector machines
2.3.4 Using Random forests
2.4 Comparison and Conclusion
Chapter 3 Demo program
3.1 Prediction screen
3.2 Predicted Results screen
3.3 Prediction result record
3.4 Detailed record of prediction results
REFERENCES

LIST OF TABLES
Table 1 - Data description
Table 2 - Compare machine learning models

LIST OF FIGURES
Figure 1 - Overview of Machine learning
Figure 2 - Importance of Machine learning
Figure 3 - Supervised learning
Figure 4 - Unsupervised learning
Figure 5 - Reinforcement learning
Figure 6 - Application of Machine learning
Figure 7 - Logistic Regression
Figure 8 - K-Nearest Neighbors
Figure 9 - Support vector machines
Figure 10 - Random forests
Figure 11 - Confusion matrix
Figure 12 - Python Django framework
Figure 13 - SQLite
Figure 14 - Check missing data
Figure 15 - Delete rows containing missing data
Figure 16 - Check heterogeneity of the target column
Figure 17 - Visualize with bar and pie charts
Figure 18 - Histogram chart
Figure 19 - Correlation matrix graph
Figure 20 - Target DataFrame
Figure 21 - Shape of data
Figure 22 - Train using Logistic Regression
Figure 23 - Result with LR
Figure 24 - Train using KNN
Figure 25 - Result with KNN
Figure 26 - Train using SVM
Figure 27 - Result with SVM
Figure 28 - Train using Random Forests
Figure 29 - Result with RF
Figure 30 - Prediction screen
Figure 31 - Predicted results screen
Figure 32 - Prediction result record
Figure 33 - Detailed record of prediction results

LIST OF ABBREVIATIONS
ID | Abbreviation | Phrase
1 | SVM | Support Vector Machines
2 | KNN | K-Nearest Neighbors
3 | CRUD | Create-Read-Update-Delete
4 | RDBMS | Relational Database Management System
5 | URL | Uniform Resource Locator

Chapter 1 Theoretical basis
1.1 Overview of Machine Learning
1.1.1 Introduction to Machine Learning
Figure 1 - Overview of Machine learning
Machine Learning is a field of study that focuses on the development of algorithms and models that enable computer systems to learn from data and make predictions or decisions without being explicitly programmed. It uses statistical techniques and computational methods to extract meaningful patterns and insights from large datasets.

1.1.2 Importance of Machine Learning
Figure 2 - Importance of Machine learning
Machine Learning has gained significant importance in various domains due to its ability to automate tasks, make accurate predictions, and provide valuable insights. It is widely used in areas such as image and speech recognition, natural language processing, recommendation

1.4.2 SQLite
Figure 13 - SQLite
SQLite is a lightweight, serverless, and self-contained relational database management system (RDBMS) that is widely used in web development. It is seamlessly integrated into the Django framework and serves as the default database backend. SQLite offers a simple
and efficient way to store and retrieve data using SQL queries. It stores the entire database as a single file, making it easy to distribute and deploy. SQLite is well suited for small to medium-sized projects and provides good performance for most use cases.
Django abstracts the database operations, allowing you to interact with the SQLite database using Python objects and methods. It handles the creation of database tables, the execution of queries, and data migrations, making it convenient for web development.
SQLite is suitable for development and testing environments, or for scenarios where low to moderate concurrent access is expected. For high-traffic production websites, other database backends such as PostgreSQL or MySQL may be preferred.

Chapter 2 Build predictive data models
2.1 Data analysis
2.1.1 Data description
This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to using a subset of 14 of them. The "target" field refers to the presence of heart disease in the patient. It is integer valued: 0 = no disease and 1 = disease.
Dataset: 13 attributes, 1 predicted attribute, 303 data rows.

Table 1 - Data description
ID | Attribute Name | Attribute Description | Notes
1 | target | cardiovascular disease status (0-no disease, 1-disease) | predicted attribute
2 | age | age |
3 | sex | gender (0-female, 1-male) |
4 | cp | chest pain type (0-typical angina, 1-atypical angina, 2-non-anginal pain, 3-asymptomatic) |
5 | trestbps | resting blood pressure |
6 | chol | serum cholesterol in mg/dl |
7 | fbs | fasting blood sugar (0-normal: <=120 mg/dl, 1: >120 mg/dl) |
8 | restecg | resting electrocardiographic results (0-normal, 1-heart muscle, 2-enlargement of the heart's left ventricle) |
9 | thalach | maximum heart rate achieved |
10 | exang | exercise induced angina (0-no, 1-yes) |
11 | oldpeak | ST depression induced by exercise relative to rest | float
12 | slope | the slope of the peak exercise ST segment (0-upsloping, 1-flat, 2-downsloping) |
13 | ca | number of major vessels (0-3) colored by fluoroscopy |
14 | thal | thalassemia (0-normal, 1-fixed defect, 2-reversible defect) |

2.1.2 Data preprocessing
Check each column for "null" values, then sum the counts:
Figure 14 - Check missing data
Delete the rows containing "null" values (if any). In this case, the data set in use does not contain any "null" values:
Figure 15 - Delete rows containing missing data
Check the target column for heterogeneity, then print the result:
Figure 16 - Check heterogeneity of the target column

2.1.3 Qualitative analysis of data
Count the occurrences of each attribute value, then visualize them with bar and pie charts:
Figure 17 - Visualize with bar and pie charts

2.1.4 Quantitative analysis of data
Compute summary statistics (mean, variance, maximum value, minimum value), then print them and visualize the distributions with histograms:
Figure 18 - Histogram chart

2.1.5 Relationship between attributes
Use Python's function to calculate the correlation matrix, then visualize it with a heatmap:
Figure 19 - Correlation matrix graph

2.2 Preparation of data sets
2.2.1 Data separation: Features - Target
Use the "drop()" function to split the dataset into a set of features and a target:
Figure 20 - Target DataFrame

2.2.2 Data separation: Train - Test
Use the function "train_test_split()" to divide the dataset into train and test sets at a ratio of 80% to 20%:
Figure 21 - Shape of data

2.3 Using machine learning algorithms for data prediction
2.3.1 Using Logistic regression
Figure 22 - Train using Logistic Regression
Run prediction with data [50, 1, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 0, 0]:
Figure 23 - Result with LR

2.3.2 Using K-nearest neighbors
Figure 24 - Train using KNN
Run prediction with data [50, 1, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 0, 0]:
Figure 25 - Result with KNN

2.3.3 Using Support vector machines
Figure 26 - Train using SVM
Run prediction with data [50, 1, 0, 140, 268, 0, 0, 160, 0, 3.6,
0, 0, 0]:
Figure 27 - Result with SVM

2.3.4 Using Random forests
Figure 28 - Train using Random Forests
Run prediction with data [50, 1, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 0, 0]:
Figure 29 - Result with RF

2.4 Comparison and Conclusion
Table 2 - Compare machine learning models
Model | Time Train | Accuracy Train | Accuracy Test
Logistic Regression | 0.03 seconds | 85.12% | 81.97%
KNN | 0.01 seconds | 78.10% | 62.30%
SVM | 0.01 seconds | 69.42% | 62.30%
Random Forest | 0.17 seconds | 100.00% | 78.69%

Based on the above results, the Logistic Regression model is the best choice for this problem: it has the highest test accuracy (81.97%), its training accuracy (85.12%) is close to its test accuracy, it trains quickly, and it is simpler than the other models. Random Forest reaches 100.00% on the training set but only 78.69% on the test set, which suggests overfitting.

Chapter 3 Demo program
3.1 Prediction screen
Users can enter the values of the 13 attributes and press the "TEST NOW" button to predict their cardiovascular disease risk.
Figure 30 - Prediction screen

3.2 Predicted Results screen
The results are displayed with two parameters, the probability of disease and the possibility of disease:
- Probability: 0-1 (0%-100%)
- Possibility: No Disease (target = 0), Disease (target = 1)
Figure 31 - Predicted results screen

3.3 Prediction result record
The prediction results are stored so that the user can review them.
Figure 32 - Prediction result record

3.4 Detailed record of prediction results
The user can view the feature values that were entered for each disease prediction.
Figure 33 - Detailed record of prediction results

REFERENCES
[1] - https://chat.openai.com/
[2] - https://www.youtube.com/
[3] - https://www.python.org/
[4] - https://code.visualstudio.com/
[5] - https://colab.research.google.com/
[6] - https://www.djangoproject.com/
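
As a companion to sections 2.2.2 and 2.4, the following is a minimal sketch of an 80%/20% split and of the train-versus-test accuracy comparison reported in the comparison table. It is not the code behind Figures 21 to 29, which is not reproduced in this text. It assumes only the Python standard library, uses a hand-written K-nearest-neighbors vote in place of a library classifier, and runs on synthetic two-feature data rather than the heart disease data set. The names train_test_split_rows, knn_predict, accuracy, and make_synthetic_rows are hypothetical helpers introduced for illustration.

```python
import random
import time
from collections import Counter


def train_test_split_rows(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, features, k=5):
    """Predict a label by majority vote among the k nearest training rows."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, rows, k=5):
    """Fraction of rows whose predicted label equals the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in rows)
    return correct / len(rows)


def make_synthetic_rows(n=303, seed=0):
    """Assumed stand-in data: two noisy numeric features and a 0/1 target."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        target = rng.randint(0, 1)
        # Class 1 is shifted so the two classes overlap only partially.
        features = (rng.gauss(target * 1.5, 1.0), rng.gauss(target * 1.5, 1.0))
        rows.append((features, target))
    return rows


rows = make_synthetic_rows()
train, test = train_test_split_rows(rows)
print(f"train rows: {len(train)}, test rows: {len(test)}")

for k in (1, 5):
    start = time.perf_counter()
    train_acc = accuracy(train, train, k)
    test_acc = accuracy(train, test, k)
    elapsed = time.perf_counter() - start
    print(f"k={k}: accuracy train {train_acc:.2%}, "
          f"accuracy test {test_acc:.2%}, time {elapsed:.2f} seconds")
```

With 303 rows, an 80%/20% split gives 242 training rows and 61 test rows, and the accuracies in the comparison table are consistent with those sizes (for example, 50/61 = 81.97%). With k=1 every training row is its own nearest neighbor, so the training accuracy is 100% regardless of how well the model generalizes. This is one reason the test-set column is the more informative one when comparing models.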