INTRODUCTION TO ARTIFICIAL INTELLIGENCE
AI FOR DIABETES PREDICTION
Instructor: Phạm Văn Hải
Group 13: Trần Thị Lan Anh - 20215180
Dương Văn Hữu - 20215210
Võ Văn Thanh - 20215242
Vũ Đình Vũ - 20215258
Nguyễn Đình Vũ - 20215257
Hanoi - May, 2023
TABLE OF CONTENTS
1 PROBLEM STATEMENT
1.1 Background and Motivation
1.2 Objectives
2 MAIN PURPOSE
3 DETAILED DESIGN
3.1 Libraries
3.2 Dataset
3.3 Training model
3.3.1 Preparation
3.3.2 KNN - K nearest neighbours
3.3.3 Logistic regression
3.3.4 Decision Tree
3.3.5 SVM
3.3.6 Evaluation
3.4 Feature engineering
3.4.1 Correlation Matrix
3.4.2 Random Forest Classifier
3.4.3 From Bagging to Random Forest
4 EXPERIMENTAL RESULTS
5 DEMO APP
6 EVALUATION
7 CONCLUSION
7.1 Conclusion
7.1.1 Lesson learned
7.1.2 Future work
7.2 References
7.3 Setting environment
7.4 Steps implementation
LIST OF FIGURES
1 Example of the dataset
2 Dataset statistical table
3 Column chart
4 The outcome
5 Data
6 The dataset after zoom out
7 An example with k = 3
8 Effectiveness of each k value
9 Decision tree chart
10 Visualizing Decision Tree 1st
11 IG formula
12 Visualizing Decision Tree 2nd
13 Representation of Support vector machine
14 Heat map
15 The data after removal
1 PROBLEM STATEMENT
1.1 Background and Motivation
Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body's cells for use as energy.
With diabetes, your body doesn't make enough insulin or can't use it as well as it should. When there isn't enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease.
According to figures announced by the International Diabetes Federation (IDF) in 2021, the world has 537 million people with diabetes, equivalent to 1 in 10 adults aged 20-79 years old; 1 in 6 babies born is affected by diabetes during fetal development. In particular, up to 50 percent of adults with diabetes go undiagnosed.
In Vietnam, Vice Minister of Health Nguyen Thi Lien Huong said that about 5 million people are suffering from diabetes; more than 55 percent of patients with diabetes currently have complications, of which 34 percent are cardiovascular complications, 39.5 percent are eye and neurological complications, and 24 percent are kidney complications.
Although diabetes is really dangerous, there is still no cure for this disease, so an early diagnosis helps patients start treatment as soon as possible and, at the same time, lead a healthy lifestyle to control the disease and avoid complications later.
1.2 Objectives
Our group created a fascinating AI model project. Our novel model is implemented using supervised machine learning techniques on a diabetes dataset to understand patterns for the knowledge discovery process in diabetes. This initiative is the initial step in our ability to create sophisticated predictive modeling and analytics for diabetes and to contribute to the future of medical technology development.
The main responsibility of the AI predictive model is to access the data, learn from it, and gather statistics. When this process is complete, the system is able to identify and understand the input data, and then classify the patients into diabetic and non-diabetic. In this work, the model follows the concepts of the support vector machine (SVM - Linear/RBF), k-nearest neighbors (k-NN), Logistic Regression, and Decision Tree.
The AI model is used to spot patterns in behavior that lead to either high or low blood sugar levels in diabetes patients. It can change patients' lives and gives healthcare providers the opportunity to analyze real-time data. In this way, the machine learning model is used to explore a patient's statistics and then make predictions.
2 MAIN PURPOSE
3 DETAILED DESIGN
3.1 Libraries
To implement the model, we will use numpy, pandas and sklearn:
• Numpy: a Python library used for working with arrays.
• Pandas: an open-source Python package that is widely used for data science, data analysis, and machine learning tasks.
• Sklearn: the most useful and robust library for machine learning in Python.
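A minimal sanity check shows how the three libraries fit together: a NumPy array becomes a pandas DataFrame, which feeds a scikit-learn model. The column names and toy values here are placeholders, not taken from the real dataset.

```python
# A NumPy array wrapped in a pandas DataFrame, fed to a sklearn model.
# Column names and values are illustrative placeholders only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([0, 0, 1, 1])
df = pd.DataFrame(X, columns=["feature_a", "feature_b"])
model = LogisticRegression().fit(df, y)
print(model.predict(df))  # predictions for the four training rows
```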
3.2 Dataset
First, we load the data.
Figure 1: Example of the dataset
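The loading step can be sketched as follows. Normally the dataset would be read from disk with `pd.read_csv` (the file name `diabetes.csv` is an assumption); a few inline Pima-style rows stand in for the file here so the sketch runs on its own.

```python
import pandas as pd

# Normally: data = pd.read_csv("diabetes.csv")   # path is an assumption.
# Inline stand-in rows with Pima-style columns so the sketch is runnable:
data = pd.DataFrame({
    "Pregnancies":   [6, 1, 8, 1, 0],
    "Glucose":       [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age":           [50, 31, 32, 21, 33],
    "Outcome":       [1, 0, 1, 0, 1],
})
print(data.head())       # Figure 1 shows rows like these
print(data.describe())   # Figure 2: count / mean / std / min / 25% / 50% / 75% / max
```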
After some investigation, we have this dataset statistical table:
Figure 2: Dataset statistical table
• count: the number of non-empty elements in the dataset.
• mean: mean value.
• std: standard deviation of each column.
• min: minimum value.
• 25/50/75 percent: the value below which 25/50/75 percent of the observations fall.
By looking at the table, we can see that some values are missing or have a minimum equal to 0 (invalid), such as:
Then we summarize the data again in a column chart.
Figure 3: Column chart
Continuing the observation:
Figure 4: The outcome
The number of patients with diabetes is half the number of those without diabetes.
We can't really see the relationship between these two groups of data. To help models which use distance (KNN, for example), we need to rescale ("zoom out") the data.
As an example, this is Figure 4 after the changes:
Figure 5: Data
Figure 6: The dataset after zoom out
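The rescaling step can be sketched like this. Standardizing each column to zero mean and unit variance is one common way to "zoom out"; the exact scaler used in the project is an assumption, and the two toy columns stand in for Glucose and BMI, which live on very different ranges.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns such as Glucose (~0-200) and BMI (~0-60) distort distance-based
# models like KNN.  StandardScaler rescales each column to mean 0 and
# standard deviation 1; which scaler the project used is an assumption.
X = np.array([[148.0, 33.6],
              [ 85.0, 26.6],
              [183.0, 23.3],
              [ 89.0, 28.1]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # and standard deviation ~1
```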
3.3 Training model
3.3.1 Preparation
The data is split into a training set and a test set, and each label in the Outcome column is equal to 0 (no diabetes) or 1 (has diabetes).
3.3.2 KNN - K nearest neighbours
Figure 7: An example with k = 3
Among the 3 points closest to the point to be predicted, there are 2 points of class B and 1 point of class A. The majority is class B, so the model will choose class B to assign to the point to be predicted.
To train the model, we need to find which k is the most accurate, so we tested each k from 1 to 14:
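The search over k can be sketched as follows. A synthetic dataset stands in for the scaled diabetes data so the snippet runs on its own; on the real data the report found k = 11 best.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled diabetes features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in range(1, 15):                         # every k from 1 to 14
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)      # test-set accuracy

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```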
The result is:
Figure 8: Effectiveness of each k value
Clearly, k = 11 is the most accurate. The accuracy is 76.5625 percent.
3.3.3 Logistic regression
Logistic regression is one of the most common machine learning algorithms used for binary classification. It predicts the probability of occurrence of a binary outcome using a logistic function. It is closely related to linear regression, as it predicts the probabilities of the outcome through the log-odds of a linear function of the inputs.
An activation function defines the output for an input or set of inputs. The sigmoid function acts as an activation function which is used to add non-linearity to a model; in simple words, it decides which value to pass as output and which not to pass.
The sigmoid activation function is used mostly because it does its task with great efficiency: it is basically a probabilistic approach towards decision making, and its output ranges between 0 and 1, so it can be read directly as a predicted probability. The formula of the sigmoid function is sigmoid(z) = 1 / (1 + e^(-z)).
The term w0 is the bias, which skews the result of an algorithm in favor of or against an idea. Bias is also considered a systematic error that occurs in the machine learning model itself due to incorrect assumptions in the ML process.
Cost function: given that a is the actual result and y is the target result (0 or 1), the cost function shows how different they are. The formula of the cost function is
L(a, y) = -(y * log(a) + (1 - y) * log(1 - a)) (3)
The loss function is defined as the average of the cost function over all samples. The purpose of training is to decrease the value of the loss function to an acceptable value.
Logistic regression determines the probability of the default class. In this instance, we are modeling the health examination with 7 criteria:
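The training step can be sketched as follows. Synthetic data with 7 features stands in for the 7 examination criteria so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 7 features mirroring the 7 examination criteria.
X, y = make_classification(n_samples=400, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = logreg.score(X_test, y_test)
print(f"accuracy = {accuracy:.4%}")
```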
The accuracy is 73.046875 percent.
3.3.4 Decision Tree
Decision Tree is one of the easiest and most popular classification algorithms to understand and implement. Classification is a process which contains two steps: learning and predicting from the given data. By using Decision Tree classification, we can separate patients into groups based on each of their stats.
In a decision tree, starting from the root node, we make a comparison (or apply a decision rule) based on an attribute value. Then we follow the branch corresponding to that value, or reach a leaf that represents the outcome. For instance, we can visualize the tree with the following chart:
Figure 9: Decision tree chart
We can also see the Decision Tree as a greedy algorithm, so it is fast; its time complexity depends on the number of attributes in the given data.
This algorithm can be understood as 3 simple steps:
• Select the best attribute using an Attribute Selection Measure (ASM) to split the records. An ASM is a rule that chooses the attribute with the best possible score to be the splitting attribute. The most popular methods are:
– Information Gain
– Gain Ratio
– Gini Index
• Make that attribute a decision node and break the dataset into smaller subsets.
• Keep building by repeating this process recursively for each child until one of these conditions is met:
– All the tuples belong to the same attribute value.
– There are no more remaining attributes.
– There are no more instances.
Now, we will use the following code to train:
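The original code listing was an image; a sketch of the unpruned first attempt, on synthetic stand-in data, looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes features; the unconstrained tree
# grows deep and overfits, as Figure 10 shows for the real data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("depth:", tree.get_depth())              # deep when unconstrained
print("accuracy:", tree.score(X_test, y_test))
```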
The accuracy is 67.578125 percent.
For better understanding, we will visualize the decision tree.
Figure 10: Visualizing Decision Tree 1st
As we can see, the resulting tree is unpruned and really hard to understand, so we need to optimize it.
• The first step is to choose a proper ASM. In this case, we will use Information Gain. Information Gain computes the difference between the entropy before the split and the average entropy after the split of the dataset, based on the given attribute values. Entropy measures the impurity of the input set.
Figure 11: IG formula
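Written out, the standard entropy-based information-gain formulas shown in Figure 11 are:

```latex
\mathrm{Info}(D)   = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) \qquad
\mathrm{Gain}(A)   = \mathrm{Info}(D) - \mathrm{Info}_A(D)
```

where the symbols are defined below.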
– Pi is the probability that an arbitrary tuple in D belongs to class Ci.
– Info(D) is the average amount of information needed to identify the class label of a tuple in D.
– |Dj|/|D| acts as the weight of the jth partition.
– InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.
The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
• The second step is to set a maximum depth of the tree, which can be used as a control variable to optimize the decision tree classifier. After some experiments, we found that 4 is the optimal depth, with an accuracy of 76.171875 percent.
And the visualized tree becomes:
Figure 12: Visualizing Decision Tree 2nd
This model is easier to understand and more explainable than the previous one.
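The two optimization steps above can be sketched together: information gain (the "entropy" criterion) as the ASM plus a maximum depth of 4, again on synthetic stand-in data. `export_text` gives a text rendering of the pruned tree in place of the graphic in Figure 12.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; entropy criterion = information gain as ASM.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                              random_state=0).fit(X_train, y_train)
print("accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # compact text form of the pruned tree
```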
3.3.5 SVM
The support vector machine (SVM) is used in both classification and regression. In an SVM model, the data points are represented in space and are categorized into groups, with points having similar properties falling into the same group. In a linear SVM, the given dataset is considered as p-dimensional vectors that can be separated by at most p-1 planes, called hyper-planes. These planes separate the data space or set the boundaries among the data groups for classification or regression problems, as in Figure 13. The best hyper-plane can be selected among the candidate hyper-planes on the basis of the distance between the two classes it separates. The plane that has the maximum margin between the two classes is called the maximum-margin hyper-plane.
A training set of n data points is defined as (x1, y1), ..., (xn, yn), where each xi is a real vector and yi can be 1 or -1, representing the class to which xi belongs. A hyper-plane can then be constructed so as to maximize the distance between the two classes y = 1 and y = -1.
Figure 13: Representation of Support vector machine
The support vector machine has proven its efficiency on both linear and non-linear data. The radial basis function has been implemented with this algorithm to classify non-linear data. The kernel function plays a very important role in mapping data into the feature space. Mathematically, the kernel trick (K) is defined as:
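A standard way to write the kernel trick, and the Gaussian (RBF) kernel discussed next, is:

```latex
K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j), \qquad
K_{\mathrm{RBF}}(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)
```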
A Gaussian function is also known as the Radial Basis Function (RBF) kernel. In the figure above, the input space is separated by the feature map (φ). Applying the kernel to the hyper-plane equations, we get a new decision function, where N represents the number of trained data points.
We ran both the radial and the linear SVM.
The accuracy for the radial SVM is 75 percent.
The accuracy for the linear SVM is 73.4375 percent.
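Both SVM variants can be sketched in one loop, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the scaled diabetes features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for kernel in ("rbf", "linear"):                 # radial and linear SVM
    svm = SVC(kernel=kernel).fit(X_train, y_train)
    results[kernel] = svm.score(X_test, y_test)  # test-set accuracy
print(results)
```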
3.4 Feature engineering
Feature engineering is a machine learning technique that leverages data to create new variables that aren't in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy.
We will use some techniques such as feature scaling, feature extraction, or feature selection, because:
• Too many features can affect the accuracy of the algorithm.
• Feature extraction means selecting only the important features, to improve the accuracy of the algorithm.
• It reduces training time and reduces overfitting.
We can select important features in the following ways:
– Correlation matrix, to select only uncorrelated features.
– RandomForestClassifier, to help provide important information about features.
3.4.1 Correlation Matrix
Figure 14: Heat map.
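The heat map in Figure 14 is a plot of the pairwise correlation matrix. A minimal sketch of computing that matrix, with a few inline rows standing in for the real dataset and plotting libraries omitted so it stays self-contained:

```python
import pandas as pd

# Inline stand-in rows; Figure 14 is a heat-map plot of exactly
# this kind of pairwise correlation matrix.
data = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI":     [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age":     [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})
corr = data.corr()
print(corr.round(2))   # values near +1 or -1 flag strongly correlated pairs
```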
Because we can't see any strong correlations, we have to use the Random Forest Classifier.
3.4.2 Random Forest Classifier
In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. they have low bias but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but it generally greatly boosts the performance of the final model.
Forests are like the pulling together of decision tree algorithm efforts: taking the work of many trees to improve on the performance of a single random tree. Though not quite similar, forests give an effect like that of k-fold cross validation.
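A random forest, and the per-feature importance scores that drive the feature selection described above, can be sketched on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Many deep trees trained on bootstrap samples, averaged to cut variance.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100,
                                random_state=0).fit(X_train, y_train)
print("accuracy:", forest.score(X_test, y_test))
print("importances:", forest.feature_importances_)  # sums to 1 across features
```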