INTRODUCTION TO ARTIFICIAL INTELLIGENCE
AI FOR DIABETES PREDICTION
Instructor: Phạm Văn Hải
Group 13: Trần Thị Lan Anh - 20215180
Dương Văn Hữu - 20215210
Võ Văn Thanh - 20215242
Vũ Đình Vũ - 20215258
Nguyễn Đình Vũ - 20215257
Hanoi - May, 2023
TABLE OF CONTENTS
1 PROBLEM STATEMENT
1.1 Background and Motivation
1.2 Objectives
2 MAIN PURPOSE
3 DETAILED DESIGN
3.1 Libraries
3.2 Dataset
3.3 Training model
3.3.1 Preparation
3.3.2 KNN - K nearest neighbours
3.3.3 Logistic regression
3.3.4 Decision Tree
3.3.5 SVM
3.3.6 Evaluation
3.4 Feature engineering
3.4.1 Correlation Matrix
3.4.2 Random Forest Classifier
3.4.3 From Bagging to Random Forest
4 EXPERIMENTAL RESULTS
5 DEMO APP
6 EVALUATION
7 CONCLUSION
7.1 Conclusion
7.1.1 Lesson learned
7.1.2 Future work
7.2 References
7.3 Setting environment
7.4 Steps implementation
LIST OF FIGURES
1 Example of the dataset
2 Dataset statistical table
3 Column chart
4 The outcome
5 Data
6 The dataset after zoom out
7 An example with k = 3
8 Effectiveness of each k value
9 Decision tree chart
10 Visualizing Decision Tree 1st
11 IG formula
12 Visualizing Decision Tree 2nd
13 Representation of Support vector machine
14 Heat map
15 The data after removal
1 PROBLEM STATEMENT
1.1 Background and Motivation
Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body's cells for use as energy.
With diabetes, your body doesn't make enough insulin or can't use it as well as it should. When there isn't enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease.
According to figures announced by the International Diabetes Federation (IDF) in 2021, the world has 537 million people with diabetes, equivalent to 1 in 10 adults aged 20-79 years old; 1 in 6 babies born is affected by diabetes during fetal development. In particular, up to 50 percent of adults with diabetes go undiagnosed.
In Vietnam, Vice Minister of Health Nguyen Thi Lien Huong said that about 5 million people are suffering from diabetes; more than 55 percent of patients with diabetes currently have complications, of which 34 percent are cardiovascular complications, 39.5 percent are eye and neurological complications, and 24 percent are kidney complications.
Although diabetes is really dangerous, there is still no cure for this disease, so an early diagnosis helps patients start treatment as soon as possible and, at the same time, lead a healthy lifestyle to control the disease and avoid complications later.
1.2 Objectives
Our group created a fascinating AI model project. Our novel model is implemented using supervised machine learning techniques on a diabetes dataset to understand patterns for the knowledge discovery process in diabetes. This initiative is the initial step in our ability to create sophisticated predictive modeling and analytics for diabetes and to contribute to the future of medical technology development.
The main responsibility of the AI predictive model is to access the data, learn from it, and gather statistics. When this process is complete, the system is able to identify and understand the input data, and then classify the patients into diabetic and non-diabetic. In this work, the model follows the concepts of the support vector machine (SVM - Linear/RBF), k-nearest neighbors (k-NN), Logistic Regression, and Decision Tree.
The AI model is used to spot patterns in behavior that lead to either high or low blood sugar levels in diabetes patients. It can change patients' lives and gives healthcare providers the opportunity to analyze real-time data. In this way, the machine learning model is used to explore a patient's statistics and then make predictions.
2 MAIN PURPOSE
3 DETAILED DESIGN
3.1 Libraries
To implement the model, we will use numpy, pandas and sklearn:
• Numpy: a Python library used for working with arrays.
• Pandas: an open-source Python package that is widely used for data science, data analysis, and machine learning tasks.
• Sklearn: the most useful and robust library for machine learning in Python.
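A minimal sanity check shows how the three libraries fit together: a NumPy array becomes a pandas DataFrame, which feeds a scikit-learn model. The column names and toy values here are placeholders, not taken from the real dataset.

```python
# A NumPy array wrapped in a pandas DataFrame, fed to a sklearn model.
# Column names and values are illustrative placeholders only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([0, 0, 1, 1])
df = pd.DataFrame(X, columns=["feature_a", "feature_b"])
model = LogisticRegression().fit(df, y)
print(model.predict(df))  # predictions for the four training rows
```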
3.2 Dataset
First, we load the data.
Figure 1: Example of the dataset
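The loading step can be sketched as follows. Normally the dataset would be read from disk with `pd.read_csv` (the file name `diabetes.csv` is an assumption); a few inline Pima-style rows stand in for the file here so the sketch runs on its own.

```python
import pandas as pd

# Normally: data = pd.read_csv("diabetes.csv")   # path is an assumption.
# Inline stand-in rows with Pima-style columns so the sketch is runnable:
data = pd.DataFrame({
    "Pregnancies":   [6, 1, 8, 1, 0],
    "Glucose":       [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age":           [50, 31, 32, 21, 33],
    "Outcome":       [1, 0, 1, 0, 1],
})
print(data.head())       # Figure 1 shows rows like these
print(data.describe())   # Figure 2: count / mean / std / min / 25% / 50% / 75% / max
```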
After some investigation, we have this dataset statistical table:
Figure 2: Dataset statistical table
• count: the number of non-empty elements in the dataset.
• mean: mean value.
• std: standard deviation of each column.
• min: minimum value.
• 25/50/75 percent: the value below which 25/50/75 percent of the observations fall.
By looking at the table, we can see that some values are missing or have a minimum equal to 0 (invalid), such as:
Then we summarize the data again in a column chart.
Figure 3: Column chart
Continuing the observation:
Figure 4: The outcome
The number of patients with diabetes is half the number of those without diabetes.
We can't really see the relationship between these two groups of data. To help models which use distance (KNN, for example), we need to rescale ("zoom out") the data.
As an example, this is Figure 4 after the changes:
Figure 5: Data
Figure 6: The dataset after zoom out
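The rescaling step can be sketched like this. Standardizing each column to zero mean and unit variance is one common way to "zoom out"; the exact scaler used in the project is an assumption, and the two toy columns stand in for Glucose and BMI, which live on very different ranges.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns such as Glucose (~0-200) and BMI (~0-60) distort distance-based
# models like KNN.  StandardScaler rescales each column to mean 0 and
# standard deviation 1; which scaler the project used is an assumption.
X = np.array([[148.0, 33.6],
              [ 85.0, 26.6],
              [183.0, 23.3],
              [ 89.0, 28.1]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # and standard deviation ~1
```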
3.3 Training model
3.3.1 Preparation
The data is split into a training set and a test set, and each label in the Outcome column is equal to 0 (no diabetes) or 1 (has diabetes).
3.3.2 KNN - K nearest neighbours
Figure 7: An example with k = 3
Among the 3 points closest to the point to be predicted, there are 2 points of class B and 1 point of class A. The majority is class B, so the model will choose class B to assign to the point to be predicted.
To train the model, we need to find which k is the most accurate, so we tested each k from 1 to 14:
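The search over k can be sketched as follows. A synthetic dataset stands in for the scaled diabetes data so the snippet runs on its own; on the real data the report found k = 11 best.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled diabetes features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in range(1, 15):                         # every k from 1 to 14
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)      # test-set accuracy

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```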
The result is:
Figure 8: Effectiveness of each k value
Clearly, k = 11 is the most accurate. The accuracy is 76.5625 percent.
3.3.3 Logistic regression
Logistic regression is one of the most common machine learning algorithms used for binary classification. It predicts the probability of occurrence of a binary outcome using a logistic function. It is closely related to linear regression, as it predicts the probabilities of the outcome through the log-odds of a linear function of the inputs.
An activation function defines the output for an input or set of inputs. The sigmoid function acts as an activation function which is used to add non-linearity to a model; in simple words, it decides which value to pass as output and which not to pass.
The sigmoid activation function is used mostly because it does its task with great efficiency: it is basically a probabilistic approach towards decision making, and its output ranges between 0 and 1, so it can be read directly as a predicted probability. The formula of the sigmoid function is sigmoid(z) = 1 / (1 + e^(-z)).
The term w0 is the bias, which skews the result of an algorithm in favor of or against an idea. Bias is also considered a systematic error that occurs in the machine learning model itself due to incorrect assumptions in the ML process.
Cost function: given that a is the actual result and y is the target result (0 or 1), the cost function shows how different they are. The formula of the cost function is
L(a, y) = -(y * log(a) + (1 - y) * log(1 - a)) (3)
The loss function is defined as the average of the cost function over all samples. The purpose of training is to decrease the value of the loss function to an acceptable value.
Logistic regression determines the probability of the default class. In this instance, we are modeling the health examination with 7 criteria:
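The training step can be sketched as follows. Synthetic data with 7 features stands in for the 7 examination criteria so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 7 features mirroring the 7 examination criteria.
X, y = make_classification(n_samples=400, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = logreg.score(X_test, y_test)
print(f"accuracy = {accuracy:.4%}")
```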
The accuracy is 73.046875 percent.
3.3.4 Decision Tree
Decision Tree is one of the easiest and most popular classification algorithms to understand and implement. Classification is a process which contains two steps: learning and predicting from the given data. By using Decision Tree classification, we can separate patients into groups based on each of their stats.
In a decision tree, starting from the root node, we make a comparison (or apply a decision rule) based on an attribute value. Then we follow the branch corresponding to that value, or reach a leaf that represents the outcome. For instance, we can visualize the tree with the following chart:
Figure 9: Decision tree chart
We can also see the Decision Tree as a greedy algorithm, so it is fast; its time complexity depends on the number of attributes in the given data.
This algorithm can be understood as 3 simple steps:
• Select the best attribute using an Attribute Selection Measure (ASM) to split the records. An ASM is a rule that chooses the attribute with the best possible score to be the splitting attribute. The most popular methods are:
– Information Gain
– Gain Ratio
– Gini Index
• Make that attribute a decision node and break the dataset into smaller subsets.
• Keep building by repeating this process recursively for each child until one of these conditions is met:
– All the tuples belong to the same attribute value.
– There are no more remaining attributes.
– There are no more instances.
Now, we will use the following code to train:
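The original code listing was an image; a sketch of the unpruned first attempt, on synthetic stand-in data, looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes features; the unconstrained tree
# grows deep and overfits, as Figure 10 shows for the real data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("depth:", tree.get_depth())              # deep when unconstrained
print("accuracy:", tree.score(X_test, y_test))
```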
The accuracy is 67.578125 percent.
For better understanding, we will visualize the decision tree.
Figure 10: Visualizing Decision Tree 1st
As we can see, the resulting tree is unpruned and really hard to understand, so we need to optimize it.
• The first step is to choose a proper ASM. In this case, we will use Information Gain. Information Gain computes the difference between the entropy before the split and the average entropy after the split of the dataset, based on the given attribute values. Entropy measures the impurity of the input set.
Figure 11: IG formula
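Written out, the standard entropy-based information-gain formulas shown in Figure 11 are:

```latex
\mathrm{Info}(D)   = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) \qquad
\mathrm{Gain}(A)   = \mathrm{Info}(D) - \mathrm{Info}_A(D)
```

where the symbols are defined below.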
– Pi is the probability that an arbitrary tuple in D belongs to class Ci.
– Info(D) is the average amount of information needed to identify the class label of a tuple in D.
– |Dj|/|D| acts as the weight of the jth partition.
– InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.
The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
• The second step is to set a maximum depth of the tree, which can be used as a control variable to optimize the decision tree classifier. After some experiments, we found that 4 is the optimal depth, with an accuracy of 76.171875 percent.
And the visualized tree becomes:
Figure 12: Visualizing Decision Tree 2nd
This model is easier to understand and more explainable than the previous one.
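The two optimization steps above can be sketched together: information gain (the "entropy" criterion) as the ASM plus a maximum depth of 4, again on synthetic stand-in data. `export_text` gives a text rendering of the pruned tree in place of the graphic in Figure 12.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; entropy criterion = information gain as ASM.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                              random_state=0).fit(X_train, y_train)
print("accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # compact text form of the pruned tree
```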
3.3.5 SVM
The support vector machine (SVM) is used in both classification and regression. In an SVM model, the data points are represented in space and are categorized into groups, with points having similar properties falling into the same group. In a linear SVM, the given dataset is considered as p-dimensional vectors that can be separated by at most p-1 planes, called hyper-planes. These planes separate the data space or set the boundaries among the data groups for classification or regression problems, as in Figure 13. The best hyper-plane can be selected among the candidate hyper-planes on the basis of the distance between the two classes it separates. The plane that has the maximum margin between the two classes is called the maximum-margin hyper-plane.
A training set of n data points is defined as (x1, y1), ..., (xn, yn), where each xi is a real vector and yi can be 1 or -1, representing the class to which xi belongs. A hyper-plane can then be constructed so as to maximize the distance between the two classes y = 1 and y = -1.
Figure 13: Representation of Support vector machine
The support vector machine has proven its efficiency on both linear and non-linear data. The radial basis function has been implemented with this algorithm to classify non-linear data. The kernel function plays a very important role in mapping data into the feature space. Mathematically, the kernel trick (K) is defined as:
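A standard way to write the kernel trick, and the Gaussian (RBF) kernel discussed next, is:

```latex
K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j), \qquad
K_{\mathrm{RBF}}(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)
```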
A Gaussian function is also known as the Radial Basis Function (RBF) kernel. In the figure above, the input space is separated by the feature map (φ). Applying the kernel to the hyper-plane equations, we get a new decision function, where N represents the number of trained data points.
We ran both the radial and the linear SVM.
The accuracy for the radial SVM is 75 percent.
The accuracy for the linear SVM is 73.4375 percent.
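Both SVM variants can be sketched in one loop, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the scaled diabetes features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for kernel in ("rbf", "linear"):                 # radial and linear SVM
    svm = SVC(kernel=kernel).fit(X_train, y_train)
    results[kernel] = svm.score(X_test, y_test)  # test-set accuracy
print(results)
```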
3.4 Feature engineering
Feature engineering is a machine learning technique that leverages data to create new variables that aren't in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy.
We will use some techniques such as feature scaling, feature extraction, or feature selection, because:
• Too many features can affect the accuracy of the algorithm.
• Feature extraction means selecting only the important features, to improve the accuracy of the algorithm.
• It reduces training time and reduces overfitting.
We can select important features in the following ways:
– Correlation matrix, to select only uncorrelated features.
– RandomForestClassifier, to help provide important information about features.
3.4.1 Correlation Matrix
Figure 14: Heat map.
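The heat map in Figure 14 is a plot of the pairwise correlation matrix. A minimal sketch of computing that matrix, with a few inline rows standing in for the real dataset and plotting libraries omitted so it stays self-contained:

```python
import pandas as pd

# Inline stand-in rows; Figure 14 is a heat-map plot of exactly
# this kind of pairwise correlation matrix.
data = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI":     [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age":     [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})
corr = data.corr()
print(corr.round(2))   # values near +1 or -1 flag strongly correlated pairs
```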
Because we can't see any strong correlations, we have to use the Random Forest Classifier.
3.4.2 Random Forest Classifier
In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. they have low bias but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but it generally greatly boosts the performance of the final model.
Forests are like the pulling together of decision tree algorithm efforts: taking the work of many trees to improve on the performance of a single random tree. Though not quite similar, forests give an effect like that of k-fold cross validation.
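A random forest, and the per-feature importance scores that drive the feature selection described above, can be sketched on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Many deep trees trained on bootstrap samples, averaged to cut variance.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100,
                                random_state=0).fit(X_train, y_train)
print("accuracy:", forest.score(X_test, y_test))
print("importances:", forest.feature_importances_)  # sums to 1 across features
```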