

INTRODUCTION TO ARTIFICIAL INTELLIGENCE

AI FOR DIABETES PREDICTION

Instructor: Phạm Văn Hải

Group 13:
Trần Thị Lan Anh - 20215180
Dương Văn Hữu - 20215210
Võ Văn Thanh - 20215242
Vũ Đình Vũ - 20215258
Nguyễn Đình Vũ - 20215257

Hanoi - May, 2023


3.4.2 Random Forest Classifier

3.4.3 From Bagging to Random Forest

4 EXPERIMENTAL RESULTS

5 DEMO APP

6 EVALUATION

7 CONCLUSION

7.1 Conclusion


LIST OF FIGURES

1 Example of the dataset

2 Dataset statistical table

8 Effectiveness of each value

9 Decision tree chart

10 Visualizing Decision Tree 1st

11 IG formula

12 Visualizing Decision Tree 2nd

13 Representation of Support Vector Machine

14 Heat map

15 The data after removal


1. PROBLEM STATEMENT

1.1. Background and Motivation

Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body's cells for use as energy.

With diabetes, your body doesn't make enough insulin or can't use it as well as it should. When there isn't enough insulin, or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease.

According to figures announced by the International Diabetes Federation (IDF) in 2021, 537 million people worldwide have diabetes, equivalent to 1 in 10 adults aged 20-79; 1 in 6 babies born is affected by diabetes during fetal development. In particular, up to 50 percent of adults with diabetes go undiagnosed.

In Vietnam, Vice Minister of Health Nguyen Thi Lien Huong said that about 5 million people are living with diabetes; more than 55 percent of patients with diabetes currently have complications, of which 34 percent are cardiovascular complications, 39.5 percent are eye and neurological complications, and 24 percent are kidney complications.

Although diabetes is dangerous and there is still no cure for the disease, an early diagnosis helps patients begin treatment as soon as possible, and at the same time lead a healthy lifestyle to control the disease and avoid complications later.

Our group created a fascinating AI model project. Our model is implemented using supervised machine learning techniques on a diabetes dataset to understand patterns for the knowledge discovery process in diabetes. This initiative is the initial step in our ability to create sophisticated predictive modeling and analytics for diabetes and contribute to the future of medical technology development.

The main responsibility of AI predictive modeling is to access, learn from, and gather statistics. When the process is complete, the system is able to identify and understand the input data, and then classify the patients into diabetic and non-diabetic. In this work, the model follows the concepts of support vector machine (SVM, linear/RBF), k-nearest neighbors (k-NN), Logistic Regression, and Decision Tree.

The AI model is used to spot patterns in behavior that lead to either high or low blood sugar levels in diabetes patients. The tool can change patients' lives and gives healthcare providers the opportunity to analyze real-time data. In this way, the machine learning model is used to explore the patient's statistics and then make predictions.


2. MAIN PURPOSE


First, we input the data.

Figure 1: Example of the dataset


After some investigation, we obtain the following statistical table for the dataset:

Figure 2: Dataset statistical table

• count: the number of non-empty elements in the dataset.

• mean: mean value.

• std: standard deviation of each column.

• min: minimum value.

• 25/50/75 percent: percentiles; the given fraction of values is smaller than the displayed number.

By looking at the table, some values are missing or have a minimum equal to 0 (invalid), such as:

• Glucose.

• BloodPressure.

• SkinThickness.

• Insulin.

• BMI.

To solve this problem, we decided to replace these values with the column's mean or median, after observing the distribution of the data for each value: data with a skewed distribution will have zeros replaced by the median, and normally distributed data will have zeros replaced by the mean.
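As a sketch of this cleaning step (the toy rows and the skewed/normal assignment below are illustrative assumptions, not the report's exact data), the zero replacement can be done in pandas:

```python
import pandas as pd

# Toy rows standing in for the diabetes table; the real dataset has more
# columns (Pregnancies, BloodPressure, SkinThickness, Age, Outcome, ...).
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],        # roughly normal -> fill with mean
    "Insulin": [0, 0, 94, 168, 0],           # skewed -> fill with median
    "BMI":     [33.6, 26.6, 0.0, 28.1, 43.1],
})

# Compute the fill value from the non-zero entries only, then replace
# the invalid zeros in each column.
for col, stat in [("Glucose", "mean"), ("BMI", "mean"), ("Insulin", "median")]:
    nonzero = df.loc[df[col] != 0, col]
    fill = nonzero.mean() if stat == "mean" else nonzero.median()
    df[col] = df[col].replace(0, fill)
```

Computing the statistic over the non-zero entries keeps the invalid zeros from dragging the fill value down.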


Then we chart the data again as a column chart:

Figure 3: Column chart


Continuing the observation:

Figure 4: The outcome

The number of patients with diabetes is half the number of those without diabetes. We can't really see a relation between these two groups. To help models that rely on distance (KNN, for example), we need to rescale ("zoom out") the data.
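The report does not name the exact scaler used, but one common way to rescale distance-sensitive features is scikit-learn's StandardScaler, which centers each column and divides by its standard deviation. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (e.g. a glucose reading vs. a
# small-valued ratio); without scaling, distance is dominated by the first.
X = np.array([[148.0, 0.63],
              [85.0, 0.35],
              [183.0, 0.67],
              [89.0, 0.17]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0, std 1
```

After scaling, both features contribute comparably to Euclidean distance, which is what KNN needs.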


As an example, this is Figure 4 after the changes:

Figure 5: Data


Figure 6: The dataset after zoom out

3.3. Training model

3.3.1. Preparation

To train and test the models, one third of the dataset will be used for testing and the rest for training. We also import the models from the scikit-learn (sklearn) library.
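A minimal sketch of this preparation step (the synthetic data here just stands in for the real 768-patient, 8-feature table):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the diabetes table:
# 768 patients, 8 numeric features, binary outcome.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

# One third held out for testing, the rest used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)
```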

3.3.2. KNN - K nearest neighbours

KNN is a simple model that maps the sample to be predicted to a point in space and compares that point with its k nearest points to predict whether it belongs to class 0 (no diabetes) or class 1 (has diabetes).

Figure 7: An example with k = 3


Among the 3 points closest to the point to be predicted, there are 2 points of class B and 1 point of class A. The majority is class B, so the model will assign class B to the point to be predicted.

To train the model, we need to find which k is the most accurate, so we tested each k from 1 to 14:

The result is:

Figure 8: Effectiveness of each value

Clearly, k = 11 is the most accurate.

The accuracy is 76.5625 percent.
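The k sweep above can be sketched as follows (synthetic data stands in for the real table, so the best k here will not match the report's k = 11):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

# Fit one model per k and record its test accuracy.
scores = {}
for k in range(1, 15):  # k = 1 .. 14, as in the experiment above
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

best_k = max(scores, key=scores.get)
```

Plotting `scores` against k gives a curve like Figure 8, and `best_k` is the value read off its peak.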

3.3.3. Logistic regression

Logistic regression is one of the most common machine learning algorithms used for binary classification. It predicts the probability of occurrence of a binary outcome using a logistic function. It is a special case of linear regression in that it predicts the probabilities of the outcome using the log function.

The activation function defines the output for an input or set of inputs. The sigmoid function acts as an activation function used to add non-linearity to a model; in simple words, it decides which values to pass as output and which not to pass.

The sigmoid activation function is used most often because it performs its task with great efficiency: it is basically a probabilistic approach to decision making, and its output ranges between 0 and 1, so when we have to make a decision or predict an output, this bounded range makes the prediction more accurate. The formula of the sigmoid function is σ(z) = 1 / (1 + e^(−z)).

W0 is a bias term that skews the result of an algorithm in favor of or against an idea. Bias is considered a systematic error that occurs in the machine learning model itself due to incorrect assumptions in the ML process.

Cost function: given a as the actual result and y as the target result (0 or 1), the cost function shows how different they are. The formula of the cost function:

L(a, y) = −(y · log(a) + (1 − y) · log(1 − a)) (3)

The loss function is defined as the average of the cost function over the training set. The purpose of training is to decrease the value of the loss function to an acceptable value.

Logistic regression determines the probability of the default class. In this instance, we are modeling a health examination with 7 criteria:

1. Pregnancies.

2. Glucose.

3. Blood Pressure.

4. Skin Thickness.

5. Insulin.

6. BMI.

7. Age.

The logistic regression can be written as the probability of getting diabetes given the above criteria.


The accuracy is 73.046875 percent.
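The logistic regression run above can be sketched like this (synthetic stand-in data, so the accuracy will differ from the report's 73.046875 percent):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

# The model learns weights w and a bias w0; predict_proba applies the
# sigmoid to w.x + w0, and predict() thresholds that probability at 0.5.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # P(diabetic) for each patient
accuracy = model.score(X_test, y_test)
```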

3.3.4. Decision Tree

Decision Tree is one of the easiest and most popular classification methods to understand and implement. Classification is a process that consists of learning from the given data and predicting. Using Decision Tree classification, we can separate patients into groups based on each of their stats.

In a decision tree, starting from the root node, we make a comparison (a decision rule) based on an attribute value. We then follow the branch corresponding to that value, or reach a leaf that represents the outcome. For instance, we can visualize the tree with the following chart:

Figure 9: Decision tree chart

We can also see the Decision Tree as a greedy algorithm, so it runs fast; the time complexity depends on the number of attributes in the given data.

This algorithm can be understood as 3 simple steps:

• Select the best attribute using an Attribute Selection Measure (ASM) to split the records. An ASM is a rule that chooses the attribute with the best possible score to be the splitting attribute. The most popular methods are:

Information Gain.

Gain Ratio.

Gini Index.

• Make that attribute a decision node and break the dataset into smaller subsets.

• Keep building the tree by repeating this process recursively for each child until one of the following conditions is met:

All the tuples belong to the same attribute value.

There are no more remaining attributes.

There are no more instances.

Now, we will use the following code to train:
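The original code listing did not survive extraction; a plausible reconstruction with scikit-learn (default settings on synthetic stand-in data, so the accuracy will differ from the report's figure) is:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

# With no depth limit, the tree grows until every leaf is pure, which is
# why the first visualized tree below is unpruned and hard to read.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
```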

The accuracy is 67.578125 percent.

For better understanding, we will visualize the decision tree.

Figure 10: Visualizing Decision Tree 1st

As we can see, the resultant tree is unpruned and really hard to understand, so we need to optimize it.

• The first step is to choose a proper ASM. In this case, we will use Information Gain. Information Gain computes the difference between the entropy before the split and the average entropy after the split of the dataset, based on the given attribute values. Entropy measures the impurity of the input set.


Figure 11: IG formula

Information gain is computed as Gain(A) = Info(D) − InfoA(D), where Info(D) = −Σi pi log2(pi) and InfoA(D) = Σj (|Dj|/|D|) · Info(Dj).

Pi is the probability that an arbitrary tuple in D belongs to class Ci.

Info(D) is the average amount of information needed to identify the class label of a tuple in D.

|Dj|/|D| acts as the weight of the jth partition.

InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.

The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.

• The second step is to set a maximum depth for the tree, which can be used as a control variable to optimize the decision tree classifier. After some experiments, we found that 4 is the most optimal depth, reaching 76.171875 percent accuracy.
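The two optimization steps above can be sketched as follows (synthetic stand-in data; `criterion="entropy"` makes scikit-learn split by information gain, and `max_depth=4` is the depth the report found optimal):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

# Step 1: split by information gain. Step 2: cap the depth at 4 so the
# tree stays small enough to read and does not overfit.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                              random_state=0).fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
```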

And the visualized tree becomes:

Figure 12: Visualizing Decision Tree 2nd

This model is easier to understand and explain than the previous one.


3.3.5. SVM - Support Vector Machine

Support vector machine (SVM) is used in both classification and regression. In an SVM model, the data points are represented in space and categorized into groups, and points with similar properties fall into the same group. In linear SVM, the given dataset is considered a p-dimensional vector space that can be separated by hyper-planes of at most p−1 dimensions. These planes separate the data space or set the boundaries among the data groups for classification or regression problems, as in Figure 13. The best hyper-plane can be selected among the candidate hyper-planes on the basis of the distance between the two classes it separates. The plane that has the maximum margin between the two classes is called the maximum-margin hyper-plane.

A training set of n data points is defined as (x1, y1), ..., (xn, yn), where each xi is a real vector and each yi can be 1 or −1, representing the class to which xi belongs. A hyper-plane can be constructed so as to maximize the distance between the two classes y = 1 and y = −1; it is defined as the set of points x satisfying w · x − b = 0.

Figure 13: Representation of Support vector machine.


Support vector machine has proven its efficiency on both linear and non-linear data. The radial basis function has been implemented with this algorithm to classify non-linear data. The kernel function plays a very important role in mapping data into the feature space. Mathematically, the kernel trick (K) is defined as K(xi, xj) = φ(xi) · φ(xj).

A Gaussian function is also known as the Radial Basis Function (RBF) kernel. In the figure above, the input space is separated by the feature map (φ). By applying equations (1) and (2):

We get:

By applying equation (3):

We get a new function, where N represents the number of trained data points.


We ran both the radial and the linear SVM.

Accuracy for radial SVM is 75 percent.

Accuracy for linear SVM is 73.4375 percent.

Model        Accuracy (%)
Radial SVM   75
Linear SVM   73.4375
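Both runs can be sketched like this (synthetic stand-in data, so the accuracies will differ from the table above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

# kernel="rbf" applies the Gaussian kernel trick described above;
# kernel="linear" searches for the maximum-margin hyper-plane directly.
results = {}
for kernel in ("rbf", "linear"):
    svc = SVC(kernel=kernel).fit(X_train, y_train)
    results[kernel] = svc.score(X_test, y_test)
```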

3.4. Feature engineering

Feature engineering is a machine learning technique that leverages data to create new variables that aren't in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy.

We will use techniques such as feature scaling, feature extraction, or feature selection, because:

• Too many features can affect the accuracy of the algorithm.

• Feature extraction means selecting only the important features to improve the accuracy of the algorithm.

• It reduces training time and reduces overfitting.

We can select important features in the following ways:

Correlation matrix, to select only uncorrelated features.

RandomForestClassifier, to help provide important information about features.

3.4.1. Correlation Matrix


Figure 14: Heat map.

Because we can't see any clear correlations, we have to use the Random Forest Classifier.
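The correlation check behind the heat map can be sketched with pandas (the column names and the deliberately correlated `BMI2` column are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Random stand-in columns for the real features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["Glucose", "BMI", "Age", "Insulin"])
df["BMI2"] = df["BMI"] * 2 + 1  # a deliberately correlated copy

corr = df.corr().abs()
# Flag any pair of distinct features whose |correlation| exceeds 0.8;
# one feature from each flagged pair could then be dropped.
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and corr.loc[a, b] > 0.8]
```

Visualizing `corr` with a heat map (as in Figure 14) shows the same information at a glance.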

3.4.2. Random Forest Classifier

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e., they have low bias but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in bias and some loss of interpretability, but generally greatly boosts the performance of the final model.

Forests are like the pulling together of decision tree algorithm efforts: taking the teamwork of many trees improves the performance over a single random tree. Though not quite similar, forests give the effect of a k-fold cross validation.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples. For b = 1, ..., B:

• Sample, with replacement, n training examples from X, Y; call these Xb, Yb.

• Train a classification or regression tree fb on Xb, Yb.

After training, predictions for unseen samples x′ can be made by averaging the predictions from all the individual regression trees on x′:

f̂(x′) = (1/B) · Σ (b = 1..B) fb(x′)

or by taking the majority vote in the case of classification trees.

This bootstrapping procedure leads to better model performance because it decreases the variance of the model without increasing the bias. Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x′:

σ = sqrt( Σ (b = 1..B) (fb(x′) − f̂(x′))² / (B − 1) )
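The bagging procedure above is what scikit-learn's RandomForestClassifier implements; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

# B = 100 bagged trees; for classification, the forest predicts by
# (probability-averaged) majority vote over the individual trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)

# Impurity-based importances, usable for the feature selection described
# in section 3.4; scikit-learn normalizes them to sum to 1.
importances = forest.feature_importances_
```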
