Graduation Thesis: Lung Cancer Prediction Using Convolutional Neural Network and Machine Learning Algorithms

DOCUMENT INFORMATION

Basic information

Title: Lung Cancer Prediction Using Machine Learning Algorithms
Authors: Nguyen Thanh Truc, Tran Thi Cam Tu
Advisors: Dr. Cao Thi Nhan, MSc. Nguyen Thi Kim Phung
University: University of Information Technology
Major: Information Systems
Document type: Thesis
Year of publication: 2023
City: Ho Chi Minh City
Number of pages: 75
File size: 40.73 MB

Contents


VIETNAM NATIONAL UNIVERSITY – HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

FACULTY OF INFORMATION SYSTEMS

Nguyen Thanh Truc – 19522417

Tran Thi Cam Tu – 19522458

Lung Cancer Prediction Using Machine Learning

HO CHI MINH CITY, DECEMBER 2023


ASSESSMENT COMMITTEE

The Assessment Committee is established under the Decision , dated , by the Rector of the University of Information Technology:

1. Assoc. Prof. Nguyen Dinh Thuan – Chairman

2. Dr. Ngo Duc Thanh – Secretary

3. Dr. Nguyen Thanh Binh – Member


We would like to sincerely express our gratitude to the University of Information Technology for creating favorable learning conditions and providing crucial resources throughout our academic journey. Our deep appreciation goes to the members of the academic board, teachers, and mentors from the Faculty of Information Systems and other departments, who played a vital role in shaping our understanding and supporting us through these years.

We would like to express our profound gratitude to Dr. Cao Thi Nhan and MSc. Nguyen Thi Kim Phung for their exceptional guidance and unwavering support throughout the entire process of crafting this thesis. Their expertise, encouragement, and constructive feedback have been invaluable assets that significantly contributed to the quality and depth of our research.

We want to convey our sincere dedication and unwavering effort during the thesis-writing process. Though the journey was not without difficulties and challenges, we endeavored to overcome each obstacle with perseverance and enthusiasm. We hope that the outcome of the thesis reflects our dedication and commitment. At the same time, we ask for understanding and compassion from our mentors if any shortcomings are identified in the final product. We sincerely thank our mentors for their support and guidance throughout this journey, and we hope it is evident in the final achievement of the thesis.

We sincerely appreciate it!

Nguyen Thanh Truc Tran Thi Cam Tu


ADVISORS' COMMENTS


REVIEWER COMMENTS


4 Report Outline

CHAPTER 2: BACKGROUND AND THEORY

1 Basic Knowledge of Lung Cancer
2 Related Works
3 Model Evaluation Metrics

CHAPTER 3: EXPERIMENTS AND RESULTS

1 Dataset
2 Model Architecture
2.1 Convolutional Neural Networks (CNN)
2.2 VGG16 (Visual Geometry Group 16)
2.3 RESNET50
2.4 Logistic Regression
2.5 Random Forest
2.6 Support Vector Machine (SVM)
2.7 Hybrid Deep Learning and Machine Learning Models
3 System Configuration
4 Model Evaluation Results
4.1 Training Model Configuration
4.2 Deep Learning Models
4.3 Machine Learning Models
5 Hybrid Models
5.1 Hybrid CNN 3 Layers – Machine Learning
5.2 Hybrid CNN 4 Layers – Machine Learning
5.3 Hybrid RESNET50 – Machine Learning
5.4 Hybrid VGG16 – Machine Learning
6 Summary of Model Training Results
6.1 Deep Learning Models Results
6.2 Machine Learning Models Results
6.3 Hybrid Models Results

CHAPTER 4: CONCLUSIONS

REFERENCES

LIST OF ACRONYMS AND ABBREVIATIONS

No Acronyms Meaning

1 CNN Convolutional Neural Network

2 VGG16 Visual Geometry Group 16

3 RF Random Forest

4 LR Logistic Regression

5 SVM Support Vector Machine


LIST OF FIGURES

Figure 3: The four stages of lung cancer progression [28]
Figure 4: Definition of Confusion Matrix [30]
Figure 5: Original images for the three categories: Benign, Malignant and Normal
Figure 6: Image after Data Augmentation
Figure 7: Integrated Pipeline for Lung Cancer Prediction of Individual Models
Figure 8: Integrated Pipeline for Lung Cancer Prediction of Hybrid Models
Figure 9: CNN Architecture [23]
Figure 10: VGG16 Architecture [24]
Figure 11: Summary of the VGG16 Model [25]
Figure 12: RESNET50 Architecture [26]
Figure 13: Logistic Regression Architecture [27]
Figure 14: Random Forest Architecture
Figure 15: Support Vector Machine Architecture
Figure 16: Hybrid model CNN with ML Architecture [37]
Figure 17: Accuracy and Loss Per Epoch of CNN 3 Layers
Figure 18: Confusion Matrix of CNN 3 Layers (Test Set)
Figure 19: Accuracy and Loss Per Epoch of CNN 4 Layers
Figure 20: Confusion Matrix of CNN 4 Layers (Test Set)
Figure 21: Accuracy and Loss Per Epoch of RESNET50
Figure 22: Confusion Matrix of RESNET50 (Test Set)
Figure 23: Accuracy and Loss Per Epoch of VGG16
Figure 24: Confusion Matrix of VGG16 (Test Set)
Figure 25: Confusion Matrix of Random Forest (Test Set)
Figure 26: Confusion Matrix of Logistic Regression Model
Figure 27: Confusion Matrix of Support Vector Machine Model
Figure 28: Confusion Matrix of CNN 3 Layers – Random Forest
Figure 29: Confusion Matrix of CNN 3 Layers – Logistic Regression
Figure 30: Confusion Matrix of CNN 3 Layers – SVM
Figure 31: Confusion Matrix of CNN 4 Layers – Random Forest
Figure 32: Confusion Matrix of CNN 4 Layers – Logistic Regression
Figure 33: Confusion Matrix of CNN 4 Layers – SVM
Figure 34: Confusion Matrix of RESNET50 – Random Forest
Figure 35: Confusion Matrix of RESNET50 – Logistic Regression
Figure 36: Confusion Matrix of RESNET50 – SVM
Figure 37: Confusion Matrix of VGG16 – Random Forest
Figure 38: Confusion Matrix of VGG16 – Logistic Regression
Figure 39: Confusion Matrix of VGG16 – SVM


LIST OF TABLES

Table 1: Data Summary
Table 2: The model training parameters
Table 3: Classification Report for CNN 3 Layers Model (Test Set)
Table 4: Classification Report for CNN 4 Layers Model (Test Set)
Table 5: Classification Report for RESNET50 Model (Test Set)
Table 6: Classification Report for VGG16 Model (Test Set)
Table 7: Classification Report for Random Forest Model
Table 8: Classification Report for Logistic Regression Model
Table 9: Classification Report for SVM Model
Table 10: Classification Report for hybrid model CNN 3 Layers – Random Forest
Table 11: Classification Report for hybrid model CNN 3 Layers – Logistic Regression
Table 12: Classification Report for hybrid model CNN 3 Layers and SVM
Table 13: Classification Report for hybrid model CNN 4 Layers and Random Forest
Table 14: Classification Report for hybrid model CNN 4 Layers and Logistic Regression
Table 15: Classification Report for hybrid model CNN 4 Layers and SVM
Table 16: Classification Report for hybrid model RESNET50 and Random Forest
Table 17: Classification Report for hybrid model RESNET50 and Logistic Regression
Table 18: Classification Report for hybrid model RESNET50 and SVM
Table 19: Classification Report for hybrid model VGG16 and Random Forest
Table 20: Classification Report for hybrid model VGG16 and Logistic Regression
Table 21: Classification Report for hybrid model VGG16 and SVM
Table 22: Summary Results of Deep Learning
Table 23: Summary Results of Machine Learning
Table 24: Summary Results of Hybrid Models


CHAPTER 1: INTRODUCTION

1 General Introduction

In an era brimming with the excitement of the Industry 4.0 revolution, rapid advancements in computer science and artificial intelligence have unlocked vast potential for applying technology to all aspects of life, particularly healthcare. By harnessing big data, rapid information processing, and machine learning capabilities, the healthcare industry has paved the way for transforming how we understand and manage our health.

One of the most significant and pressing challenges the healthcare industry faces today is accurately diagnosing and detecting diseases, especially cancer. Machine learning has become a crucial tool supporting physicians and healthcare experts. Its application in healthcare brings numerous benefits, including efficient processing of large volumes of data, detection of intricate features that may be challenging for humans to recognize, and swift decision support.

However, to ensure safety and reliability, predictive disease systems must always be developed under the supervision and control of healthcare experts. Such systems not only save time but also present crucial opportunities to enhance the quality of healthcare and increase survival chances for those afflicted.

With this goal in mind, research into the combination of Convolutional Neural Networks (CNN) and traditional machine learning algorithms becomes essential. This convergence not only opens up a new realm of research but also promises to contribute significantly to improving accuracy and efficiency in the diagnostic process.

Lung cancer, particularly in the aftermath of the COVID-19 pandemic, has emerged as a top research priority. Integrating the power of CNNs with traditional machine learning algorithms promises to optimize the prediction and diagnosis process, introducing new prospects for enhancing treatment efficacy and increasing survival opportunities for those affected.


2 The Rationale for Choosing the Topic

The decision to choose the topic "Lung Cancer Prediction using Convolutional Neural Networks and Machine Learning Algorithms" stems from a profound understanding of the importance of researching and applying technology in healthcare. Lung cancer, one of the most daunting challenges in modern healthcare, poses an increasingly significant problem in early diagnosis and effective treatment.

The choice of Convolutional Neural Networks (CNN) and traditional machine learning algorithms not only harnesses the power of artificial intelligence in processing and analyzing medical images but also represents a groundbreaking step in medical technology. It opens vast prospects, promising a substantial improvement in disease diagnosis and thereby enhancing treatment and quality of life for those affected.

Especially, this project is not just an opportunity to develop in-depth skills and knowledge in machine learning and artificial intelligence, but also part of a global mission to enhance the diagnosis and treatment of cancer, a pressing global health issue. The integration of technology and healthcare not only explores new research areas but also actively contributes to a larger mission: protecting and improving the health of the global community.

Furthermore, the project aims at practical and humane goals, including aiding patients in early disease detection, optimizing the treatment process, and providing comprehensive information. If successfully implemented, the predictive system developed in this project could become a valuable tool, assisting patients and the healthcare community in raising awareness about their health status.

The fusion of predictive technology and healthcare processes will bring significant benefits to patients, enabling them to be more proactive in managing their personal health. In doing so, we not only provide diagnostic solutions but also open opportunities to create a positive healthcare environment where individuals can easily access information and comprehensive healthcare, empowering them to make informed decisions in maintaining their health.


3 Objective and Scope

We build deep learning models and machine learning models for lung cancer prediction. The goal is not only to provide predictions but also to improve healthcare quality through accurate diagnostic decisions.

3.1 Methodology

• Building Models:

- Develop individual deep learning models such as CNN, RESNET50, VGG16 and machine learning models such as Random Forest, Logistic Regression, Support Vector Machine.

- Utilize CNN models such as VGG16 and RESNET50 in conjunction with machine learning algorithms like SVM, LR, and RF.

- Input data consists of lung CT scan images; the output is a classification result (Benign, Malignant, or Normal).

• Performance Evaluation: Employ metrics such as precision, recall, F1-score, and the confusion matrix, and conduct 5-fold cross-validation to assess model performance.
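The hybrid setup in the methodology (a CNN used as a feature extractor whose outputs feed a classical classifier such as SVM, LR, or RF) can be sketched as follows. This is a minimal illustration on synthetic arrays, not the thesis code: `extract_features` is a hypothetical stand-in for the penultimate-layer activations of a trained CNN such as VGG16 or RESNET50.

```python
import numpy as np
from sklearn.svm import SVC

def extract_features(images, kernel):
    """Stand-in for a trained CNN's penultimate layer: one valid 2-D
    convolution per image, summarized by simple pooling statistics."""
    feats = []
    kh, kw = kernel.shape
    for img in images:
        h, w = img.shape
        out = np.empty((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
        feats.append([out.mean(), out.std(), out.max(), out.min()])
    return np.array(feats)

rng = np.random.default_rng(0)
# Toy "CT slices": class 0 is dark, class 1 is bright.
images = np.concatenate([rng.normal(0.2, 0.05, (40, 16, 16)),
                         rng.normal(0.8, 0.05, (40, 16, 16))])
labels = np.array([0] * 40 + [1] * 40)

kernel = np.ones((3, 3)) / 9.0        # simple averaging filter
X = extract_features(images, kernel)  # real CNN features would go here

clf = SVC(kernel="rbf")               # the machine-learning stage
clf.fit(X, labels)
print(clf.score(X, labels))           # training accuracy on the toy data
```

In the actual pipeline, the only change is where `X` comes from: instead of the toy convolution, the flattened penultimate-layer outputs of the trained deep model are passed to the classifier.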

3.2 Comparison and Evaluation

• Model Performance Evaluation: Compare and evaluate the performance of the hybrid models to provide insights into their strengths and weaknesses.

• Evaluate the performance of, and compare, all models.

4 Report Outline

Chapter 1: Introduction

- Introduce the topic, reasons for choosing the research topic, and the research

objectives

Chapter 2: Background And Theory

- Introduce Lung Cancer, the causes leading to lung cancer


- Literature Review on relevant articles.

- Introducing the evaluation metrics used for the models

Chapter 3: Experiments And Results

- Provide an overview of the dataset, data augmentation techniques

- Explain the process of splitting the data into training and testing sets

- Introduce and present the methods for constructing deep learning, machine learning, and hybrid models

- Present the configuration of the computer used in the study

- List the parameters used in training the models

- Evaluate and provide feedback on the training model results through model

evaluation metrics

Chapter 4: Conclusion

- Present the results achieved during the research

- Identify limitations and propose directions for future development

CHAPTER 2: BACKGROUND AND THEORY

1 Basic Knowledge of Lung Cancer

1.1 What is Lung Cancer?

Lung cancer [1] is a form of cancer that originates from cells in the lungs. The disease is often closely associated with smoking, although cases also occur in non-smokers. Normal cells in the lungs undergo abnormal transformations and uncontrollable growth, forming tumors.

Symptoms of lung cancer typically become apparent in later stages, including a persistent cough, difficulty breathing, chest pain, fatigue, and sudden weight loss. Mild stages of the disease often exhibit few symptoms, while advanced stages may spread to surrounding lung areas and other organs.


Various factors contribute to the development of lung cancer, with smoking being a major one. The chemicals in tobacco are known to be among the primary causes. Additionally, exposure to toxic substances such as radon, asbestos, and pollutants in the workplace can increase the risk of developing the disease.

Smoking can elevate the risk of lung cancer, and prolonged exposure to other carcinogenic substances also plays a significant role in the disease's progression.

Figure 1: Lung Cancer [29]

1.2 What are Malignant Lung Tumors?

A malignant lung tumor is characterized by the uncontrolled growth of abnormal cells, with the ability to invade surrounding tissues and the potential to form distant lesions (metastasis). Cells within such a tumor often carry genetic alterations, leading to abnormal interactions and organizational irregularities. Malignant tumors develop and can spread quickly, making comprehensive treatment challenging, with a high recurrence rate if detected late.


1.3 What are Benign Lung Tumors?

If the cells within a tumor are normal, it is considered benign. A benign lung tumor lacks the ability to invade surrounding tissues, typically exhibits uniform density, and does not spread to other areas of the body. Cells within benign tumors usually maintain better control over growth, with no significant genetic alterations.

1.4 Causative Agents of Lung Cancer

1.4.1 Primary Causes of Lung Cancer

Cigarette smoking is the primary risk factor for lung cancer. According to the Centers for Disease Control and Prevention (CDC) [4], in the United States approximately 80% to 90% of lung cancer-related deaths are linked to smoking. In Vietnam, tobacco is the cause of 90% of lung cancer cases [5]. Besides smoking, there are other main causes of lung cancer [2] [4], such as:

- Genetics: Some individuals carry genetic mutations inherited from their parents,

increasing the risk of developing lung cancer

- Environmental Factors: Exposure to carcinogenic substances in the environment, including industrial chemicals, air pollution, and ultraviolet radiation, can elevate the risk

- Radiation Therapy to the Chest: Cancer survivors who underwent chest radiation

therapy are at a higher risk of developing lung cancer

- Diet: Scientists are studying various foods and dietary supplements to understand

their impact on the risk of lung cancer

1.4.2 Secondary Causes and Carcinogenic Agents

In addition to the primary causes, there are several secondary causes and carcinogenic

agents contributing to lung cancer [3] [4], including:

- Nicotine and Chemicals in Tobacco: Nicotine and other chemicals in tobacco are

major factors causing lung cancer and other smoking-related cancers


- Asbestos: A fire-resistant and insulating material used in construction; it can lead to lung cancer and other health issues when inhaled

- Radon: A natural gas emanating from the ground and rocks; it can increase the risk of lung cancer when it accumulates indoors

- Benzene: An industrial chemical found in oil products; it may cause blood cancer and bone marrow cancer

- Formaldehyde: Used in various industrial products and household items; it can contribute to cancers of the nose, throat, and lungs

- Hazardous Chemicals in Industrial Environments: Various hazardous substances

like vinyl chloride, chromium, arsenic, nickel, and others are linked to cancer risk

- Chemicals in Food: Some food additives and preservatives are also associated with

the risk of developing cancer

1.5 Understanding Cancer Stages and Associated Symptoms

1.5.1 Lung Cancer Stages

Cancer is generally categorized into four main stages [6] based on the extent of its spread.

Figure 3: The four stages of lung cancer progression [28]

- Stage 0 (Pre-cancer): At this stage, there are cell abnormalities, but they have not

developed into a tumor and have not spread to nearby organs


- Stage I (Local): The tumor is localized in the area of origin without spreading to distant locations. This is often a stage where cancer has a high chance of cure when detected early

- Stage II (Locally Advanced): Cancer starts to invade nearby structures and organs but is still within the area of origin

- Stage III (Regional): In lung cancer, this stage involves the lymph nodes in the chest

o Stage IIIA: Cancer occurs in the lymph nodes, but only on the same side of the chest where the cancer first started

o Stage IIIB: Cancer has spread to lymph nodes on the opposite side or above the collarbone

- Stage IV (Advanced or Metastatic): This is the stage where cancer has spread extensively, possibly metastasizing to other organs in the body

- Early Stage Symptoms: In the early stages, there may be no apparent symptoms. When present, they are often mild and easily overlooked; they may include slight swelling, discomfort, or fatigue. In the limited stage, cancer is found only in one lung or in nearby lymph nodes on the same side of the chest

- Advanced Stage Symptoms: Symptoms intensify and become more noticeable. Pain, swelling, weight loss, and fatigue may increase, and issues related to the affected organ's function may arise

- Widespread Cancer Symptoms: Symptoms become severe, and the cancer may spread to nearby areas: throughout one lung, to the opposite lung, to lymph nodes on the opposite side, to fluid around the lungs, to the bone marrow, or to distant organs. Patients may experience pain, weight loss, swelling, and noticeable organ dysfunction

- Cancer Metastasis Stage Symptoms: Cancer has metastasized to other organs, causing symptoms dependent on the affected organ. For example, metastasis to the bone may cause bone pain, while metastasis to the brain may result in vision changes, headaches, and mood alterations


2 Related Works

Lung cancer has become one of the most prevalent cancer types, especially following the emergence of the COVID-19 pandemic, which has increased risk and drawn particular concern to this issue. Research on medical image processing using Deep Learning (DL) began as early as 1995, primarily focusing on the classification of lung nodules in X-ray images [38].

Rajasekar V and colleagues [31] proposed various methods, including Convolutional Neural Network (CNN), CNN Gradient Descent (CNN GD), VGG-16, VGG-19, Inception V3, and ResNet-50, to predict lung cancer from CT scan images and histopathological images. Their analysis indicates that detection accuracy is superior when histopathological data is used. Conversely, Vasavi CH and Sruthi ND [32] employed a deep learning CNN model along with a machine learning SVM; after analyzing and selecting multiple models, they concluded that CNN is the most effective method. C. Lavanya and team [33] experimented with machine learning algorithms such as logistic regression, random forest, SVM, KNN, XGBoost, and AdaBoost to predict lung cancer based on novel biomarkers. The results revealed that the most accurate prediction model is random forest, followed by SVM and KNN, with logistic regression yielding the lowest accuracy.

A research article by Saqib Qamar and colleagues [34] on predicting TEM images using a hybrid deep learning and machine learning model is also a noteworthy effort. They extracted features from a CNN model and applied machine learning to classify these features. The results showed that the hybrid CNN–Random Forest model has a higher accuracy rate (73%) than other classifiers such as AdaBoost, XGBoost, and SVM.

Regarding the impact of COVID-19 on the lungs and increased cancer susceptibility, Talal S. Qaid and team [37] utilized chest X-ray images to predict lung cancer through hybrid deep learning (CNN, VGG16, VGG19) and machine learning (Naive Bayes, support vector machine, random forest, and XGBoost) models. The results demonstrated high prediction accuracy, with all models surpassing 90%.


3 Model Evaluation Metrics

In researching and evaluating model performance, our aim is to build an accurate and effective framework for the learning process, to understand how the model learns from data and evolves over time, and to measure its performance objectively and consistently. To that end, we undertook the following steps:

• Define Model: the process of determining the architecture, parameters, loss function, optimization method, and data preprocessing for a machine learning model. The purpose is to create a detailed design that allows the model to learn from the data and make accurate predictions.

• Accuracy and Loss Graphs

- Accuracy Graph: Measures the model's accuracy at each training step, indicating the percentage of correct predictions. Accuracy on the test set, in particular, is a crucial metric for evaluating the model's generalization to new data.

- Loss Graph: Illustrates the model's loss, providing insight into how the model is learning from the data. A reduction in loss signifies effective training progress.

• Classification Report and Confusion Matrix

Figure 4: Definition of Confusion Matrix [30]


- Confusion Matrix: A table representing the counts of correct and incorrect predictions for each class, showing True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). It helps explain how the model classifies data points and where classes are confused: the main diagonal elements represent the correct predictions for each class, while the other elements indicate mispredictions.

- Classification Report: A summarized table providing detailed information about the performance of a classification model. It includes Accuracy, Precision, Recall, F1-score, and Support for each class, helping assess the model's ability to classify each class and detect imbalances.

o Accuracy: The proportion of correct predictions over all samples.

Calculation: Number of correct predictions / Total number of samples

o Precision: The proportion of correct positive predictions in relation to the overall number of positive predictions.

Calculation: True Positives / (True Positives + False Positives)

o Recall: The proportion of correct positive predictions compared to the overall count of actual positive instances.

Calculation: True Positives / (True Positives + False Negatives)

o F1-score: A composite metric that takes into account both Precision and Recall, commonly employed when both aspects should be assessed simultaneously.

Calculation: 2 × (Precision × Recall) / (Precision + Recall)


o Support: The number of samples in each class. Support helps gauge the balance of samples across classes.

• 5-Fold Cross-Validation: A technique in machine learning to evaluate the performance of a model. The dataset is divided into 5 equal parts (folds); the model is trained on 4 folds and validated on the 5th. This procedure is repeated 5 times, each time using a different fold for validation. The final result combines the performance evaluations from the 5 validation rounds, providing a reliable estimate of the model's performance on the entire dataset.
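The metrics and the 5-fold procedure above are all available directly in scikit-learn. A small sketch on synthetic three-class data (a stand-in for Benign/Malignant/Normal, not the thesis dataset) shows how they fit together:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the three-class problem: 300 samples,
# 4 features, class means shifted so the classes are separable.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4)) + 2 * np.repeat(np.arange(3), 100)[:, None]
y = np.repeat([0, 1, 2], 100)  # 0=Benign, 1=Malignant, 2=Normal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns predicted classes; the diagonal
# holds the correct predictions.
print(confusion_matrix(y_te, y_pred))

# Per-class precision, recall, F1-score and support, plus accuracy.
print(classification_report(y_te, y_pred,
                            target_names=["Benign", "Malignant", "Normal"]))

# 5-fold cross-validation: five accuracy scores, one per held-out fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

For a classifier, `cross_val_score` uses stratified folds by default, matching the 5-fold scheme described above.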

Our model training process is divided into two main parts: 'Individual Model Evaluation' and 'Hybrid Model Evaluation'. These segments allow us to monitor how combining models influences their performance and predictive capabilities.

CHAPTER 3: EXPERIMENTS AND RESULTS

1 Dataset

1.1 Dataset Introduction

The "IQ-OTH/NCCD Lung Cancer Dataset" [19] was collected at the Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases (IQ-OTH/NCCD) over a period of three months during the fall of 2019. The dataset includes CT scan images from patients diagnosed with lung cancer at various stages, as well as images from healthy individuals. The CT scan images were meticulously annotated by medical experts.

1.2 Dataset Information

It comprises a total of 1,097 images representing 110 cases. Cases are categorized into three classes: Benign, Malignant, and Normal. Based on the diagnoses:

- 15 cases are Benign, with a total of 120 images in the dataset

- 40 cases are Malignant, with a total of 561 images in the dataset

- 55 cases are Normal, with a total of 416 images in the dataset


All images in the dataset are in JPG format with a resolution of 512x512 pixels.

Figure 5: Original images for the three categories: Benign, Malignant and Normal.

1.3 CT Scan details and slices

- The original CT scans were collected in DICOM format

- A Siemens SOMATOM scanner was used for image acquisition

- The CT protocol settings included 120 kV, a 1 mm slice thickness, and a window

width ranging from 350 to 1200 HU with a window center between 50 and 600 HU

- Scans were conducted with a breath-hold at full inspiration

- All images were de-identified before analysis

- Written consent was waived by the institutional review board, and the study was

approved by the institutional review board of participating medical centers

- Each scan in the dataset contains multiple slices

- The number of slices per scan varies from 80 to 200 slices

- Each slice represents an image of the human chest captured from different angles

and perspectives
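The window width and center settings quoted above determine how raw Hounsfield units (HU) are mapped to displayable grayscale. The function below is a generic sketch of CT windowing, not code from the thesis; the sample HU values and the chosen center/width are illustrative picks from the quoted ranges.

```python
import numpy as np

def apply_window(hu_image, center, width):
    """Map Hounsfield units to 8-bit grayscale using a window
    center/width, the standard way a CT slice is rendered for viewing."""
    lo, hi = center - width / 2.0, center + width / 2.0
    clipped = np.clip(hu_image, lo, hi)
    return ((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)

# Illustrative settings within the quoted ranges (width 350-1200 HU,
# center 50-600 HU): a soft-tissue style window of center 50, width 350.
hu = np.array([[-1000.0, 0.0], [50.0, 400.0]])  # air, water, tissue, dense
print(apply_window(hu, center=50, width=350))
```

Everything below `lo` renders black and everything above `hi` renders white, so a narrow window increases contrast within the tissue range of interest.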

1.4 Data Augmentation

In this Data Augmentation [20] [21] section, our primary objective is to augment the dataset for the 'Benign' category, which originally had a limited number of instances (only 120 files). The augmentation specifically targets the 'Benign' category to achieve a more balanced distribution of data among the three categories: while the 'Malignant' category contains 561 files and the 'Normal' category has 416 files, the 'Benign' category was initially underrepresented with only 120 files. The entire Data Augmentation process is therefore designed to augment the 'Benign' category and address this imbalance.

Applied Data Augmentation Method

Geometric Transformations [22] are among the most suitable methods for augmenting lung CT-scan image data. This method applies geometric transformations such as flipping, cropping, rotating, or translating images, introducing diversity into the dataset without altering the content of the images. The transformations used include:

1. Cropping: Cropping a portion of an image simulates the removal of unimportant areas, helping the machine learning model focus on the crucial parts of the images.

2. Flipping: Flipping images horizontally or vertically creates mirrored images, helping the machine learning model generalize better to image variations.

After the Data Augmentation process, we obtained a total of 1,385 files, including 408 Benign, 561 Malignant, and 416 Normal cases. We preserved both the original and the transformed images to gain insight into the alterations made.

Validating the Data Augmentation outcomes confirms that the new image variations were created precisely and systematically. This validation includes verifying that the images are stored properly and remain logically linked to their originals.

Figure 6: Image after Data Augmentation (original vs. horizontal flip).
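The flip and crop operations described above can be sketched with NumPy. The crop margin and the set of returned variants are illustrative choices, not the exact parameters used in this thesis:

```python
import numpy as np

def augment_benign(image: np.ndarray) -> list:
    """Generate flipped and cropped variants of a single grayscale CT slice.

    Mirrors the geometric transformations described above (flipping and
    cropping); the 10% crop margin is a hypothetical choice.
    """
    h, w = image.shape
    margin_h, margin_w = h // 10, w // 10
    variants = [
        np.fliplr(image),                                     # horizontal flip
        np.flipud(image),                                     # vertical flip
        image[margin_h:h - margin_h, margin_w:w - margin_w],  # center crop
    ]
    return variants

# Example on a dummy 512x512 slice (the dataset images are 512x512 JPGs).
slice_ = np.arange(512 * 512, dtype=np.float32).reshape(512, 512)
aug = augment_benign(slice_)
print(len(aug), aug[0].shape, aug[2].shape)  # 3 (512, 512) (410, 410)
```

In the real pipeline each variant would be written back to the 'Benign' folder alongside its original, which is how the class grows from 120 to 408 files.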

1.5 Train-Test Split


We chose a dataset split ratio of 80% to 20%, where the training set contains 1,108 files and the test set contains 277 files. This choice ensures that the machine learning model is trained and tested effectively across all categories, which is especially important when dealing with the imbalance in the initial dataset.

The training and testing sets are two crucial parts of developing and evaluating a machine learning or deep learning model. The training set helps the model learn to classify or predict, while the test set measures the model's capabilities. Together, they ensure that the model is trained and evaluated accurately and objectively.

Procedure and Method for Train-Test Split

We employ the `train_test_split` function from the scikit-learn library to divide the data in an 80% - 20% ratio for the training set and test set, respectively.

- Step 1: All images are resized to a uniform size of 128x128 pixels to ensure consistency for the model

- Step 2: The resized data is divided into a training set and a test set following an 80%-20% ratio

- Step 3: The `train_test_split` function from the scikit-learn library is used to randomly partition the data. Employing a fixed random seed guarantees result reproducibility and aids in monitoring and debugging the training and testing process.
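The three steps can be sketched with scikit-learn. The random arrays below stand in for the resized images, and `random_state=42` is an illustrative seed choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 1,385 augmented images; in the real pipeline each image
# is first resized to 128x128 pixels (Step 1).
rng = np.random.default_rng(42)
X = rng.random((1385, 128, 128), dtype=np.float32)
y = np.array([0] * 408 + [1] * 561 + [2] * 416)  # Benign / Malignant / Normal

# Steps 2-3: 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(X_train), len(X_test))  # 1108 277
```

The resulting sizes (1,108 training files, 277 test files) match the counts reported above; `stratify=y` is an assumption that keeps the class ratios similar in both sets.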

Table 1: Data Summary Table

Classes     Before Data Augmentation   After Data Augmentation   Train Set   Test Set
Benign      120                        408                       316         92
Malignant   561                        561                       469         92
Normal      416                        416                       323         93
Total       1097                       1385                      1108        277

2 Model Architecture

This section presents a diagram illustrating the detailed flow of our model. In the first step, the lung dataset is taken as input and undergoes preprocessing, such as augmentation and resizing. In the second stage, the data is split into training and testing sets using the train_test_split function from the scikit-learn library, randomly partitioning the data into an 80% and 20% ratio. In the third stage, machine learning algorithms, along with deep learning methods, are employed, with hyperparameters optimized using the Randomized Search CV method. We applied the machine learning methods Random Forest, Logistic Regression, and Support Vector Machine. Deep learning methods, including CNN, ResNet50, and VGG16, were also utilized.

[Diagram: input CT images → preprocessing → 80/20 train-test split → Randomized Search CV over hyperparameters (learning rate, epochs, batch size, optimizer, layers, neurons, filters, kernel size, pooling, dropout, activation function, early stopping; solver for Logistic Regression; n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf, bootstrap for Random Forest) → models (CNN 3 Layers, CNN 4 Layers, RESNET50, VGG16, Logistic Regression, Random Forest, SVM) → Classification (Benign, Malignant, Normal)]


In this research, we have developed a unique integrated pipeline, combining the strengths of both machine learning and deep learning models to predict lung cancer accurately and efficiently. This process represents a sophisticated blend of the complex feature extraction of deep learning and the hyperparameter optimization of machine learning. We begin with the input data, which consists of CT-scan images, apply preprocessing such as augmentation and resizing, and split the data into training and testing sets with an 80% to 20% ratio. In the subsequent stage, deep learning algorithms are employed, with hyperparameters optimized using the Randomized Search CV method, and features are extracted. These extracted features serve as inputs for the machine learning algorithms, whose hyperparameters are also optimized with Randomized Search CV. The output is the classification result, evaluated using various performance metrics.
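Randomized Search CV corresponds to scikit-learn's `RandomizedSearchCV`. A minimal sketch on toy data follows; the reduced search space and the toy feature matrix are assumptions for illustration, not the thesis's full grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy feature matrix standing in for features extracted by a deep model.
rng = np.random.default_rng(0)
X = rng.random((120, 16))
y = rng.integers(0, 3, size=120)  # three classes: Benign / Malignant / Normal

# Illustrative search space (the real search would include larger values,
# e.g. n_estimators up to 2000).
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": [10, 30, 50],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3],
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(sorted(search.best_params_))
```

`n_iter` controls how many random parameter combinations are tried, which is what makes randomized search cheaper than an exhaustive grid search over the same space.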

[Diagram: input CT images → Data Augmentation and image resampling → train-test split → deep learning models (CNN 3 Layers, CNN 4 Layers, RESNET50, VGG16) tuned with Randomized Search CV → feature extraction → machine learning classifiers (Logistic Regression, Random Forest, SVM) tuned with Randomized Search CV → Classification (Benign, Malignant, Normal)]


2.1 Convolutional Neural Networks (CNN)

Convolutional Neural Network (CNN) [8] [9] [10] is a specialized type of neural network commonly used for image processing. CNNs can autonomously learn hierarchical features from data, making them effective at pattern recognition in images. With the capability to learn important features from image data, CNNs are an ideal model for classifying medical images produced by imaging techniques.

The CNN model is designed for efficient feature extraction and learning from image data. The model is constructed with 4 convolutional layers (3 convolutional layers in the CNN 3 Layers variant) and 6 fully connected layers. The ReLU activation function is applied after each convolutional layer to increase the non-linearity of the model, retaining positive values and discarding negative values. In each convolutional layer, the number of filters is set sequentially to 32, 64, 128, and 256 to detect the various features present in the image. All filters have a kernel size of 3x3; these kernels move across the entire image to perform convolution, identifying small features within the image space. Convolution means that as the kernels move to each position in the image, they multiply the corresponding elements, and the sum of these multiplications forms the feature map.

To reduce the spatial size of the feature maps and simplify the model's computational complexity, we use max-pooling layers with a size of 2x2. The max-pooling operation takes the maximum value from each group of values in the scanning area, retaining the prominent features of the scanned region.

The 6 fully connected layers include 5 hidden layers with 256, 128, 64, and 32 neurons, creating a gradually narrowing path that preserves important features. The final layer is an output layer with 3 classes; it uses the softmax activation function to transform the output into a probability distribution.

These details form a complete CNN architecture, capable of learning and extracting complex features from image data. The use of a Convolutional Neural Network (CNN) is an ideal choice for predicting lung cancer: CNNs specialize in processing medical images and automate the learning process from data. The ability of CNNs to extract features and synthesize spatial information makes them a powerful tool, with positive potential in research and application for lung cancer prediction.
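A hedged Keras sketch of the described architecture (filters 32/64/128/256, 3x3 kernels, 2x2 max pooling, narrowing dense layers, 3-way softmax). The `same` padding, the Adam optimizer, and the use of four dense hidden layers (the text lists four widths for five hidden layers) are our assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(128, 128, 1), num_classes=3):
    """CNN 4 Layers variant: four conv blocks, then narrowing dense layers."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(256, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # probability output
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
print(model.output_shape)  # (None, 3)
```

The CNN 3 Layers variant would simply drop the final Conv2D/MaxPooling2D pair.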

Figure 9: CNN Architecture [23]

2.2 VGG16 (Visual Geometry Group 16)

VGG16 [11], a Convolutional Neural Network (CNN) model developed by the Visual Geometry Group at the University of Oxford, is recognized in computer vision for its deep architecture. Comprising 16 layers, including convolutional and fully connected layers, VGG16 is known for its simplicity and effectiveness in image classification. Achieving high accuracy, particularly when trained on the ImageNet dataset, VGG16 excels at learning and extracting complex features from image data.

Figure 10: VGG16 Architecture [24]

VGG16 consists of a total of 16 layers stacked on top of each other: 13 convolutional layers and 3 fully connected (dense) layers. VGG16 is characterized by the use of multiple convolutional blocks, each with a similar structure of convolutional layers followed by a max-pooling layer. The first two blocks each employ two convolutional layers, with 64 and 128 filters respectively; max pooling uses a 2x2 kernel with a stride of 2, and the ReLU activation function provides non-linearity. The subsequent three blocks each contain three convolutional layers, with filters set sequentially to 256, 512, and 512, again followed by 2x2 max pooling with a stride of 2 and ReLU activation.

Fully connected layers connect the learned features into a classification model. The convolutional layers are numbered Conv1 to Conv13, the pooling layers Pool1 to Pool5, and the fully connected layers FC1 to FC3. Fully Connected 1 and Fully Connected 2 have 4096 neurons each, while Fully Connected 3 (the output layer) has 3 neurons, corresponding to the number of classes in the dataset. VGG16 has a large number of parameters, especially due to the use of 3x3 filters in multiple layers, totaling up to 138 million parameters [12]. VGG16 is often trained on the ImageNet dataset, where it achieves high performance and stands out as one of the exemplary models in computer vision.

Advantages of VGG16 include its simple and understandable architecture, capable of learning complex features from images. However, its large parameter count increases the model size and demands significant computational resources.
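A sketch of adapting VGG16 to the three-class task with Keras. Using `weights=None` (so the example runs offline; `weights="imagenet"` would load the pretrained base) and freezing the convolutional base are assumptions; the thesis may train the network differently:

```python
from tensorflow import keras
from tensorflow.keras import layers

# VGG16 backbone without its original 1000-class ImageNet head.
base = keras.applications.VGG16(
    weights=None,               # assumption; "imagenet" when a download is possible
    include_top=False,
    input_shape=(128, 128, 3))
base.trainable = False          # assumption: freeze the convolutional blocks

model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),   # FC1
    layers.Dense(4096, activation="relu"),   # FC2
    layers.Dense(3, activation="softmax"),   # FC3: Benign / Malignant / Normal
])
print(model.output_shape)  # (None, 3)
```

Replacing the 1000-neuron ImageNet output layer with a 3-neuron softmax is what maps the architecture onto this dataset's classes.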

Figure 11: Summary of VGG16 Model [25]

With its deep architecture and efficiency at learning and extracting features from image data, VGG16 is an ideal choice for our lung cancer prediction project. Its simple design and high accuracy, especially after being trained on the ImageNet dataset, make VGG16 stand out in the field of medical image classification. The flexibility of VGG16 and its ability to synthesize spatial information make it a powerful tool for addressing the problem of lung cancer prediction.

2.3 RESNET50

ResNet50 [13] [14], a potent Convolutional Neural Network (CNN) developed by Microsoft Research, is widely applied in computer vision. Its residual block structure addresses the challenges of training deep neural networks. With 50 layers divided into 16 residual blocks, ResNet50 excels at image classification across various classes, demonstrating expertise in learning complex features.

Figure 12: RESNET50 Architecture [26]

The basic building block in ResNet50 is the residual block, which comprises two main paths: the identity path and the shortcut path. The identity path represents the direct mapping from input to output, while the shortcut path allows gradients to flow faster through multiple layers. Each residual block in ResNet50 typically uses a bottleneck architecture to reduce computational complexity. The bottleneck architecture includes three main layers: a 1x1 convolution, a 3x3 convolution, and another 1x1 convolution; the 1x1 convolutions reduce and then restore the size of the feature maps, optimizing the number of parameters. ResNet50 is built by stacking multiple residual blocks on top of each other, which helps the model learn hierarchical features from simple to complex and capture deep, intricate representations of the input data. Instead of a fully connected layer at the end, ResNet50 employs Global Average Pooling (GAP), which computes the average value of each feature map to create a fixed-size vector; GAP reduces the number of parameters and acts as a regularization technique. The last layer of ResNet50 is a fully connected layer with softmax activation, responsible for producing a probability distribution across the output classes.

The adaptability of ResNet50 in handling images of different sizes and its capability to learn deep features make it a crucial and flexible tool for both research and practical applications in our lung cancer prediction project. With 50 layers, ResNet50 not only excels at multi-class image classification but also demonstrates efficiency in learning complex features.
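The GAP-plus-softmax head described above can be sketched in Keras. `weights=None` (to keep the sketch offline) and the 128x128 input size are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# ResNet50 backbone without its original classification head.
base = keras.applications.ResNet50(
    weights=None, include_top=False, input_shape=(128, 128, 3))

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),   # one average per feature map → fixed-size vector
    layers.Dense(3, activation="softmax"),  # Benign / Malignant / Normal
])
print(model.output_shape)  # (None, 3)
```

GlobalAveragePooling2D is what makes the head size independent of the spatial dimensions of the last feature maps, which is why ResNet50 adapts easily to different input sizes.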

2.4 Logistic Regression

Logistic Regression (LR) is a machine learning algorithm used to predict the probability of an event occurring, commonly applied in classification tasks. Unlike linear regression, Logistic Regression uses the sigmoid function to generate probabilities, making it effective for binary classification. The term 'Logistic' originates from the S-shaped sigmoid curve, which maps the output to a range between 0 and 1, representing the probability of the event occurring. Logistic Regression is a simple and versatile tool widely used for predicting customer purchases, classifying images, and forecasting disease likelihood in the medical field.

Figure 13: Logistic Regression Architecture [27]

Logistic Regression [15] is a crucial algorithm in machine learning, commonly employed for classification tasks. It predicts the class of a new data sample based on its features and uses the sigmoid function to transform its output into probabilities. Regularization is applied, specifically 'l2' regularization, which adds the squared magnitude of the weights to the loss function; this prevents unnecessary complexity and keeps the model stable.

The regularization strength, denoted 'C' and set to 0.01 in this context, determines the influence of regularization during training. A low value such as 0.01 indicates strong regularization, helping to control the model's complexity. The Newton-Conjugate Gradient optimization algorithm is chosen as the solver to optimize the loss function and find suitable weights for the model. Together, these elements contribute to Logistic Regression's effectiveness, ensuring stable and well-controlled learning during training.
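The stated configuration maps directly onto scikit-learn; the toy feature matrix below is a stand-in for the real feature vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features standing in for image-derived feature vectors.
rng = np.random.default_rng(0)
X = rng.random((150, 20))
y = rng.integers(0, 3, size=150)

# Hyperparameters as stated in the text: l2 penalty, C=0.01 (strong
# regularization), and the Newton-CG solver.
clf = LogisticRegression(penalty="l2", C=0.01, solver="newton-cg", max_iter=1000)
clf.fit(X, y)
proba = clf.predict_proba(X[:1])
print(proba.shape)  # (1, 3) — one probability per class, summing to 1
```

In scikit-learn, C is the inverse of regularization strength, so the low value 0.01 does indeed mean strong regularization, as the text states.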

Logistic Regression is specifically designed to tackle classification problems, highlighting its simplicity and wide applicability. Its ability to perform well on various types of data and its feasibility for the lung cancer prediction problem make Logistic Regression a powerful and useful tool in our research.

2.5 Random Forest

Random Forest [16] is a machine learning model belonging to the Ensemble Learning family, built on the principles of Decision Trees. Rather than relying on a single decision tree, Random Forest combines multiple decision trees to create a strong and diverse model. Each decision tree in the Random Forest is constructed on a randomly sampled subset of the training data, ensuring independence between the trees and preventing overfitting. The final decision of the Random Forest is determined through a voting process, where the outcome is chosen by the majority of all the trees.

After searching for the optimal parameters, we use 2,000 trees to enhance diversity and accommodate different patterns within the data. Choosing 'sqrt' (the square root of the total number of features) as the feature limit for building trees reduces the risk of overfitting and encourages diversity. The maximum depth of each tree is limited to 50 to control the model's complexity. The minimum number of samples required to split a node is set to 2 to ensure stability in the splitting process, and the minimum number of samples at each leaf node is set to 3 to control the size of the trees and avoid overly small leaf nodes. Bootstrap sampling is set to False, meaning the entire dataset is used for each tree without random sampling. Random Forest excels at synthesizing information and handling noise, making it a popular tool for various classification and prediction applications.
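The stated configuration in scikit-learn terms. `n_estimators` is reduced from 2000 to 200 here only so the sketch runs quickly, and the toy data is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the real feature vectors.
rng = np.random.default_rng(0)
X = rng.random((120, 20))
y = rng.integers(0, 3, size=120)

# Parameters as stated in the text (n_estimators reduced for speed).
clf = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", max_depth=50,
    min_samples_split=2, min_samples_leaf=3, bootstrap=False,
    random_state=0)
clf.fit(X, y)
print(len(clf.estimators_))  # 200 fitted trees
```

With `bootstrap=False`, diversity between the trees comes only from the random feature subsets chosen at each split, since every tree sees the full training set.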

Random Forest, with its flexibility and ability to handle large datasets, is a suitable choice for our lung cancer prediction project. Its capacity to combine multiple decision trees makes it effective at processing diverse data and learning complex models. In particular, its stability and reduced risk of overfitting make Random Forest a reliable tool for predicting and classifying lung cancer conditions.

2.6 Support Vector Machine (SVM)

The Support Vector Machine (SVM) [17] [18] is a critical model in machine learning, widely applied to classification and regression tasks. SVM determines the optimal decision hyperplane between data groups, relying on the support vectors near the decision boundary. These support vectors shape the hyperplane, enhancing generalizability and reducing computational cost. The kernel, a vital SVM technique, maps data to a higher-dimensional space where linear classification becomes possible. SVM aims to maximize the margin, the distance between the hyperplane and the nearest data points of both classes.


After finding the best parameters, we chose the 'linear' kernel, suitable for linearly separable data, meaning the decision boundary is a straight line. The gamma parameter is set to 1000, controlling the influence of each training example; a high value leads to a narrow decision boundary. The regularization parameter C is set to 0.01, balancing a smooth decision boundary against the accurate classification of training points; a low C value indicates strong regularization, prioritizing a larger margin. With its ability to create complex decision boundaries and its high accuracy, SVM is an ideal choice for our lung cancer prediction project. Its consistency and good performance on diverse datasets make SVM a powerful tool for classifying and predicting lung cancer status.
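The stated configuration in scikit-learn terms, on toy data. Note that scikit-learn ignores `gamma` for the linear kernel; it matters only for the RBF, polynomial, and sigmoid kernels:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data standing in for the real feature vectors.
rng = np.random.default_rng(0)
X = rng.random((120, 20))
y = rng.integers(0, 3, size=120)

# kernel='linear', gamma=1000, C=0.01 as stated in the text.
clf = SVC(kernel="linear", gamma=1000, C=0.01)
clf.fit(X, y)
pred = clf.predict(X[:5])
print(pred.shape)  # (5,)
```

For the three-class problem, SVC handles multi-class classification internally via a one-vs-one scheme.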

Figure 15: SVM Architecture [36]

2.7 Hybrid Deep Learning and Machine Learning Models

In machine learning, data classification is of paramount importance, and feature extraction plays a crucial role in this task. Deep learning models such as CNN, ResNet50, and VGG16 have demonstrated outstanding feature-extraction capabilities on images. The proposed model therefore combines deep learning and machine learning to harness the strengths of both fields.

The input to the model is the image to be predicted. Deep learning models (CNN, ResNet50, and VGG16) extract features from the input image, and the feature maps after the flatten layer are used as input data for machine learning. The feature data from deep learning is then fed into machine learning models (Logistic Regression, SVM, and Random Forest) to perform classification. The output is the classification result for the input image.


Building on the proficiency of deep learning (CNN, ResNet50, VGG16) in extracting complex features from images, the model enhances the performance of the classification process. These features, transferred to machine learning, create a robust and accurate multi-class classification system.
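A minimal sketch of the hybrid idea: a CNN truncated at its flatten layer supplies features to a classical classifier. The tiny CNN, the random data, and the choice of Logistic Regression as the downstream model are illustrative only:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.linear_model import LogisticRegression

# Deliberately small untrained CNN ending at a flatten layer; in the real
# pipeline this would be the trained CNN / ResNet50 / VGG16 backbone.
cnn = keras.Sequential([
    keras.Input(shape=(128, 128, 1)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.MaxPooling2D(4),
    layers.Flatten(name="flatten"),
])

rng = np.random.default_rng(0)
images = rng.random((60, 128, 128, 1), dtype=np.float32)
labels = rng.integers(0, 3, size=60)

# Step 1: deep model extracts feature vectors (output of the flatten layer).
features = cnn.predict(images, verbose=0)
# Step 2: a classical classifier performs the final 3-class decision.
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(features.shape[0], clf.predict(features[:1]).shape)
```

The same `features` array could be fed to SVM or Random Forest instead; only step 2 changes between the hybrid variants reported in the thesis.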

3 Software Requirements

System and Runtime Environment

- Operating System: Windows 10 Pro 64-bit

- System Model: MS-7D46

- BIOS Version: 2.10

- Processor: 12th Gen Intel(R) Core(TM) i5-12400F (12 CPUs), ~2.5GHz

- Memory: 16384MB RAM

- DirectX Version: DirectX 12

- Graphics Card: NVIDIA GeForce RTX 2060

Libraries and Frameworks

- Programming Language: Python

- Main Libraries: NumPy, Pandas, Glob, Pickle, Joblib, Scikit-image, OpenCV,

Matplotlib, Scikit-learn, TensorFlow, Seaborn


4 Model Evaluation Results

4.1 Training Model Configuration

Table 2: The model training parameters

Parameter              Value           Description
Input Image            128x128x1       The model takes images of size 128x128 with one grayscale channel
Epoch                  20              To learn complex representations from both the training and testing data
Learning Rate          1e-4 (0.0001)   Controls the step size during optimization, aiding convergence and stability of the model
Batch Size             35              For each weight update, the model uses 35 images during training
Number of Iterations   100             The machine learning process consists of 100 iterations
