VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF INFORMATION SYSTEMS

Nguyen Thanh Truc - 19522417
Tran Thi Cam Tu - 19522458

Lung Cancer Prediction Using Machine Learning

HO CHI MINH CITY, DECEMBER 2023
ASSESSMENT COMMITTEE

The Assessment Committee is established under Decision No. ..., dated ..., by the Rector of the University of Information Technology:

1. Assoc. Prof. Nguyen Dinh Thuan - Chairman
2. Dr. Ngo Duc Thanh - Secretary
3. Dr. Nguyen Thanh Binh - Member
ACKNOWLEDGMENTS

We would like to sincerely express our gratitude to the University of Information Technology for creating favorable learning conditions and providing crucial resources throughout our academic journey. Our deep appreciation goes to the members of the academic board, the teachers, and the mentors from the Faculty of Information Systems and other departments, who played a vital role in shaping our understanding and supporting us through these years.

We would like to express our profound gratitude to Dr. Cao Thi Nhan and MSc. Nguyen Thi Kim Phung for their exceptional guidance and unwavering support throughout the entire process of crafting this thesis. Their expertise, encouragement, and constructive feedback have been invaluable assets that significantly contributed to the quality and depth of our research.

We want to convey our sincere dedication and unwavering effort during the thesis-writing process. Though the journey was not without difficulties and challenges, we endeavored to overcome each obstacle with perseverance and enthusiasm. We hope that the outcome of the thesis reflects our dedication and commitment. At the same time, we ask for our mentors' understanding and compassion if any shortcomings are identified in the final product. We sincerely thank our mentors for the support and guidance they provided throughout this journey, and we hope it will be evident in the final achievement of the thesis.

We sincerely appreciate it!
Nguyen Thanh Truc Tran Thi Cam Tu
ADVISORS' COMMENTS
REVIEWER'S COMMENTS
4. Report Outline
CHAPTER 2: BACKGROUND AND THEORY
    1. Basic Knowledge of Lung Cancer
    2. Related Works
    3. Model Evaluation Metrics
CHAPTER 3: EXPERIMENTS AND RESULTS
    1. Dataset
    2. Model Architecture
        2.1 Convolutional Neural Networks (CNN)
        2.2 VGG16 (Visual Geometry Group 16)
        2.3 RESNET50
        2.4 Logistic Regression
        2.5 Random Forest
        2.6 Support Vector Machine (SVM)
        2.7 Hybrid Deep Learning and Machine Learning Models
    3. System Configuration
    4. Model Evaluation Results
        4.1 Training Model Configuration
        4.2 Deep Learning Models
        4.3 Machine Learning Models
    5. Hybrid Models
        5.1 Hybrid CNN 3 Layers - Machine Learning
        5.2 Hybrid CNN 4 Layers - Machine Learning
        5.3 Hybrid RESNET50 - Machine Learning
        5.4 Hybrid VGG16 - Machine Learning
    6. Summary of Model Training Results
        6.1 Deep Learning Models Results
        6.2 Machine Learning Models Results
        6.3 Hybrid Models Results
CHAPTER 4: CONCLUSIONS
REFERENCES
LIST OF ACRONYMS AND ABBREVIATIONS
No Acronyms Meaning
1 CNN Convolutional Neural Network
2 VGG16 Visual Geometry Group 16
3 RF Random Forest
4 LR Logistic Regression
5 SVM Support Vector Machine
LIST OF FIGURES

The four stages of lung cancer progression [28]
Definition of Confusion Matrix [30]
Original images for the three categories: Benign, Malignant and Normal
Image after being Data Augmentation
Integrated Pipeline for Lung Cancer Prediction of Individual Models
Integrated Pipeline for Lung Cancer Prediction of Hybrid Models
CNN Architecture [23]
VGG16 Architecture
RESNET50 Architecture [26]
Logistic Regression Architecture [27]
Random Forest Architecture
SVM Architecture
Hybrid model CNN with ML Architecture [37]
Accuracy and Loss Per Epoch of CNN 3 Layers
Figure 18: Confusion Matrix of CNN 3 Layers (Test Set)
Figure 19: Accuracy and Loss Per Epoch of CNN 4 Layers
Figure 20: Confusion Matrix of CNN 4 Layers (Test Set)
Figure 21: Accuracy and Loss Per Epoch of RESNET50
Figure 22: Confusion Matrix of RESNET50 (Test Set)
Figure 23: Accuracy and Loss Per Epoch of VGG16
Figure 24: Confusion Matrix of VGG16 (Test Set)
Figure 25: Confusion Matrix of Random Forest (Test Set)
Figure 26: Confusion Matrix of Logistic Regression Model
Figure 27: Confusion Matrix of Support Vector Machine Model
Figure 28: Confusion Matrix of CNN 3 Layers - Random Forest
Figure 29: Confusion Matrix of CNN 3 Layers - Logistic Regression
Figure 30: Confusion Matrix of CNN 3 Layers - SVM
Figure 31: Confusion Matrix of CNN 4 Layers - Random Forest
Figure 32: Confusion Matrix of CNN 4 Layers - Logistic Regression
Figure 33: Confusion Matrix of CNN 4 Layers - SVM
Figure 34: Confusion Matrix of RESNET50 - Random Forest
Figure 35: Confusion Matrix of RESNET50 - Logistic Regression
Figure 36: Confusion Matrix of RESNET50 - SVM
Figure 37: Confusion Matrix of VGG16 - Random Forest
Figure 38: Confusion Matrix of VGG16 - Logistic Regression
Figure 39: Confusion Matrix of VGG16 - SVM
LIST OF TABLES

Table 1: Data Summary Table
Table 2: The model training parameters
Table 3: Classification Report for CNN 3 Layers Model (Test Set)
Table 4: Classification Report for CNN 4 Layers Model (Test Set)
Table 5: Classification Report for RESNET50 Model (Test Set)
Table 6: Classification Report for VGG16 Model (Test Set)
Table 7: Classification Report for Random Forest Model
Table 8: Classification Report for Logistic Regression Model
Table 9: Classification Report for SVM Model
Table 10: Classification Report for hybrid model CNN 3 Layers - Random Forest
Table 11: Classification Report for hybrid model CNN 3 Layers - Logistic Regression
Table 12: Classification Report for hybrid model CNN 3 Layers - SVM
Table 13: Classification Report for hybrid model CNN 4 Layers - Random Forest
Table 14: Classification Report for hybrid model CNN 4 Layers - Logistic Regression
Table 15: Classification Report for hybrid model CNN 4 Layers - SVM
Table 16: Classification Report for hybrid model RESNET50 - Random Forest
Table 17: Classification Report for hybrid model RESNET50 - Logistic Regression
Table 18: Classification Report for hybrid model RESNET50 - SVM
Table 19: Classification Report for hybrid model VGG16 - Random Forest
Table 20: Classification Report for hybrid model VGG16 - Logistic Regression
Table 21: Classification Report for hybrid model VGG16 - SVM
Table 22: Summary Results of Deep Learning
Table 23: Summary Results of Machine Learning
Table 24: Summary Results of Hybrid Models
CHAPTER 1: INTRODUCTION

1. General Introduction
In an era brimming with the excitement of the Industry 4.0 revolution, rapid advancements in computer science and artificial intelligence have unlocked vast potential for applying technology to all aspects of life, particularly healthcare. By harnessing the power of big data, rapid information processing, and machine learning capabilities, the healthcare industry has paved the way for transforming how we understand and manage our health.
One of the most significant and pressing challenges the healthcare industry faces today is accurately diagnosing and detecting diseases, especially cancer. Machine learning has become a crucial tool to support physicians and healthcare experts. Its application in healthcare brings numerous benefits, including the efficient processing of large volumes of data, the detection of intricate features that may be challenging for humans to recognize, and swift decision support. However, to ensure safety and reliability, the development of disease prediction systems must always be conducted under the supervision and control of healthcare experts. Such systems not only save time but also present crucial opportunities to enhance the quality of healthcare and increase survival chances for those afflicted.
With this goal in mind, research into the combination of Convolutional Neural Networks (CNN) and traditional machine learning algorithms becomes essential. This convergence not only opens up a new realm of research but also holds the promise of contributing significantly to improving accuracy and efficiency in the diagnostic process. Lung cancer, particularly in the aftermath of the COVID-19 pandemic, has emerged as a top research priority. Integrating the power of CNNs with traditional machine learning algorithms promises to optimize the prediction and diagnosis process, introducing new prospects for enhancing treatment efficacy and increasing survival opportunities for those affected.
2. The Rationale for Choosing the Topic
The decision to choose the topic "Lung Cancer Prediction Using Convolutional Neural Networks and Machine Learning Algorithms" stems from a profound understanding of the importance of researching and applying technology in healthcare. Lung cancer, one of the most daunting challenges in modern healthcare, poses an increasingly significant problem for early diagnosis and effective treatment.
The choice to use Convolutional Neural Networks (CNN) and traditional machine learning algorithms not only harnesses the power of artificial intelligence in processing and analyzing medical images but also represents a groundbreaking step in medical technology. It opens vast prospects, promising a substantial improvement in the diagnosis of diseases and thereby enhancing the ability to treat and improve the quality of life for those affected.

In particular, this project is not just an opportunity to develop in-depth skills and knowledge in machine learning and artificial intelligence, but also part of a global mission to enhance the diagnosis and treatment of cancer, a pressing global health issue. The integration of technology and healthcare not only explores new research areas but also actively contributes to a larger mission: protecting and improving the health of the global community.
Furthermore, the project aims at practical and humane goals, including aiding patients in early disease detection, optimizing the treatment process, and providing comprehensive information. If successfully implemented, the predictive system developed in this project could become a valuable tool, assisting patients and the healthcare community in raising awareness about their health status.

The fusion of predictive technology and healthcare processes will bring significant benefits to patients, enabling them to be more proactive in managing their personal health. In doing so, we not only provide diagnostic solutions but also open opportunities to create a positive healthcare environment, where individuals can easily access information and comprehensive healthcare, empowering them to make informed decisions about maintaining their health.
3. Objective and Scope
We build deep learning models and machine learning models for lung cancer prediction. The goal is not only to provide predictions but also to improve healthcare quality through accurate diagnostic decisions.
3.1 Methodology
e Building Models:
- Develop individual deep learning models such as CNN, RESNET50, and VGG16, and machine learning models such as Random Forest, Logistic Regression, and Support Vector Machine.
- Utilize CNN models such as VGG16 and RESNET50 in conjunction with machine learning algorithms like SVM, LR, and RF.
- Input data will consist of lung CT scan images, and output data will be the classification results (Benign, Malignant, or Normal).
e Performance Evaluation: Employ metrics such as precision, recall, F1-score, and the confusion matrix, and conduct 5-fold cross-validation to assess model performance.
3.2 Comparison and Evaluation
e Model Performance Evaluation: Compare and evaluate the performance of the individual and hybrid models to provide insights into their strengths and weaknesses.
4 Report Outline
Chapter 1: Introduction
- Introduce the topic, reasons for choosing the research topic, and the research
objectives
Chapter 2: Background And Theory
- Introduce lung cancer and the causes leading to it.
- Review the literature on relevant articles.
- Introduce the evaluation metrics used for the models.
Chapter 3: Experiments And Results
- Provide an overview of the dataset, data augmentation techniques
- Explain the process of splitting the data into training and testing sets
- Introduction and presentation of the methods for constructing deep learning,
machine learning, and hybrid models
- Present the configuration of the computer used in the study
- List the parameters used in training the models
- Evaluate and provide feedback on the training model results through model
evaluation metrics
Chapter 4: Conclusion
- Present the results achieved during the research
- Identify limitations and propose directions for future development
CHAPTER 2: BACKGROUND AND THEORY
1 Basic Knowledge of Lung Cancer
1.1 What is Lung Cancer?
Lung cancer [1] is a form of cancer that originates from cells in the lungs. The disease is often closely associated with smoking, although cases also occur in non-smokers. Normal cells in the lungs undergo abnormal transformations and uncontrolled growth, forming tumors.

Symptoms of lung cancer typically become apparent in later stages, including a persistent cough, difficulty breathing, chest pain, fatigue, and sudden weight loss. Mild stages of the disease often exhibit few symptoms, while advanced stages may spread to surrounding lung areas and other organs.

Various factors contribute to the development of lung cancer, with smoking being a major factor. The chemicals in tobacco are known to be one of the primary causes. Additionally, exposure to toxic substances such as radon, asbestos, and pollutants in the workplace can increase the risk of developing the disease.
Smoking can elevate the risk of lung cancer, and prolonged exposure to other carcinogenic substances also plays a significant role in the disease's progression.
Figure 1: Lung Cancer [29]
1.2 What are Malignant Lung Tumors?
A malignant lung tumor is characterized by the uncontrolled growth of abnormal cells, with the ability to invade surrounding tissues and the potential to form distant lesions (metastasis). Cells within such a tumor often carry genetic alterations, exhibiting increasing irregularity in their interactions and organization. Malignant tumors develop rapidly and can spread quickly, making comprehensive treatment challenging, with a high recurrence rate if detected late.
1.3 What are Benign Lung Tumors?
If the cells within a tumor are normal, it is considered benign. A benign lung tumor lacks the ability to invade surrounding tissues, typically exhibits uniform density, and does not spread to other areas of the body. Cells within benign tumors usually maintain better control over growth, with no significant genetic alterations.
1.4 Causative Agents of Lung Cancer
1.4.1 Primary Causes of Lung Cancer
Cigarette smoking is the primary risk factor for lung cancer. According to the Centers for Disease Control and Prevention (CDC) [4], in the United States approximately 80% to 90% of lung cancer-related deaths are linked to smoking. In Vietnam, tobacco is the cause of 90% of lung cancer cases [5]. Besides smoking, there are other main causes of lung cancer [2] [4], such as:
- Genetics: Some individuals carry genetic mutations inherited from their parents, increasing the risk of developing lung cancer.
- Environmental Factors: Exposure to carcinogenic substances in the environment, including industrial chemicals, air pollution, and ultraviolet radiation, can elevate the risk.
- Radiation Therapy to the Chest: Cancer survivors who underwent chest radiation therapy are at a higher risk of developing lung cancer.
- Diet: Scientists are studying various foods and dietary supplements to understand their impact on the risk of lung cancer.
1.4.2 Secondary Causes and Carcinogenic Agents
In addition to the primary causes, there are several secondary causes and carcinogenic
agents contributing to lung cancer [3] [4], including:
- Nicotine and Chemicals in Tobacco: Nicotine and the other chemicals in tobacco are major factors causing lung cancer and other smoking-related cancers.
- Asbestos: A fire-resistant and insulating material used in construction that can lead to lung cancer and other health issues when inhaled.
- Radon: A natural gas emanating from the ground and rocks that can increase the risk of lung cancer when it accumulates indoors.
- Benzene: An industrial chemical found in oil products that may cause blood cancer and bone marrow cancer.
- Formaldehyde: Used in various industrial products and household items; it can contribute to cancers of the nose, throat, and lungs.
- Hazardous Chemicals in Industrial Environments: Various hazardous substances, such as vinyl chloride, chromium, arsenic, and nickel, are linked to cancer risk.
- Chemicals in Food: Some food additives and preservatives are also associated with the risk of developing cancer.
1.5 Understanding Cancer Stages and Associated Symptoms
1.5.1 Lung Cancer Stages
Cancer is generally categorized into four main stages [6] based on the extent of its spread.

Figure 3: The four stages of lung cancer progression [28]
- Stage 0 (Pre-cancer): At this stage, there are cell abnormalities, but they have not developed into a tumor and have not spread to nearby organs.
- Stage I (Local): The tumor is localized in the area of origin without spreading to distant locations. This is often a stage where cancer has a high chance of cure when detected early.
- Stage II (Locally Advanced): Cancer starts to invade nearby structures and organs but is still within the area of origin.
- Stage III (Regional): In lung cancer, this stage involves the lymph nodes in the chest.
    o Stage IIIA: Cancer occurs in the lymph nodes, but only on the same side of the chest where the cancer first started.
    o Stage IIIB: Cancer has spread to lymph nodes on the opposite side or above the collarbone.
- Stage IV (Advanced or Metastatic): This is the stage where cancer has spread extensively, possibly metastasizing to other organs in the body.
- Early Stage Symptoms: In the early stages, there may be no apparent symptoms. When present, they are often mild and easily overlooked; they may include slight swelling, discomfort, or fatigue. In the limited stage, cancer is found only in one lung or in nearby lymph nodes on the same side of the chest.
- Advanced Stage Symptoms: Symptoms intensify and become more noticeable. Pain, swelling, weight loss, and fatigue may increase, and issues related to the affected organ's function may arise.
- Widespread Cancer Symptoms: Symptoms become severe, and the cancer may spread to nearby areas: throughout one lung, to the opposite lung, to lymph nodes on the opposite side, to fluid around the lungs, to the bone marrow, and to distant organs. Patients may experience symptoms such as pain, weight loss, swelling, and noticeable organ dysfunction.
- Cancer Metastasis Stage Symptoms: Cancer has metastasized to other organs, causing symptoms that depend on the affected organ. For example, metastasis to the bone may cause bone pain, while metastasis to the brain may result in vision changes, headaches, and mood alterations.
2. Related Works
Lung cancer has become one of the most prevalent cancer types, especially following the emergence of the COVID-19 pandemic, which has increased risk and drawn particular concern to this issue. Research on medical image processing using Deep Learning (DL) began as early as 1995, primarily focusing on the classification of lung nodules in X-ray images [38].
Rajasekar V and colleagues [31] proposed various methods, including Convolutional Neural Network (CNN), CNN Gradient Descent (CNN GD), VGG-16, VGG-19, Inception V3, and ResNet-50, to predict lung cancer from CT scan images and histopathological images. Their analysis indicates that detection accuracy is superior when histopathological data is used. Conversely, Vasavi CH and Sruthi ND [32] employed a deep learning CNN model along with a machine learning SVM. After analyzing and comparing multiple models, they concluded that CNN is the most effective method. C. Lavanya and team [33] experimented with machine learning algorithms such as logistic regression, random forest, SVM, KNN, XGBoost, and AdaBoost to predict lung cancer based on novel biomarkers. The results revealed that the most accurate prediction model is random forest, followed by SVM and KNN, with logistic regression yielding the lowest accuracy.
A research article by Saqib Qamar and colleagues [34] on predicting TEM images using a hybrid deep learning and machine learning model is also a noteworthy effort. They extracted features from a CNN model and applied machine learning to classify these features. The results showed that the hybrid CNN - Random Forest model has a higher accuracy rate (73%) than other classifiers such as AdaBoost, XGBoost, and SVM. Regarding the impact of COVID-19 on the lungs and increased cancer susceptibility, Talal S. Qaid and team [37] utilized chest X-ray images to predict lung cancer with hybrid deep learning (CNN, VGG16, VGG19) and machine learning (Naive Bayes, support vector machine, random forest, and XGBoost) models. The results demonstrated high prediction accuracy, with all models surpassing 90%.
3. Model Evaluation Metrics
In the process of researching and evaluating model performance, aiming to create an accurate and effective framework for the learning process, to understand how a model learns from data and evolves over time, and to measure its performance objectively and consistently, we have undertaken the following steps:

e Define Model: The process of determining the architecture, parameters, loss function, optimization method, and data preprocessing for a machine learning model. The purpose is to create a detailed design that allows the model to learn from data and make accurate predictions.
e Accuracy and Loss Graph
- Accuracy Graph: Measures the model's accuracy at each training step, indicating the percentage of correct predictions. In particular, accuracy on the test set is a crucial metric for evaluating the model's ability to generalize to new data.
- Loss Graph: Illustrates the model's loss, providing insight into how the model is learning from the data. A reduction in loss signifies effective model progression.
e Classification Report and Confusion Matrix
Figure 4: Definition of Confusion Matrix [30]
- Confusion Matrix: A table representing the counts of correct and incorrect predictions for each class, showing True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). It helps in understanding how the model classifies data points and reveals the confusion between classes: the main diagonal elements of the matrix represent the correct predictions for each class, while the off-diagonal elements indicate the number of mispredictions.
- Classification Report: A summary table providing detailed information about the performance of a classification model. It includes metrics such as Accuracy, Precision, Recall, F1-score, and Support for each class. The classification report helps assess the model's ability to classify each class and detect imbalances.
o Accuracy: The proportion of correct predictions among all samples.
Calculation: Number of correct predictions / Total number of samples
o Precision: The proportion of correct positive predictions in relation to the overall number of positive predictions.
Calculation: True Positives / (True Positives + False Positives)
o Recall: The proportion of correct positive predictions compared to the overall count of actual positive instances.
Calculation: True Positives / (True Positives + False Negatives)
o F1-score: A composite metric that takes into account both Precision and Recall, commonly employed when it is desirable to assess both aspects simultaneously.
Calculation: 2 x (Precision x Recall) / (Precision + Recall)
o Support: The number of samples within each class. Support aids in assessing the balance of samples among the classes.
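As a minimal illustration of the metrics above (using hypothetical labels, not the thesis's actual results), they can be computed with scikit-learn's `confusion_matrix` and `classification_report`:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score)

# Hypothetical ground-truth and predicted labels for the three classes.
classes = ["Benign", "Malignant", "Normal"]
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 1, 2, 2, 0, 2, 1]

# Confusion matrix: rows are true classes, columns are predicted classes;
# the main diagonal holds the correct predictions.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy = number of correct predictions / total number of samples.
acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.2f}")

# Per-class Precision, Recall, F1-score, and Support in one report.
print(classification_report(y_true, y_pred, target_names=classes))

# Precision for one class computed by hand matches TP / (TP + FP).
tp = cm[1, 1]             # true Malignant predicted as Malignant
fp = cm[:, 1].sum() - tp  # other classes predicted as Malignant
assert np.isclose(
    precision_score(y_true, y_pred, labels=[1], average=None)[0],
    tp / (tp + fp))
```

`classification_report` prints precision, recall, F1-score, and support per class, matching the definitions above.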
e 5-Fold Cross-Validation: A technique in machine learning for evaluating the performance of a model. The dataset is divided into 5 equal parts (folds); the model is trained on 4 folds and validated on the 5th. This procedure is repeated 5 times, each time using a different fold for validation. The final result combines the performance evaluations from the 5 validation rounds, providing a reliable estimate of the model's performance on the entire dataset.
Our model training process is divided into two main parts: ‘Individual Models Evaluation’
and ‘Hybrid Model Evaluation’ These segments allow us to monitor how the combination
of models influences their performance and predictive capabilities
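As a rough sketch of the hybrid idea (feature extraction followed by a classical classifier), the pipeline below uses PCA on scikit-learn's small digits images as a stand-in for a trained CNN feature extractor; the thesis's actual hybrids feed CNN, VGG16, or RESNET50 features into RF, LR, or SVM:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: small labeled images (the thesis uses lung CT scans).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# Step 1: feature extraction. Here PCA is a simple stand-in for the
# penultimate-layer activations of a trained CNN.
extractor = PCA(n_components=32, random_state=42).fit(X_tr)
F_tr, F_te = extractor.transform(X_tr), extractor.transform(X_te)

# Step 2: a classical ML classifier (SVM/LR/RF) is trained on the features.
clf = SVC(kernel="rbf").fit(F_tr, y_tr)
print("Hybrid test accuracy: %.3f" % clf.score(F_te, y_te))
```

Swapping `SVC` for `RandomForestClassifier` or `LogisticRegression` yields the other hybrid variants evaluated in Chapter 3.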
CHAPTER 3: EXPERIMENTS AND RESULTS
1 Dataset
1.1 Dataset Introduction
The "IQ-OTH/NCCD Lung Cancer Dataset" [19] was collected at the Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases (IQ-OTH/NCCD) over a period of three months during the fall of 2019. The dataset includes CT scan images from patients diagnosed with lung cancer at various stages, as well as images from healthy individuals. The CT scan images in this dataset were meticulously annotated by medical experts.
1.2 Dataset Information
It comprises a total of 1,097 images representing 110 cases. The cases are categorized into three classes: Benign, Malignant, and Normal. Based on the diagnoses:
- 15 cases are Benign, with a total of 120 images in the dataset
- 40 cases are Malignant, with a total of 561 images in the dataset
- 55 cases are Normal, with a total of 416 images in the dataset
All images in the dataset are in JPG format with a resolution of 512x512 pixels.
Figure 5: Original images for the three categories: Benign, Malignant and Normal.
1.3 CT Scan details and slices
- The original CT scans were collected in DICOM format
- A Siemens SOMATOM scanner was used for image acquisition.
- The CT protocol settings included 120 kV, a 1 mm slice thickness, and a window
width ranging from 350 to 1200 HU with a window center between 50 and 600 HU
- Scans were conducted with a breath-hold at full inspiration
- All images were de-identified before analysis
- Written consent was waived by the institutional review board, and the study was
approved by the institutional review board of participating medical centers
- Each scan in the dataset contains multiple slices
- The number of slices per scan varies from 80 to 200 slices
- Each slice represents an image of the human chest captured from different angles
and perspectives
1.4 Data Augmentation
In this Data Augmentation [20] [21] section, our primary objective is to augment the dataset for the 'Benign' category, which originally had a limited number of instances (only 120 files). The augmentation specifically targets the 'Benign' category to achieve a more balanced distribution of data among the three categories: while the 'Malignant' category contains 561 files and the 'Normal' category has 416 files, the 'Benign' category was initially underrepresented with only 120 files. Therefore, the entire Data Augmentation process is designed to augment the 'Benign' category to address this imbalance.
Applied Data Augmentation Method

Geometric transformations [22] are among the most suitable methods for augmenting lung CT-scan image data. They allow us to apply operations such as flipping, cropping, rotating, or translating images, introducing diversity into the dataset without altering the content of the images. The transformations used include:

1. Cropping: Cropping a portion of an image simulates the removal of unimportant areas. This helps the machine learning model focus on the crucial parts of the images.
2. Flipping: Flipping images horizontally or vertically creates mirrored images. This helps the machine learning model generalize better to image variations.

After the Data Augmentation process, we obtained a total of 1,385 files, comprising 408 Benign, 561 Malignant, and 416 Normal cases. We preserved both the original images and the transformed images to gain insight into the alterations made.
The purpose of validating the Data Augmentation outcomes is to confirm that the new image variations were created accurately and systematically. This validation includes verifying that the augmented images are stored properly and correspond logically to their originals.
[Figure: original slice alongside its horizontally flipped variant]
Figure 6: Images after Data Augmentation.
1.5 Train-Test Split
We chose a dataset split ratio of 80% to 20%, where the training set contains 1108 files and the test set contains 277 files. This choice ensures that the machine learning model is trained and tested effectively across all categories, which is especially important when dealing with the imbalance in the initial dataset.
The training and testing sets are two crucial parts of the development and evaluation of a machine learning or deep learning model. The training set helps the model learn to classify or predict, while the test set measures the model's capabilities. Together, they ensure that the model is trained and evaluated accurately and objectively.
• Procedure and Method for Train-Test Split
We employ the `train_test_split` function from the scikit-learn library to divide the data in an 80% - 20% ratio for the training set and test set, respectively.
- Step 1: All images are resized to a uniform size of 128x128 pixels to ensure consistency for the model.
- Step 2: The resized data is divided into a training set and a test set following an 80% - 20% ratio.
- Step 3: The `train_test_split` function from the scikit-learn library is used to randomly partition the data. Employing a fixed random seed guarantees reproducible results and aids in monitoring and debugging the training and testing process.
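The three steps above can be sketched as follows. Small stand-in feature vectors replace the real resized images, and the variable names are illustrative, not the thesis code:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((1385, 8))            # stand-in vectors for the 1385 resized images
y = rng.integers(0, 3, size=1385)    # 0 = Benign, 1 = Malignant, 2 = Normal

# random_state fixes the seed so the partition is reproducible;
# stratify keeps the class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
print(X_train.shape[0], X_test.shape[0])  # 1108 277
```

With 1385 files, a 20% test fraction yields exactly the 277-file test set and 1108-file training set reported in Table 1.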
Table 1: Data Summary Table

Classes     Before Data Augmentation   After Data Augmentation   Train Set   Test Set
Benign      120                        408                       316         92
Malignant   561                        561                       469         92
Normal      416                        416                       323         93
Total       1097                       1385                      1108        277
2 Model Architecture
This section presents a diagram illustrating the detailed flow of our model. In the first step, the lung dataset is taken as input and the input images undergo preprocessing, such as augmentation and resizing. In the second stage, the data is split into training and testing sets using the train_test_split function from the scikit-learn library, which randomly partitions the data in an 80% to 20% ratio. In the third stage, machine learning algorithms, along with deep learning methods, are employed, with hyperparameters optimized using the Randomized Search CV method. We applied the Machine Learning methods Random Forest, Logistic Regression, and Support Vector Machine; Deep Learning methods, including CNN, ResNet50, and VGG16, were also utilized.
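The third-stage hyperparameter search can be sketched with scikit-learn's RandomizedSearchCV. The search space below is illustrative, not the exact grid used in our experiments:

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Candidate distributions for one of the machine learning models (SVM);
# RandomizedSearchCV samples n_iter combinations instead of trying all.
param_dist = {
    "kernel": ["linear", "rbf"],
    "C": [0.01, 0.1, 1, 10],
    "gamma": [0.01, 1, 100, 1000],
}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=3,
                            random_state=42)
# search.fit(X_train, y_train) would then expose search.best_params_
```

The same pattern is applied to every model in the pipeline, with model-specific parameter distributions.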
[Figure: overall model pipeline — input, preprocessing, train/test split and validation, Randomized Search CV over hyperparameters (learning rate, epochs, batch size, optimizer, layers, dropout, activation function, filters, kernel size, pooling, neurons, early stopping) for Logistic Regression (solver, penalty, C), Random Forest (n_estimators, max_features, min_samples_split, max_depth, min_samples_leaf, bootstrap), and the CNN algorithms (CNN 3 Layers, CNN 4 Layers, RESNET50, VGG16), ending in classification into Benign, Malignant, or Normal]
In this research, we have developed a unique integrated pipeline combining the strengths of both machine learning and deep learning models to predict lung cancer accurately and efficiently. This process represents a sophisticated blend of deep learning's ability to extract complex features and machine learning's hyperparameter optimization. We begin with the input data, which consists of CT-scan images, apply preprocessing to the input images, such as augmentation and resizing, and split the data into training and testing sets with an 80% to 20% ratio. In the subsequent stage, deep learning algorithms are employed, with hyperparameters optimized using the Randomized Search CV method, and features are extracted. These extracted features serve as inputs for machine learning algorithms, whose hyperparameters are also optimized with Randomized Search CV. The output consists of classification results, which are evaluated using various performance metrics.
[Figure: hybrid pipeline — input CT-scan images, preprocessing (data augmentation, image resampling), train/test split and validation, deep learning models (CNN 3 Layers, CNN 4 Layers, RESNET50, VGG16) tuned with Randomized Search CV and performing feature extraction, followed by machine learning classifiers (Logistic Regression, Random Forest, SVM) also tuned with Randomized Search CV]
2.1 Convolutional Neural Networks (CNN)
A Convolutional Neural Network (CNN) [8] [9] [10] is a specialized type of neural network commonly used for image processing. CNNs can autonomously learn hierarchical features from data, making them effective at pattern recognition in images. With the capability to learn important features from image data, CNNs are an ideal model for classifying medical images.
The CNN model is designed for efficient feature extraction and learning from image data. The model is constructed with 4 convolutional layers (3 convolutional layers in the CNN 3 Layers variant) and 6 fully connected layers. The ReLU activation function is applied after each convolutional layer to increase the non-linearity of the model, retaining positive values and discarding negative values. In the convolutional layers, the numbers of filters are set sequentially to 32, 64, 128, and 256 to detect the various features present in the image. All filters have a kernel size of 3x3; these kernels move across the entire image to perform the convolution operation, identifying small features within the image space. Convolution means that as a kernel moves to each position in the image, it multiplies the corresponding elements, and the sum of these products becomes one value of the feature map.
To reduce the spatial size of the feature maps and simplify the model's computational complexity, we use max-pooling layers with a size of 2x2. The max-pooling operation takes the maximum value from each group of values in the scanned area, retaining the prominent features of that region.
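The convolution and max-pooling operations just described can be sketched in plain NumPy. This is a toy single-filter example with illustrative sizes, not the actual model code:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution: slide the kernel, multiply element-wise, sum."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(fm):
    """2x2 max pooling: keep the maximum of each 2x2 block."""
    h, w = fm.shape[0] // 2 * 2, fm.shape[1] // 2 * 2
    return fm[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.random.rand(10, 10)               # toy single-channel image
kernel = np.random.rand(3, 3)              # one 3x3 filter
fm = np.maximum(conv2d(img, kernel), 0.0)  # ReLU keeps positive values
pooled = max_pool2x2(fm)
print(fm.shape, pooled.shape)  # (8, 8) (4, 4)
```

A real layer applies many such filters in parallel (32, 64, 128, 256 here), producing one feature map per filter.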
The 6 fully connected layers include 5 hidden layers with 256, 126, 64, and 32 neurons, creating a gradually narrowing path that preserves important features. The final layer is an output layer with 3 classes; there the softmax activation function transforms the output into a probability distribution.
These details form a complete CNN architecture, capable of learning and extracting complex features from image data. The use of a Convolutional Neural Network (CNN) is an ideal choice for predicting lung cancer: CNNs specialize in processing medical images and automate the learning process from data. The CNN's ability to extract features and synthesize spatial information makes it a powerful tool, with positive potential in research and application for lung cancer prediction.
[Figure: generic CNN architecture — input (28x28x1), Conv_1 and Conv_2 convolutions with 5x5 kernels and valid padding producing n1 channels (12x12xn1) and n2 channels (8x8xn2, pooled to 4x4xn2), fully connected layers fc_3 (n3 units) and fc_4, then output]
Figure 9: CNN Architecture [23]
2.2 VGG16 (Visual Geometry Group 16)
VGG16 [11], a Convolutional Neural Network (CNN) model developed by the Visual Geometry Group at the University of Oxford, is recognized in computer vision for its deep architecture. Comprising 16 layers, including convolutional and fully connected layers, VGG16 is known for its simplicity and effectiveness in image classification. Achieving high accuracy, particularly when trained on the ImageNet dataset, VGG16 excels at learning and extracting complex features from image data.
[Figure: VGG16 architecture — stacked convolution+ReLU blocks with max pooling, fully connected+ReLU layers (1x1x4096), and a softmax output (1x1x1000)]
Figure 10: VGG16 Architecture [24]
VGG16 consists of a total of 16 weight layers stacked on top of each other: 13 convolutional layers and 3 fully connected (dense) layers. VGG16 is characterized by the use of multiple convolutional blocks, each with a similar structure of convolutional layers followed by a max-pooling layer. The first two blocks each contain two convolutional layers, with filters set to 64 and 128 respectively; max pooling is applied with a 2x2 kernel and a stride of 2, and the ReLU activation function provides non-linearity. The subsequent three blocks each contain three convolutional layers, with filters sequentially set to 256, 512, and 512. As in the first two blocks, max pooling with a 2x2 kernel and a stride of 2 and the ReLU activation function are applied.
Fully connected layers combine the learned features into a classification model. The convolutional layers are numbered Conv1 to Conv13, the pooling layers Pool1 to Pool5, and the fully connected layers FC1 to FC3. Fully Connected 1 and Fully Connected 2 have 4096 neurons each, while Fully Connected 3 (the output layer) has 3 neurons, corresponding to the number of classes in our dataset. VGG16 has a large number of parameters, especially due to the use of 3x3 filters in multiple layers, totaling up to 138 million parameters [12]. VGG16 is often trained on the ImageNet dataset, where it achieves high performance and stands out as one of the exemplary models in computer vision. Advantages of VGG16 include a simple, understandable architecture capable of learning complex features from images; however, its large parameter count increases the model size and demands significant computational resources.
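As a sanity check on the 138-million figure, the parameter count of the standard VGG16 (with its original 1000-class ImageNet head, not our 3-class head) can be tallied directly: each 3x3 convolution has (3·3·Cin + 1)·Cout parameters and each dense layer (Cin + 1)·Cout.

```python
# Channel transitions of the 13 VGG16 convolutional layers
convs = [(3, 64), (64, 64), (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
conv_params = sum((3 * 3 * cin + 1) * cout for cin, cout in convs)

# Three fully connected layers; the first flattens the final 7x7x512 maps
fc = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum((cin + 1) * cout for cin, cout in fc)

total = conv_params + fc_params
print(total)  # 138357544
```

The total, 138,357,544, matches the "138 million parameters" cited above; most of it sits in the first fully connected layer.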
[Figure: VGG-16 layer summary — Conv 1-2 through Conv 5-3]
Figure 11: Summary of VGG16 Model [25]
With its deep and efficient architecture for learning and extracting features from image data, VGG16 is an ideal choice for our lung cancer prediction project. Its simple design and high accuracy, especially after being trained on the ImageNet dataset, make VGG16 stand out in the field of medical image classification. The flexibility of VGG16 and its ability to synthesize spatial information make it a powerful tool for addressing the lung cancer prediction problem.
2.3 RESNET50
ResNet50, a potent Convolutional Neural Network (CNN) developed by Microsoft Research, is widely applied in computer vision. Its revolutionary residual block structure addresses key challenges in training deep neural networks. With 50 layers organized into 16 residual blocks, ResNet50 [13] [14] excels in image classification across various classes, demonstrating expertise in learning complex features.
[Figure: ResNet50 residual block variants — 1x1, 3x3, and 1x1 convolutions with activations and shortcut connections]
Figure 12: RESNET50 Architecture [26]
The basic building block of ResNet50 is the residual block, which comprises two main paths: the identity path and the shortcut path. The identity path represents the direct mapping from input to output, while the shortcut path provides a route for faster gradient flow through multiple layers. Each residual block in ResNet50 typically uses a bottleneck architecture to reduce computational complexity. The bottleneck architecture includes three main layers: a 1x1 convolution, a 3x3 convolution, and another 1x1 convolution; the 1x1 convolutions reduce and then restore the size of the feature maps, optimizing the number of parameters. ResNet50 is built by stacking multiple residual blocks on top of each other, which helps the model learn hierarchical features from simple to complex; using many layers lets ResNet50 capture deep, intricate representations of the input data. Instead of a fully connected layer at the end, ResNet50 employs Global Average Pooling (GAP). This layer computes the average value of each feature map, creating a fixed-size vector; GAP reduces the number of parameters and acts as a regularization technique. The last layer of ResNet50 is a fully connected layer with softmax activation, responsible for creating a probability distribution across the output classes of the model.
The adaptability of ResNet50 in handling images of different sizes and its capability to learn deep features make it a crucial and flexible tool for both research and practical applications in our lung cancer prediction project. With 50 layers, ResNet50 not only excels in multi-class image classification but also demonstrates efficiency in learning complex features.
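The Global Average Pooling step mentioned above is simple enough to show directly in NumPy: every feature map collapses to its mean, giving a fixed-size vector regardless of the spatial size of the maps. The 7x7x2048 shape is the typical ResNet50 final feature-map size, used here as an illustrative stand-in.

```python
import numpy as np

feature_maps = np.random.rand(7, 7, 2048)  # stand-in for ResNet50's final maps
gap = feature_maps.mean(axis=(0, 1))       # one average value per channel
print(gap.shape)  # (2048,)
```

Because GAP has no weights of its own, it replaces millions of dense-layer parameters with a parameter-free reduction, which is why it also acts as a form of regularization.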
2.4 Logistic Regression
Logistic Regression (LR) is a machine learning algorithm used to predict the probability of an event occurring, commonly applied in classification tasks. Unlike linear regression, Logistic Regression uses the sigmoid function to generate probabilities, making it effective for binary classification. The term 'logistic' originates from the S-shaped sigmoid curve, which maps the output to a range between 0 and 1, representing the probability of the event occurring. Logistic Regression is a simple and versatile tool widely used for predicting customer purchases, classifying images, and forecasting disease likelihood in the medical field.
[Figure: logistic regression flow — net input function, sigmoid activation, threshold function, predicted class label]
Figure 13: Logistic Regression Architecture [27]
Logistic Regression [15] is a crucial algorithm in Machine Learning, commonly employed for classification tasks. It predicts the class of a new data sample based on its features, using the sigmoid function to transform its output into probabilities. Regularization is applied, specifically 'l2' regularization, which adds the squared magnitude of the weights to the loss function; this helps maintain stability by preventing unnecessary complexity.
The regularization strength, denoted 'C' and set to 0.01 in this context, determines the influence of regularization during training. A low value such as 0.01 indicates strong regularization, helping control the model's complexity. The Newton-Conjugate Gradient algorithm is chosen as the solver to optimize the loss function and find suitable weights for the model. These elements collectively contribute to Logistic Regression's effectiveness, ensuring stable and well-controlled learning during training.
Logistic Regression is specifically designed to tackle classification problems, highlighting its simplicity and wide applicability. Its good performance on various types of data and its feasibility for the lung cancer prediction problem make Logistic Regression a powerful and useful tool in our research.
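The configuration described above ('l2' penalty, C=0.01, Newton-CG solver) maps directly onto scikit-learn's estimator. The synthetic dataset below is a stand-in for the extracted image features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data standing in for the real feature vectors
X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# l2 penalty with C=0.01 (low C = strong regularization in scikit-learn,
# since C is the inverse of regularization strength), newton-cg solver
clf = LogisticRegression(penalty="l2", C=0.01, solver="newton-cg",
                         max_iter=100)
clf.fit(X, y)
print(clf.score(X, y))
```

Note that in scikit-learn C is the inverse of the regularization strength, so C=0.01 indeed corresponds to the strong regularization described in the text.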
2.5 Random Forest
Random Forest [16] is a machine learning model belonging to the Ensemble Learning family, built on the principles of Decision Trees. Rather than relying on a single decision tree, Random Forest combines multiple decision trees to create a strong and diverse model. Each decision tree in the Random Forest is constructed on a randomly sampled subset of the training data, ensuring independence between the trees and preventing overfitting. The final decision of the Random Forest is determined through a voting process, chosen by the majority vote across all the trees.
After searching for optimal parameters, we use 2000 trees to enhance diversity and accommodate different patterns within the data. Choosing 'sqrt' (the square root of the total number of features) as the limit on features considered when building each tree helps reduce the risk of overfitting and encourages diversity. The maximum depth of each tree is limited to 50 to control the model's complexity. The minimum number of samples required to split a node is set to 2 to ensure stability in the splitting process, and the minimum number of samples at each leaf node is set to 3 to control the size of the trees and avoid creating overly small leaf nodes. Bootstrap sampling is set to False, meaning the entire dataset is used for each tree without random resampling. Random Forest excels at synthesizing information and handling noise, making it a popular tool for various classification and prediction applications.
With its flexibility and ability to handle large datasets, Random Forest is a suitable choice for our lung cancer prediction project. Its capacity to combine multiple decision trees makes it effective at processing diverse data and learning complex models. In particular, its stability and reduced risk of overfitting make Random Forest a reliable tool for predicting and classifying lung cancer conditions.
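The parameter set reported above translates directly into scikit-learn. Again the data is a synthetic stand-in; note that bootstrap=False means every tree sees the full training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)

# The reported configuration: 2000 trees, sqrt feature limit, depth 50,
# min_samples_split=2, min_samples_leaf=3, no bootstrap sampling
rf = RandomForestClassifier(n_estimators=2000, max_features="sqrt",
                            max_depth=50, min_samples_split=2,
                            min_samples_leaf=3, bootstrap=False,
                            random_state=0, n_jobs=-1)
rf.fit(X, y)
print(rf.score(X, y))
```

With 2000 trees, fitting is the dominant cost; reducing n_estimators is an easy way to experiment more quickly.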
2.6 Support Vector Machine (SVM)
The Support Vector Machine (SVM) [17] [18] is a critical model in machine learning, widely applied to classification and regression tasks. SVM determines the optimal decision hyperplane between data groups, relying on the support vectors near the decision boundary. These support vectors shape the hyperplane, enhancing generalizability and reducing computational cost. The kernel, a vital SVM technique, maps data to a higher-dimensional space where linear classification becomes possible. SVM aims to maximize the margin, the distance between the hyperplane and the nearest data points of both classes.
After finding the best parameters, we chose 'linear' for a linear SVM, suitable for linearly separable data, meaning the decision boundary is a straight line. The gamma parameter is set to 1000, controlling the influence of a single training example; a high value leads to a narrow decision boundary. The regularization parameter C is set to 0.01, balancing a smooth decision boundary against accurate classification of training points; a low C value indicates strong regularization, prioritizing a larger margin. With its ability to create complex decision boundaries and its high accuracy, SVM is an ideal choice for our lung cancer prediction project. Its consistency and good performance on diverse datasets make SVM a powerful tool for classifying and predicting lung cancer status.
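The chosen configuration corresponds to the scikit-learn call below, again on synthetic stand-in data. One caveat worth noting: with kernel='linear', the gamma parameter has no effect on the decision function, so it appears here only to mirror the reported settings.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# Linear kernel, C=0.01 (strong regularization, larger margin);
# gamma=1000 is ignored by the linear kernel
svm = SVC(kernel="linear", gamma=1000, C=0.01)
svm.fit(X, y)
print(svm.score(X, y))
```

For multi-class data, SVC applies a one-vs-one scheme internally, so the three classes are handled without extra code.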
[Figure: SVM — predicted variable f(x) = w·φ(x) + b, kernel parameters (RBF, linear, polynomial, or sigmoid), support vectors defining the hyperplane]
Figure 15: SVM Architecture [36]
2.7 Hybrid Deep Learning and Machine Learning Models
In the field of machine learning, data classification is of paramount importance, and feature
extraction plays a crucial role in this task Deep learning models such as CNN, ResNet50,and VGG16 have demonstrated outstanding capabilities in feature extraction from images
Therefore, the proposed model combines deep learning and machine learning to harness
the strengths of both fields
The input to the model is the image to be predicted. Deep learning models (CNN, ResNet50, and VGG16) extract features from the input image, and the feature maps after the flatten layer are used as input data for machine learning. The feature data from deep learning is then fed into machine learning models (Logistic Regression, SVM, and Random Forest) to perform the classification; the output is the classification result for the input image.
Building on the proficiency of deep learning (CNN, ResNet50, VGG16) in extracting complex features from images, the model enhances the performance of the classification process. These features, once transferred to machine learning, create a robust and accurate multi-class classification system.
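The hybrid pipeline can be sketched as follows. Random vectors stand in for the flatten-layer outputs of the deep model (the real pipeline would obtain them from, e.g., a Keras model cut at the flatten layer); the classifier stage is shown with Logistic Regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.random((300, 256))       # stand-in for flatten-layer outputs
labels = rng.integers(0, 3, size=300)   # Benign / Malignant / Normal

X_tr, X_te, y_tr, y_te = train_test_split(features, labels,
                                          test_size=0.2, random_state=0)

# Machine learning stage: classify the deep-learning feature vectors
clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
preds = clf.predict(X_te)
print(preds[:5])
```

Swapping LogisticRegression for SVC or RandomForestClassifier yields the other hybrid variants evaluated in the thesis.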
3 Software Requirements
• System and Runtime Environment
- Operating System: Windows 10 Pro 64-bit
- System Model: MS-7D46
- BIOS Version: 2.10
- Processor: 12th Gen Intel(R) Core(TM) i5-12400F (12 CPUs), ~2.5GHz
- Memory: 16384MB RAM
- DirectX Version: DirectX 12
- Graphics Card: NVIDIA GeForce RTX 2060
• Libraries and Frameworks
- Programming Language: Python
- Main Libraries: NumPy, Pandas, Glob, Pickle, Joblib, Scikit-image, OpenCV,
Matplotlib, Scikit-learn, TensorFlow, Seaborn
4 Model Evaluation Results
4.1 Training Model Configuration
Table 2: The model training parameters

Parameter              Value           Description
Input Image            128x128x1       The model takes images of size 128x128 with one grayscale channel
Epoch                  20              To learn complex representations from both the training and testing data
Learning Rate          1e-4 (0.0001)   Controls the step size during optimization, aiding convergence and stability of the model
Batch Size             35              For each weight update, the model uses 35 images during training
Number of Iterations   100             The machine learning process consists of 100 iterations
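A quick arithmetic check on this configuration: with 1108 training images and a batch size of 35, each epoch performs ceil(1108/35) weight updates.

```python
import math

train_size, batch_size, epochs = 1108, 35, 20
steps_per_epoch = math.ceil(train_size / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)  # 32 640
```

So the 20-epoch training run amounts to 640 weight updates at the 1e-4 learning rate.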