
Graduation thesis: Machine learning methods for cancer classification using gene expression data


DOCUMENT INFORMATION

Basic information

Title: Machine Learning Methods for Cancer Classification
Authors: Pham Minh Quang, Nguyen Huynh Thao Nhu
Supervisor: Ph.D. NGUYEN THANH BINH
School: University of Information Technology
Major: Information Systems
Type: Graduation Thesis
Year of publication: 2024
City: Ho Chi Minh City
Pages: 96
File size: 53.64 MB

Structure

  • CHAPTER 1: INTRODUCTION
    • 1.2 [illegible]
    • 1.3 Research Questions
    • 1.4 Research Delimitations
    • 1.5 Thesis Structure
  • CHAPTER 2: BACKGROUND AND THEORY
    • 2.1 Basic Knowledge of Gene
      • 2.1.1 What is DNA?
      • 2.1.2 [illegible]
      • 2.1.3 What is Gene?
      • 2.1.4 [illegible]
    • 2.2 Basic Knowledge of Gene Expression
      • 2.2.1 Stages in Gene Expression
        • 2.2.1.1 Transcription
        • 2.2.1.2 Translation
      • 2.2.2 Regulation of gene expression
        • 2.2.2.1 Transcriptional regulation
        • 2.2.2.3 Translational regulation
        • 2.2.2.4 Post-translational regulation
      • 2.2.3 Methods for measuring gene expression
        • 2.2.3.1 Northern Blotting
        • 2.2.3.2 Quantitative Polymerase Chain Reaction (QPCR)
        • 2.2.3.3 Microarray
    • 2.3 Basic Knowledge of Acute Leukemia (Blood Cancer)
      • 2.3.1 What is Blood Cancer?
      • 2.3.2 What is Acute Lymphoblastic Leukaemia (ALL)?
        • 2.3.2.1 Definition of Acute Lymphoblastic Leukaemia
        • 2.3.2.2 Types of ALL
        • 2.3.2.3 Philadelphia positive ALL
      • 2.3.3 What is Acute Myeloid Leukaemia (AML)?
        • 2.3.3.1 Definition of Acute Myeloid Leukaemia
        • 2.3.3.2 Types of AML
        • 2.3.3.3 AML starts in the bone marrow
      • 2.3.4 Symptoms of blood cancer
      • 2.3.5 Cancer-Causing Agents
      • 2.3.6 Dangerous level of blood cancer
    • 2.4 Impact of Initial and Prolonged Exposure to Carcinogen
    • 2.5 Stages of [illegible]
    • 4.3 [illegible]
      • 4.3.1 EDA
      • 4.3.3 Feature Engineering
        • 4.3.3.1 Data [illegible]
        • 4.3.3.2 Feature Selection
        • 4.4.1.1 Confusion Matrix [illegible]
        • 4.4.1.2 Classification Report Confusion Matrix of Naive Bayes Model
      • 4.4.2 Logistic Regression
        • 4.4.2.1 Confusion Matrix of Logistic Regression
        • 4.4.2.2 Classification Report Confusion Matrix of Logistic Regression Model
      • 4.4.3 Support Vector Machine
        • 4.4.3.1 Confusion Matrix of Support Vector Machine
        • 4.4.3.2 Classification Report Confusion Matrix of Support Vector Machine Model
        • 4.4.4.1 Confusion Matrix of Decision Tree
        • 4.4.4.2 Classification Report Confusion Matrix of Decision Tree Model
      • 4.4.5 Random Forest
        • 4.4.5.1 Confusion Matrix of Random Forest
        • 4.4.5.2 Classification Report Confusion Matrix of Random Forest Model
        • 4.4.6.1 Confusion Matrix of XGBoost
        • 4.4.6.2 Classification Report Confusion Matrix of XGB Model
      • 4.4.7 Adaboost
        • 4.4.7.1 Confusion Matrix of Adaboost
        • 4.4.7.2 Classification Report Confusion Matrix of Adaboost Model
      • 4.4.8 Neural Network
        • 4.4.8.1 Confusion Matrix of Neural Network
        • 4.4.8.2 Classification Report Confusion Matrix of Neural Network Model
      • 4.4.9 K-means Clustering
        • 4.4.9.1 Confusion Matrix of K-means Clustering
        • 4.4.9.2 Classification Report Confusion Matrix of K-means Clustering Model
    • 4.1 Model Evaluation
      • 4.1.1 Compare Evaluation of Built Models
      • 4.1.2 Performance Metrics
      • 4.1.3 Conclusion of comparing the evaluation of the models
  • CHAPTER 5: CONCLUSION AND FUTURE RESEARCH DIRECTIONS
  • CHAPTER 6: REFERENCES

Content


INTRODUCTION

Research Delimitations

In outlining the scope and parameters of this study, specific research delimitations have been instituted to enhance precision and relevance. The primary focus of this research is intentionally restricted to the classification of patients diagnosed with Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL) through the examination of gene expression data. Consequently, this study deliberately omits the consideration of other cancer types and additional biological markers.

These research delimitations are purposefully defined to enable a concentrated and targeted investigation, fostering an in-depth exploration within the specific confines of gene expression data within the context of AML and ALL classification.

Thesis Structure

- Overview of the context that led to the research.

- Identification and explanation of the problem addressed in the thesis.

- Constraints or limitations set for the research scope.

BACKGROUND AND THEORY

Basic Knowledge of Gene

What is DNA?

DNA [2], an abbreviation for Deoxyribonucleic Acid, comprises nucleotide units and holds immense significance not only for humans but also for the majority of living organisms. It serves as the repository of genetic material and genes, thereby contributing to our uniqueness.

A complete DNA set encompasses about 3 billion bases, roughly 20,000 genes, and 23 pairs of chromosomes. The inheritance of DNA is equally distributed, with half originating from our father's sperm and the other half from our mother's egg.

Despite its paramount importance, DNA is remarkably fragile, experiencing tens of thousands of damaging events daily in each of our cells. These damages may result from errors in DNA replication, exposure to free radicals, or UV radiation. Fortunately, specialized proteins within our cells can detect and repair many instances of DNA damage.

DNA carries the instructions essential for an organism's growth, development, and reproduction. These instructions are stored in sequences of nucleotide base pairs. Cells interpret the code to synthesize the proteins required for growth, where each group of three bases corresponds to a specific amino acid, the fundamental building block of proteins.

Mutations arise when alterations occur in the DNA sequence. While mutations can be a cause for concern, they can also play a beneficial role. If a mutation results in a change in the DNA code that impacts protein synthesis, it can be detrimental, as malfunctioning proteins may lead to the onset of diseases such as cystic fibrosis or sickle cell anemia.

Diseases like cancer can stem from mutations, particularly when genes responsible for proteins involved in cell growth undergo alterations. This can result in uncontrolled cell growth and division, contributing to the development of cancer. Some mutations leading to cancer can be inherited, while others may be acquired through exposure to carcinogens like UV radiation, chemicals, or cigarette smoke.

It's crucial to acknowledge that not all mutations are harmful; some are entirely benign and, in certain instances, contribute to the diversity of our genetic makeup.

Chromosomes [3][17] are threadlike structures made of protein and a single DNA molecule. They play a vital role in transporting genomic information from one cell to another. Found within the nucleus of cells in both plants and animals, including humans, chromosomes serve as carriers of genetic material.

In humans, there are 22 pairs of numbered chromosomes known as autosomes and one pair of sex chromosomes, which can be either XX or XY, resulting in a total of 46 chromosomes. Each chromosome pair consists of two chromosomes, with one inherited from each parent. This inheritance pattern means that individuals receive half of their chromosomes from their mother and the other half from their father.

The DNA within each chromosome consists of numerous genes [1][21], and it also harbors extensive sequences that do not encode any proteins, the functions of which are not yet fully understood. The coding regions of genes contain instructions that enable a cell to synthesize specific proteins or enzymes. Older estimates put the number of human genes at 50,000 to 100,000 [18]; the figure of roughly 20,000 used elsewhere in this chapter reflects more recent counts of protein-coding genes. Gene length varies widely, from a few hundred to hundreds of thousands of chemical bases or more.

To initiate protein synthesis, the genetic information in DNA is transcribed, base by base, into messenger RNA (ribonucleic acid), or mRNA. Subsequently, the mRNA exits the nucleus and is read by cellular organelles called ribosomes in the cytoplasm. The ribosomes facilitate the formation of a polypeptide, or amino acid chain, which then undergoes folding and configuration to ultimately create the intended protein. This intricate process of gene expression is fundamental to cellular function and contributes to the vast diversity of proteins that carry out various biological functions in living organisms.

All the DNA within a cell collectively constitutes the human genome. Approximately 20,000 essential genes are distributed among the 23 pairs of chromosomes located in the cell nucleus or on lengthy strands of DNA within the mitochondria.

Each individual inherits two copies of each gene, one from their father and one from their mother. While the majority of genes are shared among all individuals, there are minor variations in a small percentage (less than 1% overall) that are characteristic of each person. These variations, known as alleles, represent different sequences of base pairs, contributing to the unique physical traits of each individual.

Approximately 20,000 genes within the cell play a crucial role in guiding the growth, development, and overall health of animals or humans. The genetic information housed in DNA exists in the form of a chemical code known as the genetic code, which exhibits similarities across various living organisms.

An "allele" represents one variant of a gene. While many people may share the same gene, specific individuals may possess a distinct allele of that gene, influencing particular traits such as hair or eye color.

Despite commonalities in the genetic code, there exist variations that contribute to the uniqueness of each individual. Most of these variations are harmless, but alterations to the genetic information can result in improper production, incorrect amounts, or the absence of certain proteins.

Faulty variations in the gene are termed mutations. Single nucleotide polymorphisms (SNPs) represent changes in a single base or letter within the sequence and may encode a completely different protein, resembling a genetic mutation.

What is Gene?

Basic Knowledge of Gene Expression

Genes serve as the fundamental units of inheritance, carrying the genetic information essential for determining an organism's specific traits. Within genes, instructions are encoded for the synthesis of proteins that play diverse roles crucial for the cell's survival and functionality. Proteins actively contribute to the structure, function, and regulation of various biological processes, making their timely production vital for cell operations.

This activation of genes to produce proteins is termed gene expression. The process involves the generation of messenger RNA (mRNA) [5], serving as a blueprint for protein construction. Essentially, gene expression represents the flow of genetic information from genes to proteins, ensuring the creation of specific proteins at the appropriate times and quantities.

Upon the cell's need for a particular protein, the corresponding gene undergoes activation, initiating a series of steps. Initially, transcription takes place, producing an mRNA copy of the gene.

The mRNA then acts as a guide for protein synthesis during translation. This intricate process entails reading the mRNA sequence in triplets and assembling the corresponding amino acids to construct a distinct protein. The precise sequence of amino acids dictates the structure and function of the protein. Understanding gene expression is pivotal for comprehending the intricacies of cellular processes and ensuring the orchestrated production of proteins.

Gene expression [4] is a biological process in which the genetic information contained within genes is translated into functional outcomes. This process includes two main stages: transcription and translation.

Transcription marks the inaugural phase in gene expression, a pivotal process in which the genetic information housed within DNA is copied into messenger RNA (mRNA). This step is carried out by an enzyme known as RNA polymerase.

[Figure: transcription by RNA Pol II (source: microbenotes.com)]

The transcription process commences with the binding of RNA polymerase to a specific DNA region termed the promoter, which serves as the initiation point for transcription. Subsequently, RNA polymerase unwinds the DNA double helix near the promoter, allowing one DNA strand to act as a template for mRNA synthesis.

As RNA polymerase progresses along the DNA template, it incorporates complementary RNA nucleotides into the growing mRNA chain. The mRNA is synthesized in the 5’ to 3’ direction while the template strand is read 3’ to 5’, so the transcript is complementary to the template strand. Transcription continues until a termination signal within the DNA sequence prompts the detachment of RNA polymerase from the DNA template, leading to the release of the mRNA molecule.
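The base-pairing step of transcription can be illustrated with a minimal Python sketch. It is only an illustration, not part of the thesis: the function name `transcribe` and the example template strand are hypothetical, the template is assumed to be written 3' to 5' so that the returned mRNA reads 5' to 3', and promoter recognition and termination signals are ignored.

```python
# Complementary pairing used when the DNA template strand is read:
# A pairs with U (not T) in RNA, T with A, G with C, C with G.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the mRNA (written 5'->3') for a DNA template written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


# Hypothetical template strand, chosen only for illustration.
template = "TACGGCATT"
print(transcribe(template))  # AUGCCGUAA
```

Writing the template 3' to 5' keeps the sketch to a single pass with no reversal step; a template given 5' to 3' would need to be reversed first.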

In prokaryotes, transcription and translation transpire in the same cellular compartment. Conversely, in eukaryotes, the initial RNA transcript (pre-mRNA) undergoes modifications, including capping, splicing, and the addition of a poly-A tail, to transform into mature mRNA. After processing, mature mRNA molecules are transported from the cell nucleus to the cytoplasm, where they act as templates for protein synthesis during translation.

Translation is the pivotal process that utilizes the genetic information encoded in mRNA to orchestrate the synthesis of a precise sequence of amino acids, ultimately giving rise to the creation of a functional protein. This intricate mechanism ensures the accurate transformation of nucleotide sequences into the language of proteins, crucial for the diverse functions and structures that proteins contribute to within a cell.

[Figure 7: stages of translation, including initiation and termination]

The commencement of translation occurs as the mRNA binds to the ribosome, recognizing the start codon (AUG) on the mRNA and establishing the initiation point for translation.

Transfer RNA (tRNA) molecules, each carrying a specific amino acid, align their anticodon sequences with the corresponding codons on the mRNA. Throughout elongation, the ribosome traverses the mRNA, and each codon pairs with the appropriate tRNA anticodon. Amino acids transported by tRNAs are sequentially joined, forming a polypeptide chain.

As the ribosome progresses along the mRNA, the polypeptide chain grows, constructing the primary structure of the protein. The translation process endures until a stop codon (UAA, UAG, or UGA) is encountered on the mRNA. These stop codons do not specify amino acids but rather signal the termination of translation.
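The start codon, triplet reading, and stop codon rules described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the codon table is deliberately partial (only a handful of entries from the standard genetic code), translation is assumed to begin at the first AUG found, and the names `translate`, `CODON_TABLE`, and the example sequence are hypothetical.

```python
# Partial codon table (a few entries of the standard genetic code only).
CODON_TABLE = {
    "AUG": "Met", "CCG": "Pro", "GGC": "Gly", "UUU": "Phe",
    "AAA": "Lys", "GAA": "Glu", "GUG": "Val",
}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna):
    """Read codons from the first AUG until a stop codon; return amino acids."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon, so nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break  # stop codons terminate translation and add no amino acid
        # "Xaa" marks codons missing from this partial table.
        peptide.append(CODON_TABLE.get(codon, "Xaa"))
    return peptide


# Hypothetical mRNA: 2 leading bases, AUG CCG GGC UUU, then the stop codon UAA.
print(translate("GGAUGCCGGGCUUUUAAGG"))  # ['Met', 'Pro', 'Gly', 'Phe']
```

A full implementation would carry all 64 codons; the partial table keeps the sketch short while still showing the reading-frame logic.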

Subsequent to translation, the newly synthesized protein may undergo diverse post-translational modifications. These modifications encompass processes such as folding into a specific three-dimensional shape, addition of functional groups, or cleavage of specific segments to activate the protein. Understanding these steps is crucial for unraveling the complexities of protein synthesis and functionality within the cell.

The regulation of gene expression is a crucial process governing the timing and quantity of protein production by genes. This control mechanism ensures that proteins are synthesized in specific cells at particular times.

Gene expression undergoes regulation at various stages, including transcription, RNA processing, translation, and post-translational modification. These stages collectively form a sophisticated network of checks and balances that finely tune the cellular machinery, allowing for precise control over protein synthesis. Such regulation is essential for maintaining cellular homeostasis, responding to environmental cues, and ensuring the proper development and functioning of an organism.

Transcription regulation involves genetic interactions with specific DNA sites and epigenetic effects. Transcription factors interact with DNA, enhancing or inhibiting RNA polymerase binding. Epigenetic modifications, like DNA methylation and histone alterations, change gene accessibility without changing the DNA sequence.

Regulation of transcription in eukaryotic cells

Gene expression can be regulated by post-transcriptional mRNA processing. In eukaryotes, transcribed mRNA undergoes splicing and other modifications that protect the ends of the RNA strand from degradation. Splicing removes non-coding segments called introns and joins the protein-coding regions called exons.
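Splicing can be pictured as keeping the exon segments of a pre-mRNA and concatenating them in order. The Python sketch below is only an illustration: it assumes exon positions are already known and given as half-open (start, end) index pairs, and the function name `splice` and the example sequence are hypothetical. Real splicing depends on sequence signals recognized by the spliceosome, which are not modeled here.

```python
def splice(pre_mrna, exons):
    """Join exon segments given as half-open (start, end) index pairs.

    Everything outside the listed exons (the introns) is dropped.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Hypothetical pre-mRNA: exon 1 + intron + exon 2.
exon_1 = "AUGGCC"
intron = "GUAAGUUUUCAG"
exon_2 = "GAAUAA"
pre_mrna = exon_1 + intron + exon_2

# Exon coordinates are assumed known; here they follow from the lengths above.
exons = [(0, 6), (18, 24)]
print(splice(pre_mrna, exons))  # AUGGCCGAAUAA
```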

Basic Knowledge of Acute Leukemia (Blood Cancer)

Blood cancer [22] is a term that's used to describe many different types of cancer that can affect your blood, bone marrow or lymphatic system. It happens when something goes wrong with the development of your blood cells.

Blood cancer is a type of cancer that affects your blood cells. Leukaemia, lymphoma and myeloma are some of the most common types of blood cancer. There are also types called MPNs and

Blood cancer is caused by changes (mutations) in the DNA within blood cells. This causes the blood cells to start behaving abnormally. In almost all cases, these changes are linked to things we can’t control. They happen during a person’s lifetime, so they are not genetic faults you can pass on.

Some types of blood cancer affect children. Symptoms and treatment can be different between children and adults.

[Figure: blood cell types, including white blood cells and myeloma cells (B-cells)]

2.3.2 What is Acute Lymphoblastic Leukaemia (ALL)?

Acute Lymphoblastic Leukemia (ALL) [24][25] is a rapidly developing form of blood cancer that primarily affects lymphoid blasts, also known as lymphoblasts or blast cells.

In the normal functioning of the body, blood contains three main types of cells: red blood cells, white blood cells, and platelets, each with specific roles.

Red blood cells: These cells contain a protein called hemoglobin, which transports oxygen to all tissues in the body. Oxygen is essential for tissues, including muscles, to utilize the energy derived from food.

White blood cells: White blood cells play a crucial role in the immune system, helping to fight and prevent infections. There are various types of white blood cells, including lymphocytes, monocytes, and granulocytes.

Platelets: Platelets are responsible for sticking together and forming clots to stop bleeding, such as in the case of cuts or bruises.

In the context of ALL, the condition affects lymphoid blasts (or lymphoblasts) in the bone marrow. These lymphoid blasts are a type of white blood cell vital for the immune system. Normally, these blasts mature into lymphocytes, which play a key role in fighting infections.

However, in individuals with ALL, the normal process goes awry. The blast cells multiply excessively and fail to develop properly, leading to the formation of abnormal blast cells known as leukemia cells. These abnormal cells accumulate in the bone marrow, crowding out the space for normal blood cells to be produced. As a result, the body lacks a sufficient number of white blood cells, red blood cells, and platelets, leading to various symptoms.

There are different types of ALL. The treatment you have may be different depending on the specific type you’re diagnosed with. There are two types of lymphocytes known as B cells and T cells.

[Figure: blood cell lineages from myeloid and lymphoid stem cells, including red blood cells, neutrophils, B lymphocytes, T lymphocytes and natural killer cells]

Philadelphia positive ALL (Acute Lymphoblastic Leukemia) is a subtype of ALL characterized by a specific genetic abnormality that occurs in approximately one in four adults with ALL. This genetic fault is unique to the abnormal cells and is not inherited.

• Genetic Fault: The specific genetic fault responsible for Philadelphia positive ALL is not present throughout the entire body but is confined to the abnormal cells associated with leukemia. It is not an inherited genetic fault.
• Chromosomes and DNA: The genetic instructions within cells are stored in structures called chromosomes. Chromosomes are composed of DNA arranged in sections known as genes. Each cell in the body typically has 23 pairs of chromosomes. During cell division, chromosomes usually remain the same in each new cell.
• Translocation: Philadelphia positive ALL is thought to originate from a chance occurrence during cell division. A segment of chromosome 9, containing the ABL1 gene, becomes attached to a segment of chromosome 22, containing the BCR gene. This results in the formation of a new fusion gene called BCR-ABL1, causing chromosome 22 to be shorter than normal. This altered chromosome is known as the Philadelphia chromosome, named after its discovery in Philadelphia, USA. This genetic exchange is termed a translocation or chromosomal translocation and is specifically referred to as t(9;22).
• BCR-ABL1 Gene: The new BCR-ABL1 gene is a crucial factor in the development of Philadelphia positive ALL. It produces an abnormal tyrosine kinase, an enzyme that speeds up chemical processes. This protein induces the leukemia cells to divide more frequently and live longer than usual.

2.3.3 What is Acute Myeloid Leukaemia (AML)?

2.3.3.1 Definition of Acute Myeloid Leukaemia

Acute Myeloid Leukemia (AML) is initiated by DNA damage occurring in the bone marrow, and this form of damage is termed an acquired mutation. When the cells within the bone marrow sustain damage, the normal development of blood cells is disrupted. These damaged cells may transform into abnormal, cancerous cells referred to as blasts or myeloblasts, resembling healthy immature blast cells.

As these cancerous myeloblasts proliferate, they overcrowd the bone marrow, impeding the production of healthy cells. Additionally, they accumulate in the bloodstream, leading to a reduction in the number of functional blood cells. Consequently, individuals with AML often experience the following symptoms and signs:

• Anemia: Resulting from insufficient red blood cells, it can manifest as fatigue, weakness, and shortness of breath.
• Infections: Due to an inadequate number of mature neutrophils, individuals are more susceptible to infections.
• Easy Bruising or Bleeding: A low platelet count contributes to a tendency for easy bruising or bleeding.

AML primarily manifests in the blood and bone marrow, and in some cases, it may spread to other parts of the body such as lymph nodes, spleen, liver, brain, skin, and gums. Occasionally, AML cells can coalesce to form a solid tumor known as a myeloid sarcoma or chloroma, which can develop anywhere in the body. This condition is often referred to as extramedullary disease.

[Figure: myeloid and lymphoid stem cells giving rise to red blood cells, platelets and white blood cells]

There are different subtypes of AML, depending on which genetic changes (mutations) the AML cells have, which types of blood cells are affected, and whether your AML has been caused by previous cancer treatment or developed from a previous blood condition.

Impact of Initial and Prolonged Exposure to Carcinogen

Genetic Instability: Initial exposure to carcinogens may trigger genetic instability in hematopoietic stem cells, the precursors to blood cells. Carcinogens can induce genetic mutations, disrupting the normal functioning of these cells.

DNA Damage: Carcinogens have the potential to cause direct damage to the DNA within hematopoietic cells. This damage can lead to aberrant cell behavior, promoting the development of abnormal cells, such as leukemic blasts.

Initiation of Leukemia: The genetic alterations induced by initial exposure can initiate the process of leukemogenesis, where leukemic cells begin to proliferate and accumulate in the bone marrow. These early events set the stage for the development of leukemia.

Cumulative Genetic Changes: Repeated and regular exposure to carcinogens can result in cumulative genetic changes. Over time, the continuous assault on hematopoietic cells may lead to a higher frequency of mutations, increasing the risk of leukemogenic transformations.

Chronic Inflammation: Certain carcinogens may induce chronic inflammation, creating a microenvironment conducive to the survival and growth of leukemic cells. Inflammation-related factors can contribute to the progression of leukemia.

Latency Period: Prolonged exposure may extend the latency period between initial exposure and the clinical manifestation of leukemia. This latency period varies depending on the specific carcinogen, individual susceptibility, and overall health.

Occupational and Environmental Impact: In occupational settings, where individuals may experience repeated exposure, the risk of leukemia can be heightened. Regular exposure in high-risk environments, such as industries involving benzene or certain chemicals, may exacerbate the potential for leukemogenesis.

Prevention and Intervention: Regular exposure calls for vigilant preventive measures. Occupational safety protocols, environmental regulations, and personal protective equipment become crucial in minimizing the risk associated with repeated exposure to known carcinogens.

Monitoring and Early Detection: Given the cumulative nature of repeated exposure, regular health monitoring is essential. Routine health check-ups and surveillance for early signs of leukemia are critical for timely intervention and treatment.

Stage 0: A patient has high levels of white blood cells, but no other physical symptoms.

Stage 1: A patient has high levels of white blood cells and enlarged lymph nodes.

Stage 2: A patient has high levels of white blood cells and is anemic. He or she may also have enlarged lymph nodes.

Stage 3: A patient has high levels of white blood cells and is anemic. He or she may also have enlarged lymph nodes and/or an enlarged liver or spleen.

Stage 4: A patient has high levels of white blood cells and low platelets. He or she may also be anemic, have enlarged lymph nodes and have an enlarged liver or spleen.

Treatment of blood cancer with conventional therapies not only affects the cancerous tissue but also damages the surrounding tissues. The side effects observed with conventional therapies include muscle pain, blood disorders, nervous system malfunction, vomiting, mouth sores, and headache.

The patient's thinking and memory can be severely affected during chemotherapy (chemo brain). A few side effects persist for longer, including kidney failure, lung defects, and damage to the reproductive system and heart. A complete cure of blood cancer with conventional therapies takes a very long time. Changes occur rapidly during cancer treatment, and the effects are similar in adults and children. Personalized medicine helps the doctor treat the patient more effectively.

Treatment Available for Blood Cancer

[Figure: available treatment strategies for blood cancer, including large volume parenteral, small volume parenteral and radiation therapy]

In general, the algorithms Naive Bayes, Logistic Regression, SVM, Decision Tree, Random Forest, XGBoost, Adaboost, and Neural Network are categorized as machine learning and deep learning algorithms. Both fields are concerned with building models from data to perform tasks such as classification, prediction, clustering, and learning complex data structures. Additionally, Random Forest, XGBoost, and Adaboost are considered ensemble learning algorithms, as they combine multiple models to enhance performance.

• Classification Algorithms:

  ◦ Naive Bayes: Based on Bayes' theorem to classify data with the "naive" assumption of independence between features.

  ◦ Logistic Regression: Uses the logistic function to predict the probability of belonging to a certain class.

  ◦ SVM (Support Vector Machines): Finds the best separating hyperplane between classes using support vectors.

  ◦ Decision Tree: Builds a tree of questions about feature values and follows it to classify data.

  ◦ Random Forest: Ensemble method that combines multiple decision trees to reduce overfitting and improve accuracy.

  ◦ XGBoost (eXtreme Gradient Boosting): Gradient boosting method that adds decision trees sequentially, each one correcting the errors of the trees before it, and uses regularization to reduce overfitting.

  ◦ Adaboost (Adaptive Boosting): Boosting technique that strengthens weak learners by focusing on misclassified samples.

  ◦ Neural Network: Loosely modeled on the structure of the human brain, with layers of neurons that learn complex data structures.

• Clustering Algorithm:

  ◦ KMeans Clustering: Groups data into clusters based on their similarity, with the number of clusters predetermined.

• Ensemble Learning: Random Forest, XGBoost, Adaboost.

  ◦ These algorithms fall under ensemble learning, combining multiple models to create a stronger model than any single model.

Summary:

• Classification: Naive Bayes, Logistic Regression, SVM, Decision Tree.

• Clustering: KMeans Clustering.

• Ensemble Learning: Random Forest, XGBoost, Adaboost.

• Deep Learning: Neural Network.
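To make the idea of "combining multiple models" concrete, the sketch below shows one simple combination rule, a weighted vote over the class labels predicted by several base models. It is an illustration only: the function name `weighted_vote`, the example votes, and the weights are assumptions, and real Random Forest, XGBoost, and Adaboost implementations combine their trees in more elaborate ways.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Combine class predictions from several models into one label."""
    if weights is None:
        weights = [1.0] * len(predictions)  # plain (unweighted) majority vote
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # The label with the largest accumulated weight wins.
    return max(totals, key=totals.get)


# Three hypothetical base models disagree on one sample.
votes = ["AML", "ALL", "AML"]
majority = weighted_vote(votes)

# Trusting the second model much more than the others flips the decision.
weighted = weighted_vote(votes, weights=[0.2, 0.9, 0.3])
```

With equal weights the two "AML" votes win; with the assumed weights the single, more trusted "ALL" vote outweighs them.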

The Naive Bayes [7] algorithm is a classification technique based on Bayes' Theorem with an independence assumption among predictors. In simple terms, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

The Naive Bayes classifier belongs to the family of generative learning algorithms and is widely used for classification tasks such as text classification. It models the distribution of inputs for a given class under the assumption that features are conditionally independent given the class, which makes prediction fast and often accurate.
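The conditional-independence assumption can be sketched in a few lines of plain Python. The example assumes a Gaussian likelihood for each feature, which is one common choice for continuous values such as expression levels; the thesis does not state which variant it used. The function names, the variance floor, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log-prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        # Independence assumption: the log-likelihood is a sum over features.
        score = log_prior
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy expression values for two hypothetical genes (not thesis data).
X_train = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1], [3.0, 1.0], [3.2, 1.1], [2.9, 0.8]]
y_train = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
model = fit_gaussian_nb(X_train, y_train)
prediction = predict_gaussian_nb(model, [1.1, 4.9])
```

Working in log space keeps the product of many small per-feature probabilities from underflowing, which matters when the number of genes is large.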

Adaboost

4.4.7.1 Confusion Matrix of Adaboost

a. Using Origin Data

Figure 55 Confusion Matrix of Adaboost Model with Origin Data

Table 33 Classification Report Confusion Matrix of Adaboost Model with Origin Data

Figure 56 Confusion Matrix of Adaboost Model with Data Scaler

Table 34 Classification Report Confusion Matrix of Adaboost Model with Data Scaler

Figure 57 Confusion Matrix of Adaboost Model with PCA

Table 35 Classification Report Confusion Matrix of Adaboost Model with PCA

4.4.7.2 Classification Report Confusion Matrix of Adaboost Model

Table 36 Classification Report Confusion Matrix of Adaboost Model

4.4.8.1 Confusion Matrix of Neural Network

a. Using Origin Data

Figure 58 Confusion Matrix of Neural Network Model with Origin Data

Table 37 Classification Report Confusion Matrix of Neural Network Model with Origin Data

Figure 59 Confusion Matrix of Neural Network Model with Data Scaler

Table 38 Classification Report Confusion Matrix of Neural Network Model with Data Scaler

4.4.8.2 Classification Report Confusion Matrix of Neural Network Model

Table 39 Classification Report Confusion Matrix of Neural Network Model

4.4.9.1 Confusion Matrix of K-means Clustering

a. Using Origin Data

Table 40 Classification Report Confusion Matrix of K-means Clustering Model with Origin Data

Figure 61 Confusion Matrix of K-means Clustering Model with Data Scaler

Table 41 Classification Report Confusion Matrix of K-means Clustering Model with Data Scaler

Figure 62 Confusion Matrix of K-means Clustering Model with PCA

Table 42 Classification Report Confusion Matrix of K-means Clustering Model with PCA

4.4.9.2 Classification Report Confusion Matrix of K-means Clustering Model

Table 43 Classification Report Confusion Matrix of K-means Clustering Model
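For readers unfamiliar with how a confusion matrix and a classification report relate, the following is a minimal sketch of the standard definitions for a two-class problem. The helper names and the labels are made up for illustration and do not reproduce any result from the thesis.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN with one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def class_report(y_true, y_pred, positive):
    """Precision, recall and F1 for one class; 0.0 where undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels for illustration only (not results from the thesis).
y_true = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
y_pred = ["ALL", "ALL", "AML", "AML", "AML", "ALL"]

counts = confusion_counts(y_true, y_pred, positive="ALL")
precision, recall, f1 = class_report(y_true, y_pred, positive="ALL")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Accuracy summarizes the diagonal of the matrix, while precision and recall describe the errors for each class separately, which is why both views are usually reported together.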

Model Evaluation

Compare Evaluation of Built Models

No. | Algorithm | Accuracy (rate) | Time (seconds)
7 | XGB - PCA with Grid Search | 0.824 | 275.071393
8 | XGB - no PCA or Grid Search | 0.647 | 0.029343
9 | XGB - no PCA and no Grid Search | 0.812 | 4.196286

Figure 63: Accuracy and execution time.

Summarizing the results of the various algorithms, we chose accuracy as the metric for comparing the models and selected the model with the highest accuracy.
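The selection rule described here can be sketched as a simple ranking. The example uses only the three XGB rows legible in the comparison table above; breaking ties by the shorter execution time is an assumption added for illustration, not something the thesis states.

```python
# Rows 7-9 of the comparison table: (label, accuracy, seconds).
results = [
    ("XGB - PCA with Grid Search", 0.824, 275.071393),
    ("XGB - no PCA or Grid Search", 0.647, 0.029343),
    ("XGB - no PCA and no Grid Search", 0.812, 4.196286),
]

# Rank by accuracy (descending); ties go to the shorter run time (assumed rule).
ranked = sorted(results, key=lambda row: (-row[1], row[2]))
best_label, best_accuracy, best_seconds = ranked[0]
```

Among these three rows the grid-searched configuration ranks first on accuracy, at the cost of a much longer execution time than the other two.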

The chart below compares the accuracy of different algorithms, indicating that Logistic Regression and SVM yield the highest accuracy.

Figure 64 Highest accuracy results based on the models built above. Series shown: Logistic Regression, Support Vector Machine, Random Forest, Decision Tree, Neural Network, Naive Bayes, XGB - PCA with Grid Search, XGB - no PCA or Grid Search. Values legible for the Data Scaler variant: 91.2%, 88%, 78.6%, 100%.

4.1.3 Conclusion of comparing the evaluation of the models

The analysis of the results suggests that Logistic Regression achieved a perfect result, which might indicate overfitting or issues such as data leakage, especially if the dataset is imbalanced. Overfitting occurs when a model learns the training data too well, capturing noise or specific patterns that do not generalize to new, unseen data. It is crucial to further investigate and validate the performance of Logistic Regression on new data.
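One common source of data leakage is fitting a preprocessing step, such as a scaler, on the full dataset before splitting it. The sketch below shows a leakage-free ordering under stated assumptions: the data is split first, the scaling statistics are computed from the training rows only, and the same statistics are then applied to the held-out rows. The helper names, the split fraction, and the toy data are hypothetical and are not taken from the thesis pipeline.

```python
import random
import statistics


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle indices once and split them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return take(train_idx), take(test_idx)


def fit_scaler(rows):
    """Per-feature mean and standard deviation, from the given rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard constants
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows using previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy data (hypothetical, not from the thesis).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0],
     [5.0, 50.0], [6.0, 60.0], [7.0, 70.0], [8.0, 80.0]]
y = ["ALL", "ALL", "ALL", "ALL", "AML", "AML", "AML", "AML"]

(X_train, y_train), (X_test, y_test) = train_test_split(X, y)
means, stds = fit_scaler(X_train)  # the test rows never influence these values
X_train_scaled = apply_scaler(X_train, means, stds)
X_test_scaled = apply_scaler(X_test, means, stds)
```

The same ordering applies to PCA or feature selection: any step that learns from the data should be fitted inside the training portion of each split, so that a perfect score cannot come from information about the test samples.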

Support Vector Machine (SVM), AdaBoost, and XGBoost without PCA or Grid Search did not provide optimal accuracy. This may be attributed to the need for hyperparameter tuning or feature engineering to enhance their performance. It is essential to explore various configurations and parameter settings to optimize these models.

Random Forest and Naive Bayes demonstrated maximum accuracies of 94.1% and 91.2%, respectively. Random Forest's longer execution time, exceeding 800 seconds, suggests that fine-tuning its numerous parameters may improve both its speed and accuracy. Careful adjustment of these parameters could enhance the overall performance of Random Forest.

Neural Network and XGBoost achieved accuracies of 85.2% and 82.4%, respectively. These results might indicate the need for additional hyperparameter tuning or for more complex neural network architectures to improve their classification performance.

The poor performance of the unsupervised algorithm in gene classification highlights the importance of choosing appropriate algorithms for specific tasks. Unsupervised algorithms such as clustering might not be well suited to classification tasks without proper feature engineering or preprocessing.

In conclusion, while Logistic Regression outperforms the other models in terms of both time and accuracy, further investigation and fine-tuning are essential to ensure the robustness and generalizability of the model. Exploring hyperparameter tuning, feature engineering, and potentially more sophisticated models may lead to improved results for the other algorithms, providing a more comprehensive understanding of their capabilities and limitations in the context of gene expression classification.

CONCLUSION AND FUTURE RESEARCH DIRECTIONS

Overall Conclusion

• The research on "Machine Learning Methods for Cancer Classification Using Gene Expression Data: Classifying Patients with Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL)" not only yields significant findings regarding the ability to classify cancer patients based on gene expression data but also lays the foundation for the widespread application of machine learning in the fields of diagnosis and cancer classification. Logistic Regression stands out with the highest accuracy, while XGBoost - PCA has the shortest execution time, providing flexible options depending on application requirements.

• Importantly, this research opens up a new direction by utilizing gene expression data for cancer classification, enhancing our understanding of the complex interaction between genes and pathology. The success of Logistic Regression and XGBoost - PCA not only proves their applicability in this domain but also instills confidence in the potential of machine learning to improve cancer diagnosis and treatment.

1. Technology Transfer to Clinical Practice:

• Future research should focus on transferring the classification model's results into practical applications in the medical field, possibly developing diagnostic support tools for healthcare professionals.

2. Integrating Genomic Data with Other Types of Data:

• Expand the research by integrating gene expression data with other types such as genomic, proteomic, or environmental data to enhance the model's accuracy and better understand influencing factors in cancer classification.

3. Exploring Model Generalization on a Large Scale:

• Broaden the research scope by testing the model on a large volume of data from diverse sources to ensure generalizability and applicability to diverse patient groups.

4. Investigating the Impact of Ethnicity and Gender Characteristics:

• Analyze the impact of ethnicity and gender characteristics on the model's accuracy, aiming to optimize the model for different ethnic and gender characteristics.

5. Developing Multilayered Models:

• Research may explore the development of multilayered models, combining information from multiple data types and using multiple classification levels to create a more complex and accurate model.

Date posted: 02/10/2024, 02:55