Research on the application of deep learning techniques to multiclass segmentation of biomedical images


HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER THESIS

Research on application of deep learning approach in Multiclass Segmentation for Medical Images

TRINH MINH NHAT

Control Engineering and Automation

Supervisor: Assoc. Prof. PhD. Van-Truong Pham

Advisor's signature

School: School of Electrical and Electronic Engineering

HA NOI, 9/2023


I would like to take a moment to express my gratitude to the individuals who have been instrumental in shaping the trajectory of my academic journey and the completion of this report. Assoc. Prof. Van-Truong Pham, my mentor and guide, has been a constant source of inspiration. His dedication to my growth as a researcher and his insightful guidance have been invaluable. His willingness to share his expertise and time has been instrumental in helping me navigate through challenges and overcome obstacles.

I also want to extend my heartfelt appreciation to Assoc. Prof. Thi-Thao Tran, whose contributions have left a lasting impact on my work. Though not my primary advisor, her constructive feedback, suggestions, and discussions have added depth and perspective to my research. Her commitment to fostering a culture of learning and exploration has been immensely beneficial.

In the broader context, I am deeply thankful for the unwavering support of my family. Their belief in my capabilities and their encouragement during both the highs and lows have been my driving force. Their sacrifices and unwavering faith have kept me motivated, and I am indebted to them for their constant presence in my life.

Lastly, I would like to acknowledge the academic community, my peers, and fellow researchers who have provided valuable insights and diverse perspectives that have enriched my work. The exchange of ideas and collaborative discussions have shaped my understanding and contributed to the depth of this report.

As I reflect upon this journey, I am filled with gratitude for the people who have contributed to this report's fruition. Their collective efforts have not only facilitated the completion of this project but have also nurtured my growth as a learner and researcher.


In recent times, the prominence of deep learning-based techniques for medical image segmentation has surged. These approaches primarily revolve around innovating architectural designs and refining loss functions. Conventional loss functions in this context often rely on global measures, such as Cross-Entropy and Dice Loss, or overall image intensity, yet they may fall short in addressing complexities like occlusion and intensity variations. In response, this study introduces an original loss function, melding both local and global image features, reformulated within the Mumford-Shah framework. This novel approach is extended to the domain of multiclass segmentation. The proposed deep convolutional neural network leverages this new loss function to facilitate end-to-end training while concurrently achieving multi-class segmentation. Furthermore, motivated by the PiDiNet architecture, I propose a new Attention-PiDi-UNet architecture. This augmentation empowers the model to fuse contextual information across dense layers, efficiently capture semantic insights, and avert overfitting, resulting in precise segmentation outcomes. The proposed approach is rigorously assessed across four distinct biomedical segmentation datasets encompassing various imaging modalities, spanning 2D to 3D dimensions, including dermoscopy, cardiac magnetic resonance, and brain magnetic resonance. Evaluation results on datasets like Lesion Boundary Segmentation, the dermoscopic dataset, automated cardiac diagnosis, and 6-month infant brain MRI Segmentation corroborate the algorithm's superior performance compared to existing state-of-the-art methods. This robustly underscores the potency of our multiclass segmentation approach for diverse biomedical images.

Student’s signature


CHAPTER 2 THEORETICAL BASIS

2.1 Artificial Neural Networks

2.2 Deep neural network

2.2.1 Glorot and He Initialization

2.2.2 Non-Saturating Activation Functions

2.2.3 Batch Normalization

2.2.4 Dropout

2.3 Convolution Neural Network

2.3.1 The Architecture of the Visual Cortex

2.6.1 Mumford-Shah Loss

2.6.2 Active Contour Method

CHAPTER 3 THE PROPOSED APPROACH

3.1 Datasets

3.1.1 The ISIC-2018 Dataset

3.1.2 The PH2 Dataset

3.1.3 Automated Cardiac Diagnosis Challenge (ACDC)

3.1.4 6-month Infant Brain MRI Segmentation (iSeg) Dataset

4.1.3 Recall, Precision, Sensitivity and Specificity

4.1.4 Modified Hausdorff distance [79]

4.1.5 Average surface distance [79]

4.2 Results on ISIC-2018 Dataset

4.3 Results on PH2 Dataset

4.4 Results on ACDC Dataset

4.5 Results on iSeg-2017 Dataset

4.6 Performance of the Proposed Loss


LIST OF FIGURES

Figure 2.1 Threshold logic unit: an artificial neuron applies a step function after calculating the weighted sum of its inputs [39]

Figure 2.2 Perceptron architecture with two input neurons, one bias neuron, and three output neurons [39]

Figure 2.3 Multilayer Perceptron architecture with two inputs, four neurons in one hidden layer, and three neurons in the output layer [39]

Figure 2.4 Hidden layers in Deep Neural Network [41]

Figure 2.5 Logistic activation function saturation [39]

Figure 2.6 ReLU activation

Figure 2.7 Swish activation

Figure 2.8 Mish activation [50]

Figure 2.9 With dropout regularization, a random set of all the neurons in one or more layers, with the exception of the output layer, is "dropped out" in each training iteration [39]

Figure 2.10 As the visual signal progresses through the brain, neurons respond to more complex patterns in larger receptive fields [39]

Figure 2.11 Square local receptive fields in CNN layers [39]

Figure 2.12 Relations between layers and zero padding [39]

Figure 2.13 Reducing the dimensionality of the input feature map using a stride with

Figure 2.16 Max pooling layer with a 2 × 2 pooling kernel, no padding, and a stride of 2 [39]

Figure 2.17 Invariance to small translations [39]

Figure 2.18 Example of semantic segmentation [39]

Figure 2.19 Upsampling utilizing a Transposed Convolutional Layer [39]

Figure 2.20 Spatial resolution from lower layers is recovered by Skip layers [39]

Figure 2.21 CDCM Module [32]

Figure 3.1 My proposed network architecture Attention-PiDi-UNet

Figure 4.1 The representative segmentation results of my method on skin lesions of different sizes from my testing set in the ISIC-2018 dataset

Figure 4.2 Representative results in PH2 dataset

Figure 4.3 The representative result of the right ventricle (yellow), myocardium (green), and left ventricle (blue) of three examples using my method on the ACDC 2017 challenge

Figure 4.4 The representative result on various slices of testing sample IDs 11, 16, and 17, respectively. The T1 weighted, the T2 weighted, and my segmentation result are indicated from left to right, respectively

Figure 4.5 The learning curves of the proposed method when training on images from four databases in terms of average DSC of classes: (a) the ISIC-2018 dataset, (b) the PH2 dataset, (c) the ACDC dataset, (d) the iSeg-2017 challenge


LIST OF TABLES

Table 4.1 Comparison with other popular approaches on the ISIC 2018 dataset. Results have been taken from [80] except for the last four methods.

Table 4.2 Comparison with other popular approaches on the PH2 dataset. Results have been taken from [81] except for the last four methods.

Table 4.3 Comparison with other popular approaches on the ACDC dataset. DSCs on RV, Myo, LV and the average DSC have been calculated. Results have been taken from [82] except for the last four methods.

Table 4.4 The DSC, MHD, ASD, and average metrics of segmented classes in the validation dataset for 4 of the top 8 teams in [5] of the iSeg-2017 challenge and my proposed approach (MHD: mm, ASD: mm).

Table 4.5 Comparison with other loss functions in DSC on three of the datasets.


LIST OF SYMBOLS

µ_B The mean of the input vector, assessed over the entire mini-batch B.

σ_B The standard deviation of the input vector.

m_B The number of instances in the mini-batch.

x̂^(i) The normalized inputs for instance i.

β The output shift (offset) parameter vector for the layer.

ε Small number which prevents division by zero (commonly 10^−7).

z^(i) The output of the Batch Normalization operation.

E_i Output of the ith encoder block.

H_i Height of the ith output feature map.

W_i Width of the ith output feature map.

P_vi(θ) Softmax output of the ith class for the vth pixel.


CHAPTER 1 INTRODUCTION

1.1 Motivation for Participating in Medical Image Segmentation Challenges

Medical image segmentation challenges provide a unique opportunity for researchers and practitioners to address critical problems in healthcare through the development of advanced computational techniques. By participating in these challenges, participants aim to contribute to the improvement of diagnosis, treatment planning, and patient care. In this section, I discuss the motivations behind participating in four specific medical image segmentation challenges: the Lesion Boundary Segmentation challenge on the ISIC-2018 dataset, the dermoscopic PH2 database, the 2017 MICCAI sub-challenge on automatic cardiac diagnosis benchmark, and the 6-month infant brain MRI Segmentation (iSeg) benchmark.

Skin cancer is a prevalent and potentially deadly condition that demands early and accurate detection. The ISIC-2018 challenge [1, 2] and the PH2 challenge [3] focus on segmenting skin lesions, aiming to improve the accuracy and efficiency of diagnosis. The motivation to participate in these challenges arises from the urgent need to develop automated segmentation methods that can assist dermatologists in identifying and diagnosing skin cancer. Successful segmentation of lesion boundaries can enable more accurate diagnosis and early intervention, ultimately enhancing patient outcomes. Participating in these challenges provides an opportunity to contribute to dermatological research, develop advanced segmentation techniques, and potentially revolutionize skin cancer diagnosis.

Cardiovascular diseases are a leading cause of mortality globally. The MICCAI sub-challenge on automatic cardiac diagnosis [4] addresses the need for accurate cardiac segmentation to aid in diagnosing heart conditions. The motivation to participate in this challenge stems from the potential to advance cardiac imaging and diagnosis through automated segmentation methods. Precise segmentation of cardiac structures can assist cardiologists in assessing heart function, identifying anomalies, and guiding treatment decisions. By participating in this challenge, researchers can collaborate with experts in cardiology, contribute to cutting-edge medical research, and develop solutions that have a direct impact on patient care.

Segmentation of infant brain MRI scans is crucial for studying early brain development and identifying abnormalities. The iSeg benchmark challenge [5] focuses on accurate segmentation of infant brain structures, aiding in early detection of neurological disorders. The motivation to participate in this challenge lies in the potential to contribute to pediatric neuroimaging and improve the understanding of infant brain development. Accurate segmentation of brain structures can assist clinicians and researchers in diagnosing conditions and monitoring developmental milestones. Participation in the iSeg benchmark offers the chance to advance pediatric imaging, collaborate with experts in the field, and create tools that facilitate early intervention and improved patient outcomes.

In conclusion, participating in medical image segmentation challenges provides a unique avenue to address critical healthcare challenges. The motivations behind participating in these challenges range from improving diagnosis accuracy and treatment planning to collaborating with experts and contributing to cutting-edge medical research. These challenges offer a platform for researchers to develop and showcase innovative solutions that have the potential to revolutionize healthcare practices and enhance patient care.

1.2 Advancements in Medical Image Segmentation and Innovative Approaches

Image segmentation is a pivotal and challenging topic in the field of computer vision [6]. Its objective is to partition an image in a way that accurately locates, identifies, and quantifies objects. This process holds crucial importance in medical imaging, supporting additional clinical analysis, diagnosis, therapy planning, and disease progression measurement. Within the domain of medical image segmentation, several primary obstacles exist. These include a scarcity of well-labeled benchmarks for training, a deficiency of annotated images [7], a lack of consistent segmentation techniques, poor image resolution, and significant variability in image quality across patients [8]. Precise calculation of segmentation accuracy and uncertainty is vital for gauging performance in other applications [9]. Consequently, this underscores the imperative for advanced methodologies, such as Artificial Intelligence (AI)-based approaches, to enable automated, generalizable, and efficient medical image segmentation.

In the context of developing AI systems, the attributes of generalization and robustness bear critical significance, particularly in clinical trials [10]. Consequently, the development of a resilient architecture suited for diverse biomedical applications becomes paramount. Recently, convolutional neural networks (CNNs) have emerged as advanced tools for automating the segmentation of medical images [11–13]. This includes various modalities such as X-rays, CT scans, and MRIs, with promising outcomes compared to conventional segmentation methods [14, 15]. Among different CNN versions, encoder-decoder networks like Fully Convolutional Networks (FCN) [16] and their advancements such as U-Net [17] have gained substantial traction as semantic segmentation techniques for 2D images. A deep fully convolutional neural network designed for semantic pixel-wise segmentation that requires fewer trainable parameters yet yields high-quality segmentation maps was introduced by [18]. Addressing dense prediction challenges, a novel convolutional network module was proposed by [19]. This module utilized dilated convolutions to systematically aggregate multi-scale contextual features, resulting in a significant performance enhancement for advanced automated segmentation systems. Moreover, [20] introduced DeepLab as a segmentation method. DeeplabV3 [21], without DenseCRF fine-tuning, demonstrated considerable improvements over earlier DeepLab iterations, utilizing a synthetic approach with fewer convolutional layers than FCN and U-Net architectures, along with skip connections between the encoder and decoder paths. An efficient scene parsing network for comprehending complex receptive fields was proposed by [22]. This approach utilized global pyramidal characteristics to facilitate the acquisition of additional contextual information.

Throughout the training process, CNN model parameters are typically refined using gradient descent techniques, as outlined by [23], wherein errors are quantified by a loss function that contrasts predicted labels against ground truth labels. For classification endeavors, prevalent loss functions encompass cross-entropy (CE) and the L2 norm, often referred to as the mean squared error (MSE), as frequently cited in the works [24, 25]. Conversely, problems centered on segmentation have commonly engaged the Dice Coefficient (DC) and cross-entropy (CE) [17, 26]. Despite the recent strides made in CNN deployment for biomedical image segmentation, prevalent loss functions frequently revolve around pixel-wise similarity evaluation. Notably, the DC and CE are tailored towards specific region feature extraction. While this framework often yields impressive classification and segmentation outcomes, low loss function values do not always signify meaningful segmentation. Instances arise where noisy images produce several indistinct contours, signaling erroneous predictions, and the indistinctness of object boundaries stems from the difficulty in classifying pixels near the contour. An additional challenge arises from susceptibility to local minima due to aberrations within the training database, high dimensionality, and the non-convex attributes of loss functions, as discussed in [27].

Among frequently used deep-CNN approaches, the fully convolutional network (FCN) [28] and U-Net [17] are designed so that deconvolutional operations replace fully connected layers to strengthen temporal coherence; skip connections are also used to carry spatial information into deeper layers. Depthwise convolution [29], which is a channel-wise n × n spatial convolution, splits the input into its channels, convolves each channel with its own kernel, and then stacks the resulting channels back together. Pointwise convolution [29] is a 1 × 1 convolution operation for adjusting the feature map dimension. A Depthwise Separable convolution [29] is defined as the depthwise convolution followed by the pointwise convolution, which helps prevent the model from overfitting by reducing the number of connections in the model.
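The reduction in connections can be made concrete with a small weight count. The sketch below is only illustrative: the function names and the channel sizes (64 input channels, 128 output channels, a 3 × 3 kernel) are assumptions chosen for the example, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration
standard = standard_conv_params(c_in=64, c_out=128, k=3)
separable = depthwise_separable_params(c_in=64, c_out=128, k=3)
print(standard, separable)  # 73728 8768
```

For these sizes the separable version needs roughly one eighth of the weights of the standard convolution.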

Dilated convolution [30] expands the window size without increasing the number of weights by inserting zero values into convolution kernels while maintaining the computation cost. Adaptive Dilated Convolution [31] generates and fuses multi-scale features of similar spatial sizes by setting various dilation rates for different channels. Applying dilated convolution, the Compact Dilation Convolution-based Module (CDCM) [32] is adopted in my proposed model to extract more useful features.
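As a minimal sketch of the zero-insertion idea, the NumPy snippet below builds the dilated version of a 3 × 3 kernel. The helper name `dilate_kernel` and the example values are assumptions made for illustration; deep-learning libraries implement dilation without materializing the zeros.

```python
import numpy as np


def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring taps of a 2D kernel."""
    k_h, k_w = kernel.shape
    out = np.zeros(((k_h - 1) * rate + 1, (k_w - 1) * rate + 1), dtype=kernel.dtype)
    out[::rate, ::rate] = kernel  # original weights land on a strided grid
    return out


kernel = np.arange(1, 10, dtype=float).reshape(3, 3)
dilated = dilate_kernel(kernel, rate=2)

print(dilated.shape)              # (5, 5): the window grows from 3 x 3 to 5 x 5
print(np.count_nonzero(dilated))  # 9: the number of weights is unchanged
```

In general, a k × k kernel with dilation rate d covers a window of side (k − 1)·d + 1 while keeping k² weights.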

Region-based Tversky loss [33] and Focal Tversky loss [34] control the information flow implicitly through pixel-level affinity and tackle class-imbalance problems; however, they do not optimize contours well enough. There has been ongoing interest in exploiting active contour models as loss functions in deep-learning solutions for better contour optimization. The region-based active contour Chan-Vese model [35] has been successful for training on images with two regions, each having a different mean pixel intensity. The LMS loss [36] is obtained by combining the advantages of the Mumford-Shah functional and the AC loss, with some adjustments. To meet the requirements of boundary optimization and to address the class-imbalance problem, I propose a new Focal Active Contour loss function.

This study yields several noteworthy contributions:

• Innovative Loss Function: I introduce a novel loss function tailored for the training process of deep-learning models. By incorporating elements of active contour methodology into the loss functions, I aim to tackle a persistent challenge encountered in medical imaging and computer vision: the problem of intensity inhomogeneity within image data. This amalgamation of techniques offers a promising avenue to address this issue effectively. It not only helps deep-learning models achieve more accurate and robust segmentation results but also paves the way for more precise and reliable image analysis across various applications, ultimately advancing the capabilities of AI-driven solutions in the field.

• End-to-End CNN Model Development: Inspired by PiDiNet, I propose a new architecture that reshapes this network from an FCN shape into a U-Net shape, uses CDCM modules (not followed by CSAM), and combines them with an Attention module and a Depthwise-and-Pointwise module.

• Thorough Evaluation and Comparison: A comprehensive evaluation of both my proposed model and the introduced loss function is conducted across 2D and 3D datasets. These evaluations are benchmarked against existing state-of-the-art methods. Notably, my approach consistently demonstrates promising outcomes when compared to baseline algorithms. This observation is substantiated across diverse datasets including the Lesion Boundary Segmentation ISIC-2018 dataset, the dermoscopic PH2 dataset, the 2017 MICCAI sub-challenge on automatic cardiac diagnosis benchmark, and the 6-month infant brain MRI Segmentation (iSeg) benchmark.


CHAPTER 2 THEORETICAL BASIS

2.1 Artificial Neural Networks

Deep learning is an important machine learning technique. It teaches a computer to filter inputs through layers in order to predict and categorize data. Observations may take the form of images, text, or sound. The way the human brain filters information is the inspiration behind deep learning, whose aim is to imitate how the human brain works. There are about 100 billion neurons in the human brain, and a single neuron interacts with approximately 100,000 of its peers. Deep learning attempts to build a similar structure in a form that a computer can handle. A neuron (or node) receives one or more signals (input values) that pass through it, and it transmits an output signal. This information is broken down into numbers and bits of binary data that a computer can understand.

What about synapses? Each synapse is assigned a weight, and these weights are central to Artificial Neural Networks (ANNs). Weights are how ANNs learn: by adjusting the weights, the ANN decides to what degree signals get passed along, and the weights are adjusted while the network is trained. Decades ago, McCulloch and Pitts proposed a very simple model of the biological neuron [37], with one or more binary (on/off) inputs and one binary output, which later became known as an artificial neuron. The artificial neuron activates its output when more than a certain number of its inputs are active. They demonstrated in their paper that even with such a simplistic model, a network of artificial neurons can be built to compute any logical proposition.

The Perceptron, one of the most basic ANN architectures, was created by Frank Rosenblatt [38]. It is based on a slightly different artificial neuron (Figure 2.1) called a threshold logic unit (TLU), or sometimes a linear threshold unit (LTU). The inputs and outputs are now numbers (rather than binary on/off values), and each input connection has a weight assigned to it. The TLU calculates a weighted sum of its inputs ($z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = x^T w$), then applies a step function to that sum and outputs the result: $h_w(x) = \mathrm{step}(z)$, where $z = x^T w$.

Figure 2.1 Threshold logic unit: an artificial neuron applies a step function after calculating the weighted sum of its inputs [39]

A Perceptron comprises a single layer of TLUs, each connected to all the inputs. A layer in which every neuron is connected to every neuron in the preceding layer is called a fully connected layer or a dense layer. The Perceptron's inputs are fed to input neurons, which serve as pass-through units that directly output whatever input they receive. Together, these input neurons constitute the input layer. In addition, a bias term is commonly added ($x_0 = 1$), typically through a special neuron known as a bias neuron, which always outputs 1. Figure 2.2 shows a Perceptron with two inputs and three outputs. In this case, the Perceptron functions as a multi-output classifier that simultaneously assigns instances to three distinct binary classes.

Figure 2.2 Perceptron architecture with two input neurons, one bias neuron, and three output neurons [39]

Perceptrons are trained using a rule that takes into account the network's error when making predictions. The Perceptron learning rule reinforces the connections that help reduce the error [40]. In greater detail, the Perceptron is fed one training instance at a time and makes a prediction for each instance. For every output neuron that produces an incorrect prediction, the connection weights from the inputs that would have contributed to the correct prediction are reinforced. This rule is given by Equation 2.1:

$w_{i,j}^{(\text{next step})} = w_{i,j} + \eta \, (y_j - \hat{y}_j) \, x_i$ (2.1)

In this equation:

• $w_{i,j}$ is the connection weight between the ith input neuron and the jth output neuron.

• $x_i$ is the ith input value of the current training instance.

• $y_j$ is the target output of the jth output neuron for the current training instance.

• $\hat{y}_j$ is the output of the jth output neuron for the current training instance.

• $\eta$ denotes the learning rate during training (typically adjusted as needed).
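The TLU computation and the learning rule can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this work: the step function is assumed to fire at z ≥ 0, the learning rate is set to 1, the update multiplies the error by the input value x_i as in the standard Perceptron rule, and the toy data (logical AND and OR of two binary inputs) and all function names are chosen only for the example.

```python
import numpy as np


def step(z):
    """Step function of the TLU (assumed here to output 1 when z >= 0)."""
    return (z >= 0).astype(float)


def add_bias(X):
    """Prepend the bias input x0 = 1 to every instance."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def predict(X, W):
    """Each output neuron computes step(x^T w) over the biased inputs."""
    return step(add_bias(X) @ W)


def train_perceptron(X, Y, eta=1.0, epochs=20):
    """Apply the Perceptron learning rule one training instance at a time."""
    Xb = add_bias(X)
    W = np.zeros((Xb.shape[1], Y.shape[1]))
    for _ in range(epochs):
        for x, y in zip(Xb, Y):
            y_hat = step(x @ W)
            # w_ij <- w_ij + eta * (y_j - y_hat_j) * x_i for all i, j at once
            W += eta * np.outer(x, y - y_hat)
    return W


# Toy multi-output problem: column 0 is AND, column 1 is OR of the two inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0, 0], [0, 1], [0, 1], [1, 1]], dtype=float)

W = train_perceptron(X, Y)
print(predict(X, W))
```

Both targets are linearly separable, so the rule settles on weights that classify all four instances correctly; a target such as XOR would not be learnable by this single layer.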

Because the decision boundary of each output neuron is linear, Perceptrons inherently struggle to capture intricate patterns. However, stacking multiple Perceptrons mitigates some of these limitations. The resulting structure is known as a Multilayer Perceptron (MLP), as illustrated in Figure 2.3. The architecture encompasses an input layer (comprising pass-through neurons), one or more hidden layers of TLUs, and finally an output layer of TLUs. Every layer except the output layer incorporates a bias neuron and is fully connected to the next layer.

Figure 2.3 Multilayer Perceptron architecture with two inputs, four neurons in one hidden layer, and three neurons in the output layer [39]

A deep neural network (DNN) is an ANN with a large number of hidden layers.

2.2 Deep neural network

Deep Learning revolves around the exploration of deep neural networks (DNNs), which frequently consist of intricate sequences of computations. Representing the output of hidden layers as $h^{(l)}(Z)$, the computation for a neural network with L hidden layers is depicted as:

Each pre-activation function $z^{(l)}(a)$ entails a linear operation governed by the weight matrix $W^{(l)}$ and bias $b^{(l)}$:

The activation functions within the hidden layers, denoted as $h^{(l)}(Z)$, are typically the same across layers. However, there are instances where distinct activation functions are employed to serve specific purposes. For further clarity, the feedforward process from the (l − 1)th to the lth layer is illustrated in Figure 2.4.

Figure 2.4 Hidden layers in Deep Neural Network [41]

For years, researchers struggled to train Multilayer Perceptrons (MLPs) effectively. However, in 1986, David Rumelhart introduced a groundbreaking approach [42] that revolutionized the field. This approach implemented the backpropagation training algorithm, which remains a cornerstone of neural network training. In essence, it is Gradient Descent [43] combined with an efficient means of automatically calculating gradients. The backpropagation algorithm computes the gradient of the network's error with respect to each model parameter in just two passes through the network, one forward and one backward. It thereby determines how each connection weight and bias term should be adjusted to reduce the error. It then performs a regular Gradient Descent step using these gradients and repeats the whole process until the network converges to a solution. Key aspects of the backpropagation algorithm include:

• Mini-Batch Processing and Epochs: The algorithm operates on one mini-batch at a time (typically comprising a power-of-two number of instances for computational efficiency), cycling through the entire training dataset multiple times. Each complete cycle is termed an epoch. This iterative process aids in the gradual reduction of the loss.

• Forward Pass: Each mini-batch is passed from the input layer to the first hidden layer. The algorithm then computes the outputs of all neurons within this layer for each instance in the mini-batch. This result is propagated forward to the subsequent layer, and the process is repeated layer by layer until the output layer is reached. This forward pass is akin to making predictions, except that all intermediate results are retained because they are needed for the backward pass.

• Error Calculation: After the forward pass, the algorithm measures the network's output error using a loss function.

• Output Contribution Evaluation: The algorithm assesses how much each output connection contributed to the error. This is done analytically by applying the chain rule, which makes the step both fast and precise.

• Backward Error Propagation: Again using the chain rule, the algorithm measures how much of these error contributions came from each connection in the layer directly below. This backward process continues until the input layer is reached. In this way, the backward pass measures the error gradient across all the connection weights of the network.

• Gradient Descent Phase: The final step adjusts all the network's connection weights using the computed error gradients in a Gradient Descent step.
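The steps listed above can be sketched for a tiny MLP in NumPy. This is a minimal illustration under stated assumptions rather than the training code used in this work: the network has one hidden layer of four logistic units, the loss is the mean squared error, the whole toy dataset (XOR) is used as a single mini-batch, and the learning rate, seed, and function names are chosen only for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, params):
    """Forward pass; the hidden output h is kept for the backward pass."""
    W1, b1, W2, b2 = params
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    return h, y_hat


def backward(X, Y, h, y_hat, params):
    """Backward pass: chain rule from the output layer down to the input layer."""
    W1, b1, W2, b2 = params
    # Gradient of the MSE loss w.r.t. the output pre-activations
    delta2 = 2.0 * (y_hat - Y) / Y.size * y_hat * (1.0 - y_hat)
    grad_W2 = h.T @ delta2
    grad_b2 = delta2.sum(axis=0)
    # Error propagated to the hidden layer through W2
    delta1 = (delta2 @ W2.T) * h * (1.0 - h)
    grad_W1 = X.T @ delta1
    grad_b1 = delta1.sum(axis=0)
    return [grad_W1, grad_b1, grad_W2, grad_b2]


rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets
params = [rng.normal(0, 1, (2, 4)), np.zeros(4), rng.normal(0, 1, (4, 1)), np.zeros(1)]

learning_rate = 0.5
losses = []
for epoch in range(2000):
    h, y_hat = forward(X, params)                    # forward pass
    losses.append(float(np.mean((y_hat - Y) ** 2)))  # error calculation
    grads = backward(X, Y, h, y_hat, params)         # backward pass
    params = [p - learning_rate * g for p, g in zip(params, grads)]  # GD step

print(losses[0], losses[-1])
```

The loss recorded at the last epoch should be lower than at the first; how close it gets to zero depends on the random initialization.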

The backpropagation algorithm is important enough to summarize again: for each training instance, it first makes a prediction (forward pass) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally adjusts the connection weights to reduce the error (Gradient Descent step). For this algorithm to work properly, a pivotal change was made to the MLP's architecture: the step function was replaced with the logistic (sigmoid) function [44], denoted as $\sigma(z) = \frac{1}{1 + e^{-z}}$. The logistic function has a continuous nonzero derivative across its domain, enabling Gradient Descent to make progress at each step. In contrast, the step function consists only of flat segments, so there is no gradient to work with.

However, a challenge arises: as the algorithm progresses down to the lower layers, gradients diminish due to the cumulative effect of multiplications by values less than 1. Consequently, the Gradient Descent update leaves the connection weights of the lower layers virtually unchanged, and training never converges to a good solution. This predicament is known as the vanishing gradients problem. Conversely, gradients can grow in magnitude, causing layers to receive excessively large weight updates and ultimately leading to divergence. This issue is termed the exploding gradients problem. The study in [45] examined the combination of the logistic activation function and the weight initialization procedure in use at the time, and demonstrated that with this combination each layer's output variance is significantly greater than its input variance. Going forward through the network, the variance keeps increasing with each layer, until the activation function saturates in the upper layers. Notably, this saturation is made worse by the fact that the logistic function has a mean of 0.5 rather than 0.

As the logistic activation function in Figure 2.5 shows, the function saturates at 0 or 1 when its inputs become large (negative or positive), and its derivative then approaches zero. Consequently, there is minimal gradient available for backpropagation, and whatever gradient exists keeps getting diluted as backpropagation progresses down through the upper layers, so very little is left for the lower layers.

Figure 2.5 Logistic activation function saturation [39]

Glorot and Bengio [45] therefore suggested a way to dramatically reduce the unstable gradients problem: Glorot initialization, which is described next together with the related He initialization.
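The saturation of the logistic function described in this section can be checked numerically. The sketch below uses the identity σ'(z) = σ(z)(1 − σ(z)); the sample inputs and layer counts are arbitrary choices for illustration, and the second loop only bounds the factor contributed by the activation derivatives, ignoring the weights.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The derivative peaks at 0.25 for z = 0 and shrinks quickly as |z| grows
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid_grad(z))

# Upper bound on the product of activation derivatives across n stacked layers
for n_layers in (1, 5, 10):
    print(n_layers, 0.25 ** n_layers)
```

Even in the best case the activation derivatives contribute a factor of at most 0.25 per layer, which is one way to see why gradients shrink as they travel toward the lower layers.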

2.2.1 Glorot and He Initialization

The proper propagation of signals in both directions is crucial in neural networks: forward when making predictions, and backward when computing gradients. The authors of [45] argue that, for the signal to flow properly, the variance of a layer’s outputs should equal the variance of its inputs, and the gradients should have equal variance before and after flowing through the layer in the reverse direction. Both conditions cannot be guaranteed simultaneously unless the layer has an equal number of inputs and neurons (referred to as the $fan_{in}$ and $fan_{out}$ of the layer).

However, Glorot and Bengio introduced a practical compromise that has proven effective: initializing the connection weights of each layer with random values defined by equations (2.4) and (2.5), which involve a normal distribution and a uniform distribution, respectively, with the parameters outlined, where $fan_{avg} = (fan_{in} + fan_{out})/2$. This initialization strategy is referred to as Xavier initialization or Glorot initialization [45]. The significance of this technique has been recognized for over a decade. Applying Glorot initialization significantly accelerates training and is one of the influential strategies that have contributed to the success of Deep Learning.

Similar techniques for different activation functions have been presented in certain papers [46]. These approaches share a common framework, with variations in the scale of the variance $\sigma^2$ and in whether $fan_{avg}$ or $fan_{in}$ is used; for the uniform distribution, the bound is $r = \sqrt{3\sigma^2}$. In particular, the initialization technique tailored for the Rectified Linear Unit (ReLU) activation function, which will be discussed in the subsequent subsection, is sometimes referred to as He initialization.
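A short sketch may help make these initialization schemes concrete. It assumes the commonly cited forms, since equations (2.4) and (2.5) are not reproduced in the text here: Glorot normal uses mean 0 and variance $1/fan_{avg}$, Glorot uniform samples from $[-r, r]$ with $r = \sqrt{3/fan_{avg}}$, and He normal uses variance $2/fan_{in}$. The function names and the list-of-lists weight layout are illustrative choices, not an existing library API.

```python
import math
import random


def fan_avg(fan_in: int, fan_out: int) -> float:
    """Average of the number of inputs and the number of neurons of a layer."""
    return (fan_in + fan_out) / 2.0


def glorot_normal(fan_in: int, fan_out: int, rng: random.Random) -> list:
    """Weights ~ N(0, sigma^2) with sigma^2 = 1 / fan_avg (assumed form of Eq. 2.4)."""
    sigma = math.sqrt(1.0 / fan_avg(fan_in, fan_out))
    return [[rng.gauss(0.0, sigma) for _ in range(fan_out)] for _ in range(fan_in)]


def glorot_uniform(fan_in: int, fan_out: int, rng: random.Random) -> list:
    """Weights ~ U(-r, r) with r = sqrt(3 / fan_avg) (assumed form of Eq. 2.5)."""
    r = math.sqrt(3.0 / fan_avg(fan_in, fan_out))
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]


def he_normal(fan_in: int, fan_out: int, rng: random.Random) -> list:
    """Weights ~ N(0, sigma^2) with sigma^2 = 2 / fan_in (assumed He form for ReLU)."""
    sigma = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, sigma) for _ in range(fan_out)] for _ in range(fan_in)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    weights = glorot_uniform(fan_in=100, fan_out=50, rng=rng)
    flat = [w for row in weights for w in row]
    variance = sum(w * w for w in flat) / len(flat)
    print(f"empirical variance: {variance:.5f}, target 1/fan_avg: {1.0 / 75.0:.5f}")
```

Note that a uniform distribution on $[-r, r]$ has variance $r^2/3$, which is why $r = \sqrt{3\sigma^2}$ gives the uniform variant the same variance as the normal one.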

2.2.2 Non-Saturating Activation Functions

The backpropagation algorithm works well not only with the logistic function but also with various other activation functions. Several common options are presented below.

(a) ReLU activation

To address the vanishing gradient problem [47] associated with sigmoid activation, the Rectified Linear Unit (ReLU) [48] was introduced. The ReLU activation function is illustrated in Figure 2.6. Unlike the sigmoid function, ReLU does not saturate for positive inputs: its derivative is 0 for x < 0 and 1 otherwise, which largely removes the vanishing gradient issue. Additionally, ReLU promotes model sparsity, since a gradient of 0 essentially indicates that a neuron has become inactive. Moreover, ReLU is cheaper to compute than functions like sigmoid and tanh, because it only requires taking the maximum of 0 and x. Consequently, ReLU has become the standard activation function in today’s deep learning landscape.

Figure 2.6 ReLU activation

(b) Swish activation

Nonetheless, the search for improved activation functions continued. In October 2017, Google Brain introduced the Swish activation function [49], aiming to improve on existing options. The Swish activation function is characterized by the simple equation $f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$, as depicted in Figure 2.7.

Figure 2.7 Swish activation

Swish stands out as a smooth function, unlike ReLU, which changes direction abruptly at x = 0. Swish bends smoothly from 0 towards slightly negative values and then back upwards. Importantly, Swish is non-monotonic, which sets it apart from functions like ReLU, whose output never decreases as the input grows. This characteristic is highlighted in the authors’ paper, where they underscore that Swish’s non-monotonicity distinguishes it from most other activation functions.

The Swish activation function has several useful characteristics:

• Bounded Below and Sparse Activation: Similar to ReLU, Swish benefits from sparsity. Strongly negative inputs are mapped to values close to zero, which contributes to a sparse representation.

• Unbounded Above: For very large inputs, Swish outputs do not saturate at a maximum value (e.g., 1 for all neurons). This distinguishes Swish from saturating activation functions such as sigmoid and tanh; ReLU shares this property.

• Smooth Curve and Smooth Landscape: The smoothness of the Swish curve extends to its derivative, leading to a smoother landscape for optimization. This smoothness helps the optimizer move the model towards minimal loss.

• Utilization of Negative Values: Unlike ReLU, where negative inputs are set to zero, Swish retains small negative outputs, particularly for inputs close to zero. This property is beneficial for capturing subtle patterns in the data, making Swish more flexible in handling different types of information.

In essence, Swish’s smooth and flexible behavior, bounded below and unbounded above, makes it a compelling alternative to ReLU, offering improvements in terms of capturing complex patterns and optimizing model performance.

(c) Mish activation

Mish activation [50] draws inspiration from Swish activation. The Mish activation function is defined as $f(x) = x \cdot \tanh(\ln(1 + e^{x}))$. Its graph is depicted in Figure 2.8.

While Mish shares many of the advantages of Swish, the authors of [50] suggest that the error space could be even smoother with Mish. However, the primary drawback of the Mish activation function is its significantly higher computational cost.
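The three activation functions discussed in this subsection can be compared side by side. The snippet below is a minimal sketch using only the definitions given above; the numerically stable forms of the sigmoid and softplus helpers are an implementation choice added here to avoid overflow, not something prescribed by the cited papers.

```python
import math


def relu(x: float) -> float:
    """ReLU: the maximum of 0 and x."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU as stated in the text: 0 for x < 0 and 1 otherwise."""
    return 0.0 if x < 0 else 1.0


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x: float) -> float:
    """Swish: x * sigmoid(x) = x / (1 + e^(-x))."""
    return x * sigmoid(x)


def softplus(x: float) -> float:
    """ln(1 + e^x), written to avoid overflow for large x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish: x * tanh(ln(1 + e^x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    for x in (-5.0, -1.0, -0.1, 0.0, 0.1, 1.0, 5.0):
        print(f"x = {x:5.1f}  relu = {relu(x):7.4f}  swish = {swish(x):7.4f}  mish = {mish(x):7.4f}")
```

Evaluating the functions at negative inputs shows the non-monotonic behavior: Swish and Mish dip below zero for moderately negative inputs and return towards zero as the input becomes more negative, whereas ReLU stays at exactly zero.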

2.2.3 Batch Normalization

Although using He initialization along with ReLU (or its variants) can significantly reduce the likelihood of vanishing/exploding gradient problems at the start of training, it does not guarantee that these issues will not arise during the course of training. In [51], a technique called Batch Normalization (BN) is introduced to address these problems. This method inserts a new operation into the model in each hidden layer, immediately before or after its activation function.

Figure 2.8 Mish activation [50]

The operation zero-centers and normalizes each input, then scales and shifts the result using two learnable parameter vectors per layer: one for scaling and another for shifting. This allows the model to learn the most appropriate scale and mean for each of the layer’s inputs. To zero-center and normalize the inputs, the algorithm needs their mean and standard deviation, which it estimates over the current mini-batch. The entire procedure is summarized below:

Algorithm 1 Batch Normalization algorithm

In this algorithm:

• $\mu_B$ is the vector of input means, evaluated over the entire mini-batch B.

• $\sigma_B$ is the vector of input standard deviations, also evaluated over the entire mini-batch.

• $m_B$ is the number of instances in the mini-batch.

• $\hat{x}^{(i)}$ is the vector of zero-centered and normalized inputs for instance i.

• $\gamma$ is the output scale parameter vector for the layer.

• $\otimes$ denotes element-wise multiplication.

• $\beta$ is the output shift (offset) parameter vector for the layer. Each input is offset by its corresponding shift parameter.

• $\varepsilon$ is a small number that prevents division by zero (commonly $10^{-7}$). It is called a smoothing term.

• $z^{(i)}$ is the output of the BN operation: a rescaled and shifted version of the inputs.

In the training phase, Batch Normalization (BN) standardizes its inputs by zero-centering and normalizing them over each mini-batch, then rescales and shifts them. During the testing phase, BN relies on two additional vectors: $\mu$ (the final vector of input means) and $\sigma$ (the final vector of input standard deviations). These are estimated during training using an exponential moving average [52] of the mini-batch statistics and are used to make predictions on new instances. Note that while $\mu$ and $\sigma$ are computed during training, they are used only after training, where they replace the batch input means and standard deviations in the BN algorithm during inference.
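The symbol definitions above suggest the following minimal sketch of the Batch Normalization operation, written under the assumption that Algorithm 1 takes the standard form: compute the mini-batch mean and variance per input, normalize with the smoothing term, then scale by $\gamma$ and shift by $\beta$. The function names and the moving-average `momentum` of 0.9 are illustrative assumptions, not values taken from [51] or [52].

```python
import math


def batch_norm_train(batch, gamma, beta, eps=1e-7):
    """Normalize a mini-batch (list of input vectors) and return z, mu_B, sigma_B."""
    m_b = len(batch)          # number of instances in the mini-batch
    n = len(batch[0])         # number of inputs per instance
    mu = [sum(x[j] for x in batch) / m_b for j in range(n)]
    var = [sum((x[j] - mu[j]) ** 2 for x in batch) / m_b for j in range(n)]
    z = [
        [gamma[j] * (x[j] - mu[j]) / math.sqrt(var[j] + eps) + beta[j] for j in range(n)]
        for x in batch
    ]
    sigma = [math.sqrt(v) for v in var]
    return z, mu, sigma


def update_moving_average(running, current, momentum=0.9):
    """Exponential moving average of a statistic (momentum is an assumed value)."""
    return [momentum * r + (1.0 - momentum) * c for r, c in zip(running, current)]


def batch_norm_inference(x, mu, sigma, gamma, beta, eps=1e-7):
    """Apply BN to one instance using the statistics estimated during training."""
    return [
        gamma[j] * (x[j] - mu[j]) / math.sqrt(sigma[j] ** 2 + eps) + beta[j]
        for j in range(len(x))
    ]


if __name__ == "__main__":
    batch = [[1.0, 10.0], [3.0, 30.0]]
    z, mu_b, sigma_b = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
    print("normalized batch:", z)
    print("batch means:", mu_b, "batch standard deviations:", sigma_b)
```

In a training loop, `update_moving_average` would be called once per mini-batch on both the means and the standard deviations, and the resulting vectors would be passed to `batch_norm_inference` after training.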

With Batch Normalization, the vanishing gradients problem is mitigated to the point where saturating activation functions such as the logistic function and even the tanh function can be used effectively. Networks also become notably less sensitive to weight initialization. Researchers have been able to employ significantly higher learning rates, leading to a substantial acceleration of the learning process. Furthermore, Batch Normalization acts as a regularizer, reducing the need for other regularization techniques to prevent overfitting.

Deep neural networks typically have thousands or even millions of parameters. This wide parameter space gives them remarkable flexibility to fit a diverse array of complex datasets. However, this flexibility also heightens the risk of overfitting the training data, which makes regularization techniques necessary. One such technique that has gained significant traction in deep neural network regularization is Dropout [53]. Dropout has proven to be remarkably effective, often leading to a 1-2 percent improvement in accuracy for modern neural networks. While this might not sound like a dramatic enhancement, a 2% increase corresponds to a reduction in error rate of roughly 40% for a model that already reaches 95% accuracy (from a 5% to around a 3% error rate). The concept behind dropout is relatively straightforward: during each training iteration, every neuron (excluding output neurons) has a certain probability, denoted as p, of being temporarily removed. The neuron’s contribution is entirely disregarded during that iteration, but it may be active again in subsequent iterations (Figure 2.9). The parameter p is termed the dropout rate, and it generally lies between 10% and 50%. Neurons are no longer dropped during the testing or inference phase.

Figure 2.9 With dropout regularization, a random subset of all the neurons in one or more layers (with the exception of the output layer) is "dropped out" in each training iteration [39]
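The dropout mechanism can be sketched in a few lines. The version below is a minimal illustration under one assumption that is not stated in the text: surviving activations are divided by the keep probability 1 − p during training (often called inverted dropout) so that their expected value matches what the next layer sees at inference. The function name and signature are hypothetical.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training.

    Assumption: survivors are scaled by 1 / (1 - p) so that the expected
    activation is unchanged between training and inference.
    """
    if not training or p == 0.0:
        return list(activations)  # no neuron is dropped at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    hidden = [1.0] * 10
    print("training: ", dropout(hidden, p=0.5, rng=rng))
    print("inference:", dropout(hidden, p=0.5, rng=rng, training=False))
```

The error-rate arithmetic quoted above can be verified the same way: going from a 5% to a 3% error rate is a relative reduction of (5 − 3) / 5 = 40%.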

2.3 Convolutional Neural Network

Convolutional neural networks (CNNs) emerged from investigations into the workings of the brain’s visual cortex [54–56] and have been used in image recognition since the 1980s. Over the years, driven by the growth in computing capabilities and the abundance of available training data, CNNs have advanced to the point where they achieve superhuman performance on some complex visual tasks. As a result, CNNs play a pivotal role in applications like image analysis services, self-driving vehicles, automated video classification systems, and more. Importantly, CNNs are not confined to visual perception; they have also proven successful in other domains, including speech recognition and natural language processing.

2.3.1 The Architecture of the Visual Cortex

In [56], the authors demonstrated that many neurons in the visual cortex have a small local receptive field, meaning that they respond only to visual stimuli located in a limited region of the visual field, as illustrated in Figure 2.10, where dashed circles denote the local receptive fields of five neurons. The receptive fields of different neurons can overlap, and together they cover the entire visual field.

Figure 2.10 As the visual signal progresses through the brain, neurons respond to more complex patterns in larger receptive fields [39]

Moreover, the researchers observed that some neurons respond only to images of horizontal lines, while others respond only to lines with other orientations. They also identified neurons with larger receptive fields that react to more intricate patterns formed by combining lower-level patterns. These findings led to the hypothesis that higher-level neurons are based on the outputs of neighboring lower-level neurons (as illustrated in Figure 2.10, where each neuron is connected to only a subset of neurons from the previous layer). This architecture enables the detection of a wide array of complex patterns across different regions of the visual field.

These insights culminated in the introduction of the neocognitron in 1980 [57], which ultimately paved the way for the development of convolutional neural networks. A notable milestone in this progression was the LeNet-5 architecture introduced in [58]. LeNet-5, widely employed by financial institutions to classify handwritten digits, integrated several well-established building blocks, such as fully connected layers and sigmoid activation functions. However, it also introduced two novel components: convolutional layers and pooling layers.

2.3.2 Convolutional Layers

A fundamental characteristic of a CNN is that neurons in the first convolutional layer are connected only to pixels within their receptive fields, rather than to every pixel in the input image (as depicted in Figure 2.11). Likewise, each neuron in the subsequent convolutional layers is connected only to neurons in a small local region of the previous layer. This arrangement enables the network to focus on small low-level features in the first hidden layers and then assemble them into larger higher-level features in subsequent layers. This hierarchical structure mirrors the organization of visual information in real-world images, which contributes to the CNN’s remarkable performance in image recognition tasks.

Figure 2.11 Square local receptive fields in CNN layers [39]

A neuron situated at row $i$ and column $j$ within a given layer is linked to the outputs of neurons in the preceding layer located in rows $i$ to $i + f_h - 1$ and columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ represent the height and width of the receptive field, respectively (as illustrated in Figure 2.12). To ensure that a layer maintains the same height and width as the preceding layer, it is common to add zeros around the input data. This technique is referred to as zero padding. By spacing out the receptive fields, as depicted in Figure 2.13, a large input layer can also be connected to a much smaller subsequent layer. This leads to a significant reduction in the computational complexity of the model.

The shift from one receptive field to the next is referred to as the stride. In the presented illustration, a 5 × 7 input layer is linked to a 3 × 4 layer using 3 × 3 receptive fields and a stride of 2, with zero padding applied. In this example the stride is the same in both directions, but it does not have to be.
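The layer sizes in this example can be reproduced with a little arithmetic. The sketch below assumes a padding convention in which just enough zeros are added for the output size to equal the input size divided by the stride, rounded up; this convention and the helper names are assumptions made for illustration, since the text does not spell out the padding rule.

```python
import math


def output_size(in_size: int, stride: int) -> int:
    """Output length along one axis when zero padding is used (assumed convention)."""
    return math.ceil(in_size / stride)


def total_zero_padding(in_size: int, field: int, stride: int) -> int:
    """Number of zeros to add along one axis so every receptive field fits."""
    out_size = output_size(in_size, stride)
    return max((out_size - 1) * stride + field - in_size, 0)


def receptive_range(index: int, field: int, stride: int) -> range:
    """Rows (or columns) of the padded input seen by the neuron at `index`."""
    return range(index * stride, index * stride + field)


if __name__ == "__main__":
    # 5 x 7 input, 3 x 3 receptive fields, stride 2 in both directions
    rows, cols = output_size(5, 2), output_size(7, 2)
    print(f"output layer: {rows} x {cols}")
    print("zeros added vertically:", total_zero_padding(5, 3, 2))
    print("zeros added horizontally:", total_zero_padding(7, 3, 2))
    print("rows seen by the last output row:", list(receptive_range(rows - 1, 3, 2)))
```

With these assumptions the 5 × 7 input yields a 3 × 4 output, matching the illustration, and two zeros are added along each axis (for instance one on each side).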

More precisely, a neuron located at row $i$ and column $j$ within the higher layer is connected to the outputs of neurons in the previous layer situated in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, where $s_h$ and $s_w$ represent the vertical and horizontal strides, respectively. This mechanism of stride and receptive fields allows CNNs to efficiently capture features across different scales and positions in the input data.

Figure 2.12 Connections between layers and zero padding [39]

Figure 2.13 Reducing the dimensionality of the input feature map using a stride of 2 [39]

The weights of a neuron can be thought of as a small image the size of the receptive field. Figure 2.14 illustrates two possible sets of weights, known as filters. The first filter is depicted as a black square with a vertical white line through its center (this filter corresponds to a 7 × 7 receptive field in which all values are 0 except for the central vertical column, which is filled with 1s). Neurons with these weights focus solely on the central vertical line in their receptive field and disregard all other input values. The second filter is a black square with a white horizontal line in the middle. Similarly, neurons with these weights respond to the central horizontal line in their receptive field and filter out the remaining information.

Consider a scenario where all neurons in a layer use the same vertical line filter (and the same bias term), and the network is given the bottom image in Figure 2.14 as input. In this case, the layer’s output resembles the top-left image: the vertical white lines are accentuated, while the rest of the image becomes blurred. Similarly, if all neurons use the same horizontal line filter, the result is the upper-right image, in which the horizontal white lines are emphasized and the rest becomes less distinct. Consequently, a layer of neurons sharing the same filter generates a feature map that highlights the regions of the input image that activate the filter most strongly.

Figure 2.14 Applying two different filters to obtain two feature maps [39]

It is important to note that filters are not manually designed; rather, the convolutional layer learns the most relevant filters automatically during training. The layers above then learn to combine them into more complex and sophisticated patterns, allowing the network to identify intricate features in the data.

Up to this point, the output of each convolutional layer has been depicted, for simplicity, as a single 2D feature map. In reality, each convolutional layer consists of multiple filters and outputs one feature map per filter, so it is more accurately represented in 3D (as seen in Figure 2.15). Each filter in the convolutional layer employs different parameters and creates a distinct feature map. The receptive field of a neuron remains as described earlier, but it spans all the feature maps of the preceding layer. In essence, a convolutional layer applies multiple trainable filters to its inputs simultaneously, enabling it to detect various features anywhere in its inputs.

Moreover, input images often have multiple sublayers, each corresponding to a color channel. Color images typically have three (red, green, and blue), grayscale images have just one channel, and certain images possess additional channels, such as satellite photos that capture diverse light frequencies, including infrared.

To elaborate, consider a specific convolutional layer denoted as $l$. A neuron located in the $i$-th row and $j$-th column of feature map $k$ is connected to the outputs of neurons from the preceding layer $l - 1$ positioned in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, across all feature maps of layer $l - 1$. Note that all neurons located in the same row $i$ and column $j$ but in different feature maps are connected to the outputs of exactly the same neurons in the previous layer.

Figure 2.15 Images with three color channels and convolutional layers with multiple feature maps [39]

Equation 2.6 summarizes the preceding description in a single mathematical expression for the output of a neuron in a convolutional layer. The many indices make it look intricate, but it simply computes the weighted sum of all the inputs plus the bias term.

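$$z_{i,j,k} = b_k + \sum_{u=0}^{f_h - 1} \sum_{v=0}^{f_w - 1} \sum_{k'=0}^{f_{n'} - 1} x_{i \times s_h + u,\; j \times s_w + v,\; k'} \times w_{u,v,k',k} \qquad (2.6)$$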

• $z_{i,j,k}$ is the output of the neuron located in row $i$, column $j$ in feature map $k$ of the convolutional layer (layer $l$).

• $f_h$ and $f_w$ are the height and width of the convolutional kernel, $s_h$ and $s_w$ are the vertical and horizontal strides, and $f_{n'}$ is the number of feature maps in the preceding layer (layer $l - 1$).

• $x_{i \times s_h + u,\; j \times s_w + v,\; k'}$ is the output of the neuron located in layer $l - 1$, row $i \times s_h + u$, column $j \times s_w + v$, feature map $k'$.

• $b_k$ is the bias term for feature map $k$ (in layer $l$). It acts like a knob that adjusts the overall intensity of feature map $k$.

• $w_{u,v,k',k}$ is the weight between any neuron in feature map $k$ of layer $l$ and its input located at row $u$, column $v$ (relative to the neuron’s receptive field) and feature map $k'$.
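Equation 2.6 translates almost directly into nested loops. The following is a plain-Python sketch intended only to make the indices concrete: it assumes no zero padding, stores the input as `x[row][column][k']` and the weights as `w[u][v][k'][k]`, and uses small 3 × 3 line filters in place of the 7 × 7 filters of Figure 2.14. These layout and size choices are illustrative assumptions.

```python
def conv2d(x, w, b, s_h=1, s_w=1):
    """Compute z[i][j][k] as in Equation 2.6 (no zero padding)."""
    f_h, f_w = len(w), len(w[0])                    # receptive field height and width
    f_n_prev, f_n = len(w[0][0]), len(w[0][0][0])   # feature maps in layers l-1 and l
    out_h = (len(x) - f_h) // s_h + 1
    out_w = (len(x[0]) - f_w) // s_w + 1
    # Start every output at the bias term b_k, then add the weighted sum.
    z = [[[b[k] for k in range(f_n)] for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for k in range(f_n):
                for u in range(f_h):
                    for v in range(f_w):
                        for kp in range(f_n_prev):
                            z[i][j][k] += x[i * s_h + u][j * s_w + v][kp] * w[u][v][kp][k]
    return z


# 4 x 4 single-channel image containing one vertical white line in column 1.
image = [[[1.0 if col == 1 else 0.0] for col in range(4)] for _ in range(4)]

# Two 3 x 3 filters: k = 0 is a vertical line filter, k = 1 a horizontal line filter.
filters = [
    [[[1.0 if v == 1 else 0.0, 1.0 if u == 1 else 0.0]] for v in range(3)]
    for u in range(3)
]

feature_maps = conv2d(image, filters, b=[0.0, 0.0])

if __name__ == "__main__":
    for k, name in enumerate(("vertical", "horizontal")):
        print(name, [[feature_maps[i][j][k] for j in range(2)] for i in range(2)])
```

The vertical filter responds strongly only where its central column lines up with the white line, while the horizontal filter gives a weak, uniform response, which is the feature-map behavior described for Figure 2.14.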

2.3.3 Pooling Layers

The purpose of the pooling layer is to downsample the input image, reducing the computational load, memory usage, and number of parameters, which in turn helps mitigate the risk of overfitting. As in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. As before, the size of the receptive field, the stride, and whether zero padding is used need to be specified. Unlike convolutional neurons, pooling neurons have no weights; instead, they aggregate their inputs using functions like max or mean.

Currently, the most common type of pooling layer is the max pooling layer, illustrated in Figure 2.16. In this case, a pooling kernel of size 2 × 2 with a stride of 2 and no padding is employed. In max pooling, only the maximum input value within each receptive field is passed on to the next layer, while the other inputs are discarded. In the example from Figure 2.16, the input values in the lower left receptive field are 1, 3, 5, and 2; hence, only the maximum value, 5, is propagated to the subsequent layer. Due to the stride of 2, the output image’s width and height are half those of the input image.

Figure 2.16 Max pooling layer (2 × 2 pooling kernel, stride 2, no padding) [39]

A max pooling layer not only reduces computations, memory usage, and the number of parameters, but also introduces a degree of invariance to minor translations, as depicted in Figure 2.17.

Figure 2.17 Invariance to small translations [39]

This can be observed in the three images (A, B, C) of the figure, which pass through a max pooling layer with a 2 × 2 kernel, a stride of 2, and no zero padding. Images B and C are identical to image A but shifted to the right by one and two pixels, respectively. The outputs of the max pooling layer for images A and B are identical, illustrating translation invariance. The output for image C differs: it is shifted one pixel to the right, but 75% invariance is still retained. A certain level of translation invariance on a larger scale can be achieved by inserting max pooling layers at intervals within a CNN. Additionally, max pooling provides a limited amount of rotational and scale invariance. In scenarios where predictions do not depend on these details, such as in classification tasks, such invariance (even if limited) can be advantageous.
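Max pooling and its tolerance to small shifts can be illustrated with a short sketch. The pixel values below are made up for the illustration and are not those of Figures 2.16 and 2.17, apart from the 2 × 2 field containing 1, 3, 5, and 2; the helper `shift_right` is likewise hypothetical.

```python
def max_pool2d(image, size=2, stride=2):
    """Max pooling without padding: keep the maximum of each receptive field."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            max(image[i * stride + u][j * stride + v] for u in range(size) for v in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(image, n):
    """Shift every row n pixels to the right, filling the left side with zeros."""
    return [[0] * n + row[: len(row) - n] for row in image]


# A receptive field containing 1, 3, 5 and 2 passes only the value 5 onwards.
single_field = max_pool2d([[1, 3], [5, 2]])

# Image A, and copies shifted right by one pixel (B) and by two pixels (C).
image_a = [[0, 0, 5, 0, 3, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]]
image_b = shift_right(image_a, 1)
image_c = shift_right(image_a, 2)

if __name__ == "__main__":
    print("single field:", single_field)
    print("A:", max_pool2d(image_a))
    print("B:", max_pool2d(image_b))
    print("C:", max_pool2d(image_c))
```

In this toy example, A and B pool to the same output, while the output for C is the output for A shifted by one position, mirroring the qualitative behavior described above (the exact percentage of unchanged outputs depends on the image).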

However, max pooling does come with certain drawbacks. It is quite destructive: even with a small 2 × 2 kernel and a stride of 2, the output is halved in both dimensions, so its area is four times smaller and 75% of the input values are discarded. Moreover, invariance is not desirable in all applications. Take semantic segmentation, for instance: if the input image is shifted by a pixel to the right, the output image should also shift by one pixel to maintain consistency. In such pixel-wise classification tasks, where the goal is to assign each pixel to a specific class, equivariance rather than invariance is required: a slight change in the inputs should lead to a corresponding slight change in the outputs.

2.3.4 Transposed Convolutional Layers in semantic segmentation

In semantic segmentation, each pixel is assigned a category based on the type of object it belongs to, as illustrated in Figure 2.18. Notably, different objects of the same class are not distinguished from one another. For instance, all cars on the right side of the segmented image are grouped together as a single large pixel region. The primary challenge in this task is that, as images pass through conventional CNNs, their spatial resolution gradually diminishes due to layers with strides larger than 1. As a result, a typical CNN might recognize that a person is located somewhere on the left side of the image, but it cannot localize that person much more precisely.

Figure 2.18 Example of semantic segmentation [39]

To produce a full-resolution output, the feature maps must therefore be upsampled. There are various methods available for upsampling, such as bilinear interpolation, which works reasonably well for upscaling by factors up to ×4 or ×8. For larger increases in resolution, the Transposed Convolutional Layer [59] is a preferred choice. This layer can be thought of as first stretching the image by inserting empty rows and columns (full of zeros) and then applying a regular convolution (as shown in Figure 2.19). Some refer to it as a convolutional layer with fractional strides (e.g., 1/2 in Figure 2.19). The Transposed Convolutional Layer can be configured to mimic linear interpolation, but its advantage lies in being trainable, so it can learn to do better during training. In a transposed convolutional layer, the stride determines how much the input is stretched rather than the size of the filter steps, so, unlike in pooling or convolutional layers, a larger stride produces a larger output.

Figure 2.19 Upsampling utilizing a Transposed Convolutional Layer [39]
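The stretch-then-convolve view of a transposed convolution can be sketched in one dimension. The code below is a simplified illustration, not the formulation of [59]: it inserts `stride - 1` zeros between samples, pads with zeros, and slides the flipped kernel over the result. The kernel `[0.5, 1.0, 0.5]` is chosen here so that the layer reproduces linear interpolation; in a real network these weights would be learned.

```python
def stretch(signal, stride):
    """Insert stride - 1 zeros between consecutive samples."""
    out = []
    for index, value in enumerate(signal):
        out.append(value)
        if index < len(signal) - 1:
            out.extend([0.0] * (stride - 1))
    return out


def correlate_full(signal, kernel):
    """Stride-1 sliding dot product with len(kernel) - 1 zeros padded on each side."""
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + u] * kernel[u] for u in range(len(kernel)))
        for i in range(len(padded) - len(kernel) + 1)
    ]


def transposed_conv1d(signal, kernel, stride):
    """1D transposed convolution: stretch the input, then convolve with the kernel."""
    return correlate_full(stretch(signal, stride), kernel[::-1])


upsampled = transposed_conv1d([1.0, 3.0], kernel=[0.5, 1.0, 0.5], stride=2)

if __name__ == "__main__":
    print(upsampled)
```

The output has length (n − 1) × stride + kernel size, so a larger stride yields a larger output, and with this kernel the interior samples (1.0, 2.0, 3.0) are exactly the linear interpolation of the two input values.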

2.3.5 Skip Layer

Utilizing Transposed Convolutional Layers is a viable approach to increasing the size of feature maps, but the result may still lack precision. To address this challenge, skip connections from lower layers are introduced: the output image is first upsampled by a factor of 2 (rather than 32), and the output of a lower layer that has this doubled resolution is added to it. The result is then upsampled by a factor of 16, giving a total upsampling factor of 32 (as depicted in Figure 2.20). This helps recover some of the spatial resolution lost in earlier pooling layers. To retrieve even finer details from even lower levels, the architecture is enhanced with a second skip connection. In summary, the output of the initial CNN is upsampled by a factor of 2, the output of a lower layer (at the corresponding scale) is added, the result is upsampled by a factor of 2 again, the output of an even lower layer is added, and finally the result is upsampled by a factor of 8. Additionally, this technique can even be used to scale up beyond the size of the original image, a process known as super-resolution.

Figure 2.20 Skip layers recover spatial resolution from lower layers [39]
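The bookkeeping of upsampling factors with two skip connections can be traced with a toy one-dimensional example. This sketch makes two simplifying assumptions that are not part of the architecture described above: nearest-neighbor repetition stands in for the trainable transposed convolutions, and the feature maps are plain lists at 1/32, 1/16, and 1/8 of a 64-sample input. All names are illustrative.

```python
def upsample_nearest(signal, factor):
    """Repeat each sample `factor` times (stand-in for a transposed convolution)."""
    return [value for value in signal for _ in range(factor)]


def add(a, b):
    """Element-wise sum of two feature maps at the same resolution."""
    if len(a) != len(b):
        raise ValueError("skip connection and upsampled map must have the same size")
    return [x + y for x, y in zip(a, b)]


def decode_with_skips(coarse, skip_16, skip_8):
    """Upsample x2, add a skip, upsample x2, add a skip, then upsample x8."""
    x = add(upsample_nearest(coarse, 2), skip_16)   # 1/32 -> 1/16 resolution
    x = add(upsample_nearest(x, 2), skip_8)         # 1/16 -> 1/8 resolution
    return upsample_nearest(x, 8)                   # 1/8  -> full resolution


if __name__ == "__main__":
    coarse = [1.0, 2.0]      # 64 / 32 = 2 samples
    skip_16 = [0.1] * 4      # 64 / 16 = 4 samples
    skip_8 = [0.0] * 8       # 64 / 8  = 8 samples
    output = decode_with_skips(coarse, skip_16, skip_8)
    print(len(output), "samples; total upsampling factor:", len(output) // len(coarse))
```

The three stages multiply to 2 × 2 × 8 = 32, the same total factor as upsampling once by 32, but the intermediate additions let finer-resolution information from lower layers flow into the output.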

2.4 U-Net based architectures

The U-Net architecture [17] has played a pivotal role in shaping the landscape of deep learning-based image segmentation. In automated medical image segmentation, significant efforts have been directed towards refining and advancing the U-Net framework. Notably, attention-based methods have attracted considerable interest due to their efficacy in segmenting intricate features in biomedical images across diverse imaging modalities.

One such adaptation is the Residual Attention U-Net [60], which leverages a soft attention mechanism to strengthen the network’s ability to discern a comprehensive range of COVID-19 effects within chest CT scans. For lung segmentation in chest X-rays, the XLSor approach [61] employs a criss-cross attention block to aggregate long-range contextual information, contributing to improved accuracy.

A noteworthy variation of the U-Net architecture, termed MultiResUNet, was

Date posted: 20/04/2024, 20:36
