INTRODUCTION
THEORETICAL THESIS
Image Processing
To begin with, a fundamental concept in image processing is the RGB color system. RGB stands for Red, Green, and Blue, the primary colors of light, which can be separated by a lens. By combining the three colors in different ratios, a vast range of colors can be generated. This knowledge is crucial in image processing because it underpins how colors are represented and manipulated in digital images.
By combining these three colors in various proportions, different colors can be produced. Each set of three integer values between 0 and 255 generates a unique color. Since there are 256 possible values for each of the three channels, the total number of colors that can be created in the RGB system is 256 × 256 × 256 = 16,777,216, a range that covers all commonly used colors.
Each variation in the values of red, green, and blue produces a unique color, but the difference is usually so small that it is not noticeable to the naked eye. The definition given in the previous paragraph describes the full RGB range. In digital video, however, the convention used for RGB is often not the entire range; video RGB typically employs a scale of relative values, where (16, 16, 16) represents black and (235, 235, 235) represents white.
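As a small illustration, the mapping between full-range RGB values (0-255) and the video range (16-235) can be sketched in Python (NumPy assumed; this is a simple linear rescaling, not a full color-space conversion):

import numpy as np

def full_to_video_range(rgb):
    # Scale full-range RGB values (0-255) into the 16-235 video range.
    rgb = np.asarray(rgb, dtype=np.float32)
    return np.round(16 + rgb * (235 - 16) / 255).astype(np.uint8)

def video_to_full_range(rgb):
    # Inverse mapping: expand 16-235 video-range values back to 0-255.
    rgb = np.asarray(rgb, dtype=np.float32)
    return np.round((rgb - 16) * 255 / (235 - 16)).clip(0, 255).astype(np.uint8)

print(full_to_video_range([0, 128, 255]))   # -> [ 16 126 235]
print(video_to_full_range([16, 126, 235]))  # -> [  0 128 255]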
AI – Artificial Intelligence
2.2.1 Definition, history, and development of AI
AI is a branch of computer science that strives to create intelligent machines capable of performing tasks that require human intelligence, such as learning, perception, language translation, and problem-solving. It has many applications in fields such as healthcare, finance, education, transportation, and more.
The history of AI dates back to the 1940s, when scientists began studying machines' ability to perform calculations. The field of AI was established at the Dartmouth Conference in 1956, which focused on topics such as machine learning, natural language processing, logic, and games. This conference marked the beginning of an important research field aiming to create a computer program that could "think" like a human.
In the 1960s, AI algorithms and models were further developed, and John McCarthy, who had coined the term "artificial intelligence," introduced foundational concepts and tools such as the LISP programming language that became central to the field. In the 1980s, artificial neural network models were developed, making the field of AI even more powerful. Since then, AI has become an important area of computer science and is widely applied in fields such as healthcare, finance, education, marketing, industry, and more. It is expected that AI will continue to play a significant role in solving human problems in the future.
In the field of artificial intelligence, Machine Learning (ML) and Deep Learning (DL) are the two methods that are often mentioned first, and both are important approaches for learning from input data. ML allows computers to learn and improve from input data instead of being explicitly programmed to perform a specific task. In other words, machine learning teaches computers how to solve a specific problem by letting them learn from data. This is done through the search for models, algorithms, and data analysis techniques that help computers make predictions, recognize patterns, and classify data. Common machine learning paradigms include supervised learning, unsupervised learning, and semi-supervised learning. Some applications of machine learning include detecting credit card fraud, recommending products to customers, classifying spam emails, and facial recognition.
DL is a branch of Machine Learning that uses artificial neural networks to learn from input data. Deep Learning uses multiple layers of neurons to learn and extract features from the data; the deeper and more complex the feature extraction, the higher the quality of the output.
DL is applied in speech recognition, image and video recognition, natural language processing, autonomous driving, and biotech robots.
Fig 2.2 Relationship between AI, ML, and DL
Although Deep Learning is considered a form of Machine Learning, these two methods have fundamental differences such as:
Complexity of input data: Machine Learning is usually used to process data with medium complexity, while Deep Learning is often used for data with high complexity, such as images, videos, or natural language data.
Neural network architecture: Machine Learning uses traditional algorithms such as SVM, Naive Bayes, Logistic Regression, K-Nearest Neighbors (KNN), Decision Trees, and Random Forests. Deep Learning, meanwhile, uses a more complex neural network architecture consisting of multiple layers of neurons, allowing it to learn complex features of the data.
Amount of input data: To achieve high accuracy, Deep Learning requires more input data than Machine Learning. This allows Deep Learning to learn more complex features, increasing the accuracy of predictions.
Processing speed: Deep Learning requires more computing power than Machine Learning, so it needs more powerful processors and system resources to process data.
Fig 2.3 Comparison between Machine Learning and Deep Learning
Deep learning uses artificial neural network algorithms to extract features and learn. Each algorithm has different complexities and is used for different types of problems. Some notable deep learning algorithms include:
Deep Neural Networks – DNN: Deep neural network
Convolutional Neural Networks – CNN: Deep learning convolutional network
Autoencoder – AE: Deep learning autoencoder
Recurrent Neural Networks – RNN: Deep learning network based on short-term memory
Directional Deep Learning – DDL: Directional deep learning network
Objective-Optimized Deep Learning – OODL: Deep learning network optimized for objectives
Self-adjusting Deep Learning – SADL: Deep learning network that self-adjusts
CNN – Convolutional Neural Networks
Convolutional Neural Networks (CNNs), also known as convolutional neural nets, are among the most advanced and popular Deep Learning models today and are ideal for solving problems with image data; they enable the highly accurate intelligent systems we see today. CNNs are widely used in the recognition, classification, and identification of objects in images. The development of network architectures has gone hand in hand with the development of computer hardware such as ever-faster GPUs. Distributed and parallel training techniques on multiple GPUs allow a model to be trained in just a few hours, compared with training that used to take days and be expensive. Frameworks supporting deep learning have also multiplied and improved, becoming tools that meet all the needs of deep learning training; the most popular are PyTorch (Facebook), TensorFlow (Google), and Apache MXNet (backed by Amazon), all developed and backed by leading technology companies. Since the ImageNet dataset, image datasets have affirmed their role in driving the development of AI: algorithms are compared with each other based on their results on standardized datasets. Thanks to the expansion of free training platforms like Google Colab and Kaggle, everyone can access AI. Global AI development strategies in corporations and countries around the world have led to the formation of AI research institutes that bring together many outstanding scientists and breakthrough research.
Fig 2.4 The development process of CNN [7]
Many new CNN architectures have been formed, developed, and improved in terms of depth, block design, and block connectivity, such as GoogLeNet, DenseNet, etc. Prior to 2012, most researchers believed that the most important part of a pipeline was the representation: SIFT, SURF, and HOG were manually designed feature extraction methods applied in combination with machine learning algorithms such as SVM, MLP, k-NN, and Random Forest.
The characteristics of these architectures are:
The generated features cannot be trained because the rules that create them are fixed.
The pipeline is separated between feature extractors and classifiers.
A group of pioneering researchers believed that features could be learned by the model itself, and that to capture sufficient complexity, features should be learned hierarchically through multiple layers, because each image contains features such as vertical, horizontal, and diagonal lines, as well as unique features that help identify objects. In the lowest layers of a CNN, the model learns to extract features similar to those produced by traditional feature extraction functions such as HOG and SIFT. This research direction has continued to develop through experimentation with new ideas, algorithms, and architectures, and more and more CNN models are being explored.
A CNN, or Convolutional Neural Network, is a type of feedforward ANN that consists of multiple stacked convolutional layers. These layers use non-linear activation functions such as ReLU and tanh to activate the weighted sums in their nodes. The layers are connected to each other through convolutional mechanisms: the next layer is the result of the convolution operation performed on the previous layer, which allows for local connections. Therefore, each neuron in the next layer is generated by applying a filter to a local region of the previous layer. Each layer uses different filters, typically hundreds or thousands of them, and combines their results. There are also other layers, such as pooling/subsampling layers, used to filter out the most useful information. During network training, the CNN model automatically learns the values of the filters; for example, in image classification tasks, a CNN tries to find the optimal parameters for its filters. The last layer is used to classify the image. The main architecture of a CNN model consists of multiple components connected in a multilayer structure, including Convolution, Pooling, ReLU, and Fully Connected layers.
The Convolution layer (Conv) is the most important layer in a CNN architecture. It is based on signal processing theory, where taking the convolution helps extract important information from the data. The operation can be visualized as shifting a window called a kernel over the input matrix; the output value is the sum of the element-wise products of the kernel and the corresponding part of the input matrix. In image processing, the Conv operation can transform input information into characteristic features such as edges, directions, and color spots, and the kernel can be seen as a sliding window applied to the input image.
In CNNs, several basic mechanisms are commonly used, including stride, padding, and feature maps. Stride means shifting the filter by a certain number of pixels from left to right; padding is the addition of zeros around the input to maintain the size of the feature map; and the feature map is the result of the convolution operation applied to the input. During processing, filters are shifted over the entire image in steps of S (the stride), and in some cases P pixels with a given value (usually 0) are added around the edges of the image, which is called padding. The output feature map is then a matrix of size W2 × H2 × D2, where W2, H2, and D2 represent the width, height, and depth of the output.
Each filter has (F × F × D1) + 1 parameters, where the additional 1 is the bias (threshold) of the filter; these weights are kept constant as the filter slides over the entire input image during the Conv operation. This shared-weights property is important because it reduces the number of parameters that need to be learned during network training, so the total number of parameters needed for processing can be significantly reduced.
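For concreteness, the standard output-size and parameter-count formulas implied by the notation above can be sketched as follows (assuming K filters of size F × F × D1, stride S, and padding P):

def conv_output_shape(W1, H1, D1, F, K, S=1, P=0):
    # Standard convolution output-size formulas.
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K                          # one output channel per filter
    params = K * (F * F * D1 + 1)   # +1 is the bias (threshold) of each filter
    return (W2, H2, D2), params

# Example: a 224x224x3 image convolved with 64 filters of size 3x3, S=1, P=1
print(conv_output_shape(224, 224, 3, F=3, K=64, S=1, P=1))
# -> ((224, 224, 64), 1792)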
Using Conv has the following advantages:
Reducing the number of parameters: in traditional ANNs, the neurons in one layer are connected to all neurons in the next layer (fully connected), resulting in a large number of parameters to learn; this is a main cause of overfitting. With Conv, weight sharing and local receptive fields help reduce the number of parameters.
Learning features automatically: the parameters of the Conv operation, i.e., the values of the filters (kernels), are learned during training. These values come to represent features such as corners, edges, and color spots in the image, so using Conv helps build a model that learns features on its own.
The pooling layer is one of the main computational components in the CNN structure. Mathematically, pooling is essentially a matrix computation whose goal is to reduce the size of the matrix while still highlighting the features present in the input. In CNNs, the pooling operator is applied independently to each channel of the input matrix. This reduces the size of the data, which in turn reduces the amount of computation in the model.
Additionally, as the image size decreases through pooling, the convolutional layers effectively see larger regions of the original image. For example, if an image of size 224×224 is pooled to 112×112, a 3×3 region in the 112×112 image corresponds to a 6×6 region in the original image. Therefore, through pooling, the image size decreases and the convolutional layers learn larger-scale attributes. The pooling size is denoted K×K. The input of the pooling layer has size W×H×D, which is treated as D matrices of size W×H; for each matrix, the maximum value or the average value within each K×K region is computed and written into the result matrix. The rules for stride and padding are applied as in the convolution computation on the image.
Fig 2.6 Max Pooling with size 3×3, S=1, and P=0
In the example illustrated in the figure above, pooling is performed on an input matrix of size 5×5 with pooling parameters of size 3×3, stride S=1, and padding P=0. For each 3×3 sliding window, the maximum value is taken, and the window slides until the entire input matrix has been covered. This results in a new matrix that is significantly smaller than the input. Similarly, for average pooling, the average of the values in the sliding window is taken instead of the maximum used in max pooling.
Fig 2.7 Max Pooling and Average Pooling
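A minimal NumPy sketch of max and average pooling, matching the 5×5 example above (single channel, size 3×3, stride 1, no padding):

import numpy as np

def pool2d(x, size=3, stride=1, mode="max"):
    # Slide a size x size window over a single-channel matrix.
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.arange(25).reshape(5, 5)                  # toy 5x5 input matrix
print(pool2d(x, size=3, stride=1, mode="max"))   # 3x3 max-pooled result
print(pool2d(x, size=3, stride=1, mode="avg"))   # 3x3 average-pooled result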
Basically, convolution is a linear transformation. If all neurons were composed only of linear transformations, the neural network could be reduced to a single linear function (a function whose output is proportional to its input, so that the set of outputs forms a straight line), and the ANN would effectively reduce the problem to logistic regression. Therefore, each neuron needs a non-linear activation function. Some commonly used activation functions are:
The Sigmoid function takes a real-valued input and converts it to a value in the range (0, 1): if the input is a very large negative number, the output approaches 0, and if the input is a very large positive number, the output approaches 1. However, the Sigmoid function has the following drawbacks:
An obvious disadvantage is that when the input value is extremely large in magnitude (very negative or very positive), the gradient of this function is very close to 0. This means that the corresponding weights of the unit in question are almost never updated (the vanishing gradient problem). In addition, the Sigmoid function is not zero-centered, which makes convergence more difficult.
The ReLU function has been widely used in recent years for training neural networks. ReLU simply zeroes out values less than 0, i.e., ReLU(x) = max(0, x), so its behavior is easy to understand from the formula. Some notable advantages of ReLU over Sigmoid are:
Faster convergence: ReLU has been observed to converge about 6 times faster, likely because ReLU does not saturate at both ends the way Sigmoid does.
Faster computation: Sigmoid uses the exponential function and more complex formulas than ReLU, so it is more expensive to compute.
However, ReLU also has a disadvantage:
For nodes with a value less than 0, the output after the ReLU activation becomes 0, which can lead to a problem called "dying ReLU", where such nodes stop updating.
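For reference, a small NumPy sketch of the two activation functions discussed above:

import numpy as np

def sigmoid(x):
    # Maps any real input into the range (0, 1); saturates for large |x|,
    # which is the source of the vanishing-gradient problem mentioned above.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Keeps positive values and sets negative values to 0.
    return np.maximum(0.0, x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(x))  # [0.0067 0.2689 0.5 0.7311 0.9933]
print(relu(x))     # [0. 0. 0. 1. 5.]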
ResNet
ResNet [15] is a deep neural network architecture introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft Research in 2015. The name ResNet is short for "Residual Network", i.e., a neural network with residual connections.
ResNet is a CNN designed to work with hundreds of layers. One problem that arises when building a CNN with many convolutional layers is the vanishing gradient problem, in which the gradient passed back from later layers to earlier layers becomes very small, leading to slow and inaccurate training. To solve this problem, ResNet uses shortcut branches that pass the input of a layer to the output of a later layer through an addition operation. These branches are called "residual connections"; they allow the gradient to flow more easily through each layer and help the model learn higher-level features. ResNet has achieved good results on many image recognition tasks, such as image classification, object recognition, and object detection.
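A minimal sketch of a residual connection (PyTorch assumed; the batch normalization and projection shortcuts used in the full ResNet blocks are omitted for brevity):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # residual (shortcut) connection: add the input back

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])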
Different versions of ResNet are usually named after their number of layers, such as ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152; the different depths are designed to fit different tasks.
U-Net
U-Net [14] is a convolutional neural network (CNN) architecture developed by Olaf Ronneberger and colleagues for biomedical image segmentation, originally demonstrated on the segmentation of neuronal structures in electron microscopy images. The architecture achieved the best result on the EM segmentation challenge dataset from ISBI 2012 and has since been used successfully in various image segmentation tasks, including medical cell segmentation, document segmentation, object segmentation in satellite images, and many more.
U-Net's name comes from its U-shaped architecture, which is designed for image segmentation tasks. The input image is passed through a series of convolutional and pooling layers that reduce its size, and is then upsampled back to the original size through upsampling and convolutional layers, producing a segmentation prediction of the same size as the input image.
One of U-Net's important features is its ability to handle images of different sizes and generate corresponding predictions. This is particularly useful in medical applications, where images often come in different sizes and shapes. U-Net has also been extended with variants such as UNet++, 3D U-Net, etc., to address more complex problems and improve model accuracy.
Fig 2.10 shows the general U-Net design. In this figure, the feature channels are doubled along the contraction path and halved along the expansion path. Skip connections are shown as gray arrows, each integrating two feature maps.
U-Net's architecture consists of two branches: an encoding branch on the left and a decoding branch on the right.
Encoding: its task is feature extraction, finding the context of the image. A CNN plays the role of feature extractor, so the architecture of the encoding branch is similar to a standard CNN, but it does not require a fully-connected layer, because the output must be restored to the original image size by the decoding process.
In the encoding process, the length and width of the layers decrease gradually, from the input image size of 572x572 to only 32x32, while the depth or number of channels also increases gradually from 3 to 512.
Decoding: it consists of layers symmetric to those of the encoding branch. An upsampling process is applied to progressively increase the size of the layers. Finally, we obtain a mask image marking the predicted label of each pixel.
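A simplified sketch of one encoder step and the matching decoder step with a skip connection (PyTorch assumed; the channel sizes and the stand-in bottleneck tensor are illustrative, not the exact U-Net configuration):

import torch
import torch.nn as nn

enc = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
down = nn.MaxPool2d(2)                          # 2x2 max-pooling in the encoder
up = nn.ConvTranspose2d(128, 64, 2, stride=2)   # upsampling in the decoder
dec = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())

x = torch.randn(1, 1, 128, 128)
e = enc(x)                                  # encoder features: 1 x 64 x 128 x 128
bottleneck = torch.randn(1, 128, 64, 64)    # stand-in for the deeper layers
d = up(bottleneck)                          # back to 1 x 64 x 128 x 128
d = torch.cat([e, d], dim=1)                # skip connection: concatenate feature maps
print(dec(d).shape)                         # torch.Size([1, 64, 128, 128])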
Fig 2.13 Upsampling in Decoding
2.5.1 BCDU-Net
In another work [21], BCDU-Net, an extension of U-Net, achieved superior performance compared with other alternatives used in medical image segmentation. BCDU-Net's encoding path comprises four stages of two 3×3 convolutional filters each; each convolution is followed by a 2×2 max-pooling and a ReLU activation. Together these layers form a down-sampling process in which the number of feature channels is doubled, ensuring that the representation of the image is gradually extracted along the encoding path and its dimensionality is increased layer by layer. This model has two advantages over other U-Net-based segmentation approaches:
(1) The problem of learning redundant features is largely prevented by the use of densely connected convolutions in the last successive convolutions of the U-Net encoding path. (2) Each up-sampling stage in the decoding path is followed by batch normalization, which has been shown to improve the performance, speed, and stability of neural networks. The outputs of the batch normalization step are combined and fed to a deeper Bidirectional Convolutional LSTM [22] (BConvLSTM) to collect spatio-temporal information. BConvLSTM applies forward and backward ConvLSTMs to the input data in two layers, which enhances feature extraction by encouraging information flow along bidirectional streams. This highly applicable model has shown results exceeding those of previous video salient object detection designs.
Evaluation Methods
In the field of artificial intelligence, when a model is built and trained for a specific task, it is important to have a method to evaluate whether the model is effective and to compare it easily with other models and methods. The accuracy of a model is expressed as specific values, which helps researchers compare and evaluate models. Some commonly used evaluation methods for artificial intelligence models include:
Accuracy: Accuracy is the ratio of the number of correct predictions to the total number of predictions made.
Precision reflects how many of the samples predicted as positive are actually positive; equivalently, it measures the model's ability to avoid misclassifying negative samples as positive. The higher the precision, the fewer negative samples are misclassified:
Recall reflects the model's ability to identify positive samples; the higher the recall, the more of the actual positive samples the model finds:
The F1 score is the harmonic mean of precision and recall (a sketch of these formulas is given after this list):
AUC (Area Under the ROC Curve): AUC is the area under the ROC curve and is used to evaluate the classification performance of the model.
Confusion matrix: The confusion matrix is calculated by counting the number of true and false predictions for each type of prediction.
Cross-validation: Cross-validation is calculated by dividing the dataset into subsets, evaluating the model on each subset, and calculating the average error to evaluate the quality of the model.
Mean squared error (MSE): MSE is calculated by finding the average of the squared difference between the predicted and actual values.
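For reference, a small sketch of the standard formulas behind accuracy, precision, recall, and the F1 score, written in terms of true/false positives and negatives (TP, FP, TN, FN):

def classification_metrics(TP, FP, TN, FN):
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)     # correctness of the positive predictions
    recall = TP / (TP + FN)        # coverage of the actual positive samples
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(TP=90, FP=10, TN=85, FN=15))
# -> (0.875, 0.9, 0.857..., 0.878...)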
Data Storage
The DICOM (Digital Imaging and Communications in Medicine) format is a standard file format used in the field of medical imaging to store, transmit, and process medical image data.
Characteristics of the DICOM format:
DICOM was developed by the American College of Radiology and the National Electrical Manufacturers Association as a standard file format for storing, transmitting, and processing medical image data.
The DICOM format can store many types of medical image data, including X-ray, ultrasound, computed tomography (CT), magnetic resonance imaging (MRI), and many other types of medical images.
DICOM also allows the storage of metadata about the image data, including information about the patient, the medical facility, the type of equipment, and other settings related to the scanning session. This information is extremely important for ensuring the consistency and compatibility of medical image data. Different software and medical imaging devices may not all handle the DICOM format natively, so it may be necessary to use conversion tools to convert data to or from the DICOM format.
DICOM is used as a standard file format to share medical image data, ensuring consistency and compatibility between different software and devices in the field of medical imaging.
Popular medical image analysis software and tools such as ImageJ, Slicer, and ITK-SNAP all support the DICOM format and allow users to process medical image data in this format.
DICOM allows for the storage of multiple medical image files in a single DICOM file, enabling users to store various types of medical images for a patient in one file. The DICOM format is also widely used in medical research to store image data and provide compatibility between different software and devices for data analysis.
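As an illustration, a DICOM slice and its metadata can be read with the pydicom package (the file name is hypothetical; PixelSpacing and SliceThickness are standard DICOM attributes that are also used later for volume computation):

import pydicom

ds = pydicom.dcmread("slice_0001.dcm")     # hypothetical file name
image = ds.pixel_array                     # 2D array of raw pixel intensities
print(ds.PatientID, ds.Modality)           # patient / acquisition metadata
print(ds.PixelSpacing, ds.SliceThickness)  # in-plane spacing (mm) and slice thickness (mm)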
NIfTI (Neuroimaging Informatics Technology Initiative) is a file format for storing digital medical images, especially for brain imaging studies The NIfTI format includes two versions: NIfTI-1 and NIfTI-2.
Some characteristics of the NIfTI format:
The NIfTI format was developed to improve compatibility and efficiency compared with earlier medical image file formats such as ANALYZE. NIfTI-1 data can be stored either as a separate header/image pair (.hdr and .img) or as a single .nii file; NIfTI-2 uses a single .nii file to store the image and its related information.
NIfTI allows for the storage of various types of medical images, including X- ray, ultrasound, MRI, CT, and PET images.
NIfTI includes image-specific information such as size, color depth, resolution, and view angle, as well as medical information such as patient name, ID, medical history, and diagnostic results.
NIfTI is one of the commonly used standard formats in medical image research and is supported by popular medical image analysis software such as FSL (FMRIB Software Library), AFNI (Analysis of Functional NeuroImages), and SPM (Statistical Parametric Mapping).
METHODOLOGY
Dataset
As the first dataset for training and validation in this project, we use data from The Cancer Imaging Archive (TCIA). For many nations worldwide, lung diseases have caused significant loss of life and financial resources and have left patients with numerous long-term effects. As a result, the focus on medical examinations to safeguard human health in general, and the need to diagnose lung disease in particular, is growing.
The dataset contains 4DCT images of 60 patients, provided by three institutions for the 2017 Lung CT Segmentation Challenge. The patients were selected because they had thoracic diseases and were undergoing radiation treatment planning. Their CT images were acquired for the auto-segmentation grand challenge and made publicly available on TCIA. The images are provided in DICOM format with resolutions of approximately 512 × 512 pixels. The data consists of 36 training datasets, 12 off-site test datasets, and 12 live test datasets, for a total of 9,593 images. All ground truth was manually delineated at the clinics that used it for treatment planning, and was then reviewed (and edited if needed) against the RTOG 1106 contouring atlas to ensure consistency among the 60 patients. Our utilized dataset contains only the 36 training datasets and the 12 live test datasets. Each training dataset is labeled LCTSC-Train-Sx-yyy, with Sx and yyy representing the institution and the dataset ID within the institution, respectively. Similarly, each live test dataset is labeled LCTSC-Test-Sx-20y, with Sx and 20y representing the institution and the dataset ID within the institution, respectively.
The second dataset we use includes 7 patients and was collected from Bach Mai Hospital. The images are also provided in DICOM format with resolutions of roughly 512 × 512 pixels. For these patients we were also given the total lung volume, right lung volume, and left lung volume, so we can compare them with the output of our proposed algorithm and thereby demonstrate its efficiency.
Ordinal number | Data distribution | Number of patients
Pre-processing data
In this stage, we prepare the data by pre-processing the raw input images to optimize the training results of the deep learning network. We apply a dilation morphological process and a threshold-based model to the CT scans. To improve the overall segmentation result and forecast, we must focus the network on training only a specific set of image features. The steps of the proposed image conversion are as follows:
Image extraction using thresholding: we extract the patients' CT scans from the DICOM dataset and convert them into 2D images. We remove the redundant background and acquire the ROI by keeping HU values in the range (-512, 512).
Image binarization: in this step, we convert the 2D scans to binary images with two surface values (black and white). The ROI in the lung is presented in black, with value zero.
Dilation morphological operations: we apply a morphological operation [27] to the binarized images to extract the ROI. After binarization, there are still unwanted noises, i.e., white regions surrounding the lungs, and the process can also introduce holes or erosions in the images. Morphological operations are therefore used to remove such redundancies.
Result: Combine images post-morphology with binarized images.
Pre-processing algorithm of the proposed BCDU-Net method: the algorithm follows the four steps above, i.e., thresholded extraction, binarization (X = image2binary(X)), dilation, and combination; a sketch is given below.
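Below is a rough Python sketch of these pre-processing steps (pydicom and SciPy are assumed; the binarization threshold of -320 HU and the number of dilation iterations are illustrative assumptions, not the exact values used in our pipeline):

import numpy as np
import pydicom
from scipy import ndimage

def preprocess_slice(path, hu_range=(-512, 512), lung_threshold=-320):
    ds = pydicom.dcmread(path)
    # 1) Extract the scan and convert raw values to Hounsfield units (HU),
    #    then keep only the HU range of interest.
    hu = ds.pixel_array * float(ds.RescaleSlope) + float(ds.RescaleIntercept)
    hu = np.clip(hu, hu_range[0], hu_range[1])
    # 2) Binarize: low-HU (air/lung) regions form the candidate lung mask.
    lung_mask = hu <= lung_threshold            # assumed threshold
    # 3) Dilation morphological operation to remove small noise and close holes.
    lung_mask = ndimage.binary_dilation(lung_mask, iterations=2)
    lung_mask = ndimage.binary_fill_holes(lung_mask)
    # 4) Combine: keep the HU values only inside the cleaned lung mask.
    return np.where(lung_mask, hu, hu_range[0])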
Fig 3.2 Pre-processing on Lung dataset (panels: Input, Optimized, ROI, Pre-processed)
In Figure 3.2, we present the four stages of the lung CT data going through pre-processing. The first image is the original 2D input lung scan extracted from the DICOM dataset. Next, we convert the 2D image to a binary image with two surface values, black and white; the ROI is displayed in black, as can be seen in the "Optimized" image. Then, we apply the HU range (-512, 512) and the morphological operation to extract the ROI of the scans, shown in the third image. The final image shows the result of the pre-processing stage, in which all redundancies have been efficiently removed.
Deep Learning Based-model Segmentation
Inspired by U-Net, BConvLSTM, and dense convolutions, BCDU-Net was created as shown in Figure 3.3. The network exploits the strengths of both BConvLSTM states and densely connected convolutions. We detail the different parts of the network in the next subsections.
The encoding path is made up of four steps, each consisting of two 3×3 convolutional filters followed by a 2×2 max-pooling function and a ReLU. As the layers cascade down, the number of feature maps is doubled at each step, image representations are extracted, and the dimensionality of these representations increases sequentially. The final layer of the encoding path eventually produces a high-dimensional image representation with rich semantic information. In our proposed method, we use densely connected convolutions to target the redundancy issue [28]. The block sequence of the last convolutional layer in the encoding path is shown in Figure 3.4. Based on the concept of "collective knowledge", the feature maps are reused throughout the network, helping it obtain a diverse yet non-redundant set of feature maps and improving its performance by encouraging information flow and feature reuse.
Fig 3.3 BCDU-Net with bi-directional ConvLSTM & densely connected convolution
Fig 3.4 Dense layer of the BCDU-Net
3.3.2 Decoding path
At the start of each step in the decoding path, an up-sampling function is applied to the feature maps of the previous layer. In the conventional U-Net, the corresponding feature maps are cropped and copied from the encoding to the decoding path and then concatenated with the output of the up-sampling function. In our proposed method, however, we apply BConvLSTM to process these two kinds of feature maps in a more sophisticated way. The expanding path then progressively increases the size of the feature maps until the original size of the input image is reached in the final layer.
The distribution of the activations in the intermediate layers varies during training, which slows down the training process tremendously, since at every training step every layer must constantly adapt itself to a new distribution. Batch Normalization (BN) [29] standardizes the inputs of a layer by subtracting the batch mean and dividing by the batch standard deviation. In effect, BN increases the stability of a neural network and speeds up the training process, so the overall performance of the model is also improved.
The BConvLSTM layer then takes in the output of the BN step. Conventional LSTMs do not consider spatial correlation, because they use full connections in the input-to-state and state-to-state transitions. ConvLSTM, proposed by Xingjian et al. [30], integrates convolutions into the input-to-state and state-to-state transitions to address this problem. The input gate, output gate, forget gate, and memory cell in the network act as controlling gates to access, update, and clear the memory cells, respectively.
Lung Volume Calculation
After training the model as described in the previous section, we obtain the optimal set of weights. We load the model with these weights and import the images from the DICOM files for lung segmentation. The output is a NumPy array with values (0, 1, 2); displaying this array shows a segmented lung image in black and white, where the predicted lung area is white and the rest is black. Each image is composed of pixels, and each pixel has a value representing the characteristics of the image at that point; for the lung images output by the segmentation process, the values are 0 for background, 1 for the left lung, and 2 for the right lung. Therefore, to calculate the lung volume, we must first calculate the lung area in each tomographic slice by counting the number of non-zero pixels in the NumPy array. However, this area is measured in pixels. The DICOM file contains the medical record of each patient, including the parameters of the computed tomography machine used for the examination; these settings may differ between machines and between patients, but they are all stored in the DICOM file. To calculate the patient's lung volume, we therefore take the "Pixel Spacing" parameter, which is the distance between pixels (the length of a pixel's edges), and the "Slice Thickness" parameter, which is the thickness of each slice.
From there, the formula to calculate the lung volume (sketched in the example below) is V = s × space × thick, in mm³ (dividing by 1,000 gives the volume in ml), where:
“s” is the total lung area, in pixels, over all sections;
“space” is the area of one pixel in mm²: space = Pixel Spacing[0] × Pixel Spacing[1];
“thick” is the slice thickness (Slice Thickness) in mm; within one scan this parameter is constant, but it may differ between scans depending on the purpose of the acquisition and the scanner used.
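A minimal Python sketch of this computation (pydicom assumed; the per-slice masks are those produced by the segmentation model, and the file paths are illustrative):

import numpy as np
import pydicom

def lung_volume_ml(masks, dicom_files):
    # masks: list of 2D arrays with values 0 (background), 1 (left lung), 2 (right lung)
    total_mm3 = 0.0
    for mask, path in zip(masks, dicom_files):
        ds = pydicom.dcmread(path)
        space = float(ds.PixelSpacing[0]) * float(ds.PixelSpacing[1])  # mm^2 per pixel
        thick = float(ds.SliceThickness)                               # mm per slice
        s = np.count_nonzero(mask)          # lung area of this slice, in pixels
        total_mm3 += s * space * thick      # V = s x space x thick for this slice
    return total_mm3 / 1000.0               # convert mm^3 to ml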
With the images collected from the segmentation process, we obtain separated right and left lung masks. We then propose to apply these masks to build a 3D lung prototype. This process uses the Marching Cubes algorithm to build the 3D model; this algorithm is often adopted in medical visualizations such as CT and MRI images, for special effects, or for 3D modelling.
The algorithm proceeds through the scalar field, taking eight neighbouring locations at a time (thus forming an imaginary cube) and determining the polygon(s) needed to represent the part of the surface that passes through this cube. The individual polygons are then fused into the desired surface, as shown in Fig 3.5.
Fig 3.5 Look up table of marching cubes: 15 different configurations
This is accomplished by treating each of the 8 scalar values as a bit in an 8-bit integer, creating an index into a precalculated array of the 2⁸ = 256 possible polygon configurations within the cube. The appropriate bit is set to one if the scalar value is higher than the iso-value and to zero if it is lower; after all eight scalars have been checked, the resulting value is the index into the polygon-configuration array.
Fig 3.6 Lung: a. Lung mask, b. After using Gaussian filter, c. 3D Lung image
Based on the above application of the Marching Cubes algorithm, various cube configurations are extracted from the images to build the 3D model. Due to the triangular nature of the algorithm, the resulting 3D model may have a rough shape.
As such, we apply a Gaussian filter to blur out rough edges before feeding the masks into the algorithm. After that, all cubes with their remaining triangular shapes are traversed through the entire lung volume to obtain the 3D model. The whole process is presented in Fig 3.6.
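A minimal sketch of the smoothing and surface-extraction steps (scikit-image and SciPy assumed; the stacked-mask file name and the sigma/level values are illustrative):

import numpy as np
from scipy import ndimage
from skimage import measure

volume = np.load("lung_masks.npy")                # hypothetical 3D stack of binary masks
smoothed = ndimage.gaussian_filter(volume.astype(float), sigma=1.0)   # blur rough edges
verts, faces, normals, values = measure.marching_cubes(smoothed, level=0.5)
print(verts.shape, faces.shape)                   # mesh vertices and triangles of the surface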
From the actual patient data collected from the hospital, we found that the slice thickness is inconsistent. We pick the scans with a thickness of 5 mm, since 5 mm slices are easier to work with in 3D partitioning and reconstruction. We then employ the interpolation formula in Equation (1), shown below, to create multiple 1 mm-thick slices with adjustable thickness.
Equation (1) is a linear interpolation between two known points. If the two known points are given by (x0, y0) and (x1, y1), the linear interpolant is the straight line between these points. For a value x in the interval (x0, x1), the value y along the straight line is given by Equation (1): y = y0 + (y1 - y0)(x - x0)/(x1 - x0). This formula is applied to each pair of adjacent 5 mm slices to generate an interpolated linear equation, from which the pixel values of the 1 mm slices are inferred.
This allows us to improve the accuracy of the 3D model reconstruction. Fig 3.7 shows the results of applying the interpolation formula to obtain slices from 1 mm to 5 mm thickness.
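A small sketch of this slice interpolation (the 5 mm spacing and 1 mm step follow the text above; the slice arrays are illustrative):

import numpy as np

def interpolate_slices(slice0, slice1, x0=0.0, x1=5.0, step=1.0):
    # Generate intermediate slices between two known slices at positions x0 and x1,
    # pixel-wise, using y = y0 + (y1 - y0) * (x - x0) / (x1 - x0).
    positions = np.arange(x0 + step, x1, step)           # e.g. 1, 2, 3, 4 mm
    t = (positions - x0) / (x1 - x0)
    return [slice0 + (slice1 - slice0) * ti for ti in t]

s0 = np.zeros((512, 512))                 # illustrative 5 mm slice
s1 = np.ones((512, 512))                  # next 5 mm slice
new_slices = interpolate_slices(s0, s1)
print(len(new_slices), new_slices[0].mean())   # 4 interpolated slices; first has mean 0.2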
Experimental Results
For the object segmentation problem, performance is often measured as the ratio between the predicted area and the true area of the object in the image. Since the proposed network is a modified version of the standard U-Net, we summarize the accuracy (AC) and DSC of the original U-Net and current alternatives in Table 3.2. The Dice coefficient (DSC), whose formula is shown in Fig 3.8, also known as the overlap index, is the most widely used statistical validation metric for evaluating medical image segmentation; it is useful for studying both the reproducibility and the accuracy of a segmentation. The accuracy score in machine learning is an evaluation metric that measures the number of correct predictions made by a model relative to the total number of predictions; we calculate it by dividing the number of correct predictions by the total number of predictions.
Fig 3.8: Formula of Dice coefficient
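The formula in Fig 3.8 corresponds to the standard Dice coefficient, 2|A∩B| / (|A| + |B|); a minimal sketch computed from two binary masks:

import numpy as np

def dice_coefficient(pred, target):
    # DSC = 2 * |prediction AND ground truth| / (|prediction| + |ground truth|)
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum())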
In model training, there are parameters such as epoch, batch size, and learning rate. An epoch is a concept used in machine learning to describe training a model: each epoch represents running the entire training dataset through the model once. During training, a model is trained over many epochs to improve its predictive ability. After each epoch, the weights of the model are updated to minimize the loss between the predicted result and the actual value. The number of epochs needed to train a model depends on many factors, such as the complexity of the problem, the amount of training data, the model's architecture, the learning rate, and the weight update method. In practice, the number of epochs is often determined by testing many different values until an optimal result is reached. However, using too many epochs can lead to overfitting, where the model performs well on the training data but not on the test dataset.
Batch size describes the number of samples used to compute the gradients and update the weights of a model during training. When training a model, the training data is divided into batches, each containing an equal number of samples. The batch size affects the training speed and the accuracy of the model. The larger the batch size, the fewer weight updates the model performs and the faster the training, but more memory is required to store the gradients and more time is needed to compute them. The smaller the batch size, the more weight updates are performed and the higher the accuracy can be, but the training is slower. When choosing the batch size, it is therefore necessary to consider the computing resources (CPU/GPU) and memory of the machine to avoid running out of memory while maintaining an effective training speed.
The learning rate is a parameter that specifies how much the weights of the model are updated during loss function optimization. During training, the model optimizes the loss function by adjusting its weights to minimize the error between the predicted value and the actual value. The learning rate determines the optimization step size, i.e., the magnitude of the weight update in each training loop. If the learning rate is too small, training is very slow and may not reach an optimal result; if it is too large, training can overshoot, and the loss increases instead of decreasing when the weight update steps are performed. Choosing the right learning rate is therefore an important step in the training process; usually, the value is chosen by experimenting with different values until an optimal result is obtained.
Training the model on Colab using GPU, the initial parameters are set as follows:
To evaluate our model, we divided the evaluation into two parts: evaluating the results of the segmentation and evaluating the results of the lung volume calculation.
3.7.4 Evaluating the results of the segmentation
In this section, we evaluate the segmentation results on the 2017 Lung CT Segmentation Challenge dataset. The model was trained on 36 patients and evaluated on 12 patients.
We apply DSC and AC to measure both the reproducibility of repeated manual annotations on CT images and the spatial overlap accuracy between the automatic and ground truth segmentations. In addition, we analyze the individual parts added to the original network to further evaluate their effect on the results.
According to Table 3.2, the results of all U-Net variations surpass other models such as Recurrent Residual U-Net. Additionally, U-Net combined with BConvLSTM in the skip connections is clearly superior to the standard U-Net (99.72 vs. 98.72) and produces a more accurate and finer segmentation output than the original U-Net. It also reaches a DSC of 97.31, compared with 95.02 for the original U-Net and 96.32 for R2U-Net. The results obtained with our proposed model are the best overall, with the highest accuracy of 99.80. This is primarily due to our pre-processing step, in which we changed the channels of the scans and further removed redundancies such as noise, holes, and erosions. Our method is also efficient in separating the right and left sides of the lung: we achieve an excellent DSC of 97.97 for the right (R) side and 97.73 for the left (L) side. In the 2017 Thoracic Auto-Segmentation Challenge, the highest DSC scores for lung segmentation were 97.00 for the R side and 98.00 for the L side [32]. Both methods were trained on the same dataset, yet we were able to produce a slightly better result for the right lung.
Table 3.2 Performance comparison of proposed network and the state-of-the-art alternatives on TCIA dataset
Year | Author | Method | AC (%) | DSC
3.7.5 Evaluating the results of lung volume calculation
To be used solely in this research, we collected data from 7 patients at the Radiology Centre of Bach Mai Hospital in Hanoi, Vietnam. Information on the left, right, and total lung volumes was obtained from the manufacturer's software supplied with the CT imaging system. Since this volumetric information provided by the CT imaging device is used by physicians in the diagnostic process, we use it as reference data, referred to below as ground truth (GT). Our aim in this test step is to check the accuracy of the proposed algorithm used in the program. Our work computes the total, left, and right lung volumes from various slice thicknesses; among all the slice thicknesses used, our model produces the best results with 1 mm slices. Table 3.3 displays the experimental results of our proposed work compared with the ground truth obtained from the hospital's software.
Table 3.3 Lung volume calculation summary
GT Right lung volume (ml)
GT Left lung volume (ml)
According to Table 3.3, the results of our method are nearly identical to the ground truth, with a deviation of less than 1% in accuracy for 6 of the patients.
We obtained approximately 98.93% accuracy across all patients when compared with the results of the commercial software, which demonstrates the high efficiency of our proposed method against actual results from Bach Mai Hospital.
In summary, with the introduction of a modified Bi-directional Convolutional LSTM U-Net (BCDU-Net), we have built an automatic system that segments the lungs and measures total lung volume (TLV) with highly accurate results at a low cost. We were thereby able to address a challenge Vietnamese doctors face in diagnosing and treating lung-related diseases. In addition, 3D reconstruction of the lung has been shown to play an essential role in helping medical specialists detect and monitor lung diseases early. Our proposed BCDU-Net-based model for medical image applications has shown exceptional performance relative to the equivalent state-of-the-art methods. Furthermore, we addressed the issue of the extremely high cost of most medical instrumentation, which is critical to our goal of introducing new affordable AI-based biomedical software to a broader market in Vietnam.
[1] Forum of International Respiratory Societies, "The Global Impact of Respiratory Disease – Third Edition", European Respiratory Society, 2021. Accessed 22 September 2021.
[2] Charu C. Aggarwal, "Artificial Intelligence", IBM T. J. Watson Research Center, Yorktown Heights, NY, USA.
[3] J. McCarthy (Dartmouth College), M. L. Minsky (Harvard University), N. Rochester (I.B.M. Corporation), and C. E. Shannon (Bell Telephone Laboratories), "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence", August 31, 1955.
[4] James Moor, "The Dartmouth College Artificial Intelligence Conference: The Next Fifty Years", 2006.
[5] Taiwo Oladipupo Ayodele, "Types of Machine Learning Algorithms", University of Portsmouth, United Kingdom.
[6] Srivatsa Sarat Kumar Sarvepalli, "Deep Learning in Neural Networks: The Science Behind an Artificial Brain", Liverpool Hope University.
[8] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-Based Learning Applied to Document Recognition", 1998.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", 2012.
[10] Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", 2015.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition", 2016.
[12] T. Sharma and J. L. Raheja, "Artificial Neural Networks: A Brief Overview", International Journal of Computer Applications, 2017.