Hanoi University of Science and Technology
School of Information and Communication Technology

Master Thesis in Data Science

Unified Deep Neural Networks for Anatomical Site Classification and Lesion Segmentation for Upper Gastrointestinal Endoscopy

NGUYEN DUY MANH
manh.nd202657m@sis.hust.edu.vn

Supervisor: Dr. Tran Vinh Duc

Hanoi, 10-2022

Author's Declaration

I hereby declare that I am the sole author of this thesis. The results in this work are not complete copies of any other works.

STUDENT
Nguyen Duy Manh

Contents

Abstract
List of Figures
List of Tables
List of Acronyms
1 Introduction
  1.1 General introduction
  1.2 Objectives
  1.3 Main contributions
  1.4 Outline of the thesis
2 Artificial Intelligence and Machine Learning
  2.1 Basic concepts
  2.2 Types of learning
    2.2.1 Supervised learning
    2.2.2 Unsupervised learning
    2.2.3 Reinforcement learning
  2.3 Techniques
    2.3.1 Deep Learning
      2.3.1.1 Deep Learning and Neural Networks
      2.3.1.2 Perceptron
      2.3.1.3 Feed forward
      2.3.1.4 Recurrent Neural Network
      2.3.1.5 Deep Convolutional Network
      2.3.1.6 Training a Neural Network
    2.3.2 Convolutional Neural Network
      2.3.2.1 Image kernel
      2.3.2.2 The convolution operation
      2.3.2.3 Motivation
      2.3.2.4 Activation function
      2.3.2.5 Pooling
    2.3.3 Fully convolutional network
    2.3.4 Some common convolutional network architectures
      2.3.4.1 VGG
      2.3.4.2 ResNet
      2.3.4.3 DenseNet
      2.3.4.4 UNet
    2.3.5 Vision Transformer
      2.3.5.1 The Transformer
      2.3.5.2 Transformers for Vision
    2.3.6 Multi-task learning
    2.3.7 Transfer learning
    2.3.8 Avoid overfitting
3 Methodology
  3.1 EndoUNet
    3.1.1 Overall architecture
    3.1.2 Encoder
    3.1.3 Segmentation decoder
    3.1.4 Classifiers
  3.2 SFMNet
    3.2.1 Overall architecture
    3.2.2 Encoder
    3.2.3 Compact generalized non-local module
    3.2.4 Squeeze and excitation module
    3.2.5 Feature-aligned pyramid network
    3.2.6 Classifiers
  3.3 Metrics and loss functions
  3.4 Multi-task training
4 Experiments
  4.1 Datasets
  4.2 Data preprocessing and data augmentation
  4.3 Implementation details
  4.4 Experimental results
5 Conclusion and future work
References

Abstract

Image Processing is a subfield of computer vision concerned with comprehending and extracting data from digital images. There are several applications for image processing in various fields, including face recognition, optical character recognition, manufacturing automation inspection, medical diagnostics, and tasks connected to autonomous vehicles, such as pedestrian detection. In recent years, the deep neural network has become one of the most popular image processing approaches due to a number of significant advancements.

The use of machine learning in biomedical applications can be structured into three main orientations: (1) as a computer-aided diagnosis to help physicians reach an efficient and early diagnosis, with better harmonization and fewer contradictory diagnoses; (2) to enhance the medical care of patients with better-personalized therapies; and (3) to improve human wellbeing, for example by analyzing the spread of disease and social behaviors in relation to environmental factors [1]. In this work, I propose models for the first orientation that are capable of handling multiple tasks simultaneously for the upper gastrointestinal (GI) tract. The models were evaluated on a dataset of 11469 endoscopic images and produced relatively positive results.

List of Figures

2.1 Reinforcement learning components
2.2 Relationship between AI, ML, and DL
2.3 Neural Network
2.4 Illustration of a deep learning model [2]
2.5 Perceptron
2.6 Architecture of a CNN [3]
2.7 Example of convolution operation [4]
2.8 Sparse connectivity, viewed from below [2]
2.9 Sparse connectivity, viewed from above [2]
2.10 Common activation functions [5]
2.11 Max pooling
2.12 Average pooling
2.13 Architecture of an FCN [6]
2.14 Architecture of VGG16 [7]
2.15 A residual block [8]
2.16 DenseNet architecture vs ResNet architecture [9]
2.17 UNet architecture [10]
2.18 Attention in Neural Machine Translation
2.19 The Transformer - model architecture [11]
2.20 Vision Transformer architecture [12]
2.21 Common form of multi-task learning [2]
2.22 The traditional supervised learning setup
2.23 Transfer learning
3.1 Architecture of EndoUNet
3.2 VGG19-based shared block
3.3 ResNet50-based shared block
3.4 DenseNet121-based shared block
3.5 EndoUNet decoder configuration
3.6 SFMNet architecture
3.7 Grouped compact generalized non-local (CGNL) module [13]
3.8 A Squeeze-and-Excitation block [14]
3.9 Overview comparison between FPN and FaPN [15]
3.10 Feature alignment module [15]
3.11 Feature selection module [15]
4.1 Demonstration of the upper GI tract
4.2 Some samples in anatomical dataset
4.3 Some samples in lesion dataset
4.4 Some samples in HP dataset
4.5 Image augmentation
4.6 Learning rate in training phase
4.7 EndoUNet - Confusion matrix on anatomical site classification task on a fold
4.8 SFMNet - Confusion matrix on anatomical site classification task on a fold
4.9 Confusion matrices on lesion classification task on a fold
4.10 Some examples of the lesion segmentation task

List of Tables

3.1 Detailed settings of MiT-B2 and MiT-B3
4.1 Number of images in each anatomical site and lighting mode
4.2 Accuracy comparison on the three classification tasks
4.3 Dice Score comparison on the segmentation task
4.4 Number of parameters and speed of models

List of Acronyms

GI    Gastrointestinal
HP    Helicobacter Pylori
AI    Artificial Intelligence
ML    Machine Learning
DL    Deep Learning
NN    Neural Network
DNN   Deep Neural Network
CNN   Convolutional Neural Network
RNN   Recurrent Neural Network
MTL   Multi-task Learning
RL    Reinforcement Learning

The loss $L_{pos}$ for the anatomical site classification task is a multi-class cross-entropy loss defined as
follows:

$$L_{pos} = -\sum_{i=1}^{N} \mu_i^{pos} \sum_{j=1}^{C_{pos}} y_i^{pos}(j) \log \hat{y}_i^{pos}(j) \tag{3.3}$$

where $N$ is the number of training samples.

The loss $L_{le}$ for the lesion type classification task is another multi-class cross-entropy loss defined as follows:

$$L_{le} = -\sum_{i=1}^{N} \mu_i^{le} \sum_{j=1}^{C_{le}} y_i^{le}(j) \log \hat{y}_i^{le}(j) \tag{3.4}$$

The loss $L_{hp}$ for HP classification is the binary cross-entropy loss defined as follows:

$$L_{hp} = -\sum_{i=1}^{N} \mu_i^{hp} \left[ y_i^{hp} \log \hat{y}_i^{hp} + (1 - y_i^{hp}) \log (1 - \hat{y}_i^{hp}) \right] \tag{3.5}$$

The loss $L_{seg}$ is the primary loss that drives the learning process of the lesion segmentation task. It is defined as the sum of the binary cross-entropy loss and the dice loss as follows:

$$L_{seg} = -\sum_{i=1}^{N} \mu_i^{seg} \left[ \mathrm{BCE}(y_i^{seg}, \hat{y}_i^{seg}) + \mathrm{DICE}(y_i^{seg}, \hat{y}_i^{seg}) \right] \tag{3.6}$$

The total loss function for training is a weighted sum of the component loss functions, as shown below:

$$L_{total} = \lambda_1 L_{pos} + \lambda_2 L_{le} + \lambda_3 L_{hp} + \lambda_4 L_{seg} \tag{3.7}$$

where $\lambda_t$ indicates the importance level of the $t$-th task. In this work, the four weights are set equal ($\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4$), so all tasks contribute equally.

Chapter 4 Experiments

4.1 Datasets

The training data used in this work is real data collected from endoscopy findings of patients at the Institute of Gastroenterology and Hepatology and Hanoi Medical University Hospital. There are three sub-datasets: one for anatomical site classification, another for lesion segmentation and classification, and the last for HP classification. They are combined into a single large dataset for training and testing.

Figure 4.1: Demonstration of the upper GI tract

Anatomical site dataset. This dataset includes 5546 images of 10 anatomical sites, all of which are captured directly from the endoscopic machine in four lighting modes: WLI (White Light Imaging), FICE (Flexible spectral Imaging Color Enhancement), BLI (Blue Light Imaging), and LCI (Linked Color Imaging). The images in this dataset do not contain any lesions and have labels specifying the anatomical site. Table 4.1 describes the details of this dataset.

Table 4.1: Number of images in each anatomical site and lighting mode

Anatomical site      WLI   FICE  BLI   LCI   TOTAL
Pharynx              177   134   120   119   550
Esophagus            169   141   116   127   553
Cardia               163   120   132   140   555
Gastric body         174   135   124   120   553
Gastric fundus       170   130   126   128   554
Gastric antrum       155   143   131   125   554
Greater curvature    171   131   126   125   553
Lesser curvature     155   140   134   126   555
Duodenum bulb        156   141   135   128   560
Duodenum             163   138   127   131   559

Figure 4.2: Some samples in anatomical dataset

Lesion dataset. This dataset contains 4104 images of five types of lesions: reflux esophagitis, esophageal cancer, gastritis, stomach cancer, and duodenal ulcer. The images in this dataset are annotated for both the classification and segmentation tasks. The numbers of images for the reflux esophagitis, esophageal cancer, gastritis, stomach cancer, and duodenal ulcer classes are 1335, 538, 1443, 538, and 250, respectively. Figure 4.3 shows some samples in the lesion dataset.

HP dataset. This dataset contains 1819 images, including HP-positive and HP-negative images. Figure 4.4 shows some samples in the HP dataset.

Figure 4.3: Some samples in lesion dataset

Figure 4.4: Some samples in HP dataset

4.2 Data preprocessing and data augmentation

Given that the images come from multiple sources and have different sizes, they are first resized to 480x480 before being fed into the model for training. Deep learning models usually require a large amount of data to perform well. Hence, we use data augmentation techniques to generate more data for the training phase. The training data is augmented on the fly with a probability of 0.5 (i.e., each image has a 50% chance of being augmented every time it is selected for training). The following techniques are used:
• Horizontal flip
• Vertical flip
• Rotate
• Shift
• Zoom in/out
• Motion blur
• Hue saturation

Figure 4.5 illustrates how an image is transformed by the augmentation techniques.

Figure 4.5: Image augmentation

4.3 Implementation details

Experimental environment: the experiments were conducted
in the environment with the following specifications:
• OS: Ubuntu 20.04.3 LTS 64 bit
• RAM: 128GB
• CPU: AMD Ryzen 3970X 3.7GHz 32 cores / 64 threads
• GPU: NVIDIA RTX 3090 24GB

Framework: the models are implemented using PyTorch, a Python framework that tracks all computations performed on learnable weights and provides many common neural network modules, making the models easier to build and monitor.

Dataset preparation: the models are evaluated using 5-fold cross-validation. We split each dataset into five subfolds, and each fold of the combined dataset is then created by merging subfolds from the three datasets. In detail, in the anatomical site dataset, each subfold contains the same number of images for each anatomical site and lighting mode. In the lesion dataset, each subfold contains the same number of images of each lesion type. In the HP dataset, each subfold contains the same number of HP-positive and HP-negative samples. Finally, we create a marker vector µ to indicate the type of each sample.

Dynamic learning rate: in the training phase, linear warmup and cosine annealing are used to update the learning rate. The minimum learning rate is $10^{-6}$, and the maximum is $10^{-3}$. The learning rate increases rapidly in the first two epochs (warmup) and gradually decreases in the subsequent epochs following the cosine function. Figure 4.6 shows the change in learning rate during training.

Figure 4.6: Learning rate in training phase

4.4 Experimental results

We perform the following experiments to validate our models:
• An ablation study to evaluate the impact of three popular CNN backbones on EndoUNet: VGG19, ResNet-50, and DenseNet-121.
• An ablation study to evaluate the impact of two MiT configurations on the SFM-based model: MiT-B2 and MiT-B3.
• In the classification tasks, we train single-tasking instances of models, including VGG19, ResNet50, DenseNet121, and MiT-B3, each of them trained on separate data. We compare the performance of
multi-tasking models and the single-tasking models.
• In the lesion segmentation task, we train five single-tasking instances of UNet and five single-tasking instances of SFMNet, each trained on separate lesion data. Then, we compare the performance of the multi-tasking models against the single-tasking instances.
• To demonstrate the efficacy of transfer learning, two multi-tasking instances are trained without pre-trained parameters for comparison with the models that use them.

We discuss the models using pre-trained parameters first. Models that do not use pre-trained parameters are discussed later.

The effect of different backbones on model accuracy is analyzed in Table 4.2. Performance varies little across backbones on each dataset. On two datasets (anatomical site and lesion), the models perform rather well. The lowest accuracies for the anatomical site classification, lesion classification, and HP classification tasks are 97.07%, 98.51%, and 91.21%, respectively, all for the single-tasking VGG19. The multi-tasking models are generally superior to the single-tasking models. The best results are obtained by SFMNet with the MiT-B3 backbone for anatomical site classification (98.46%) and by EndoUNet with the ResNet50 backbone for lesion classification (99.63%) and HP classification (93.46%).

Table 4.2: Accuracy comparison on the three classification tasks

Method      Backbone                  Anatomical site classification  Lesion classification  HP classification
            VGG19 (cls only)          97.07 ± 0.29%                   98.51 ± 0.69%          91.21 ± 1.27%
            ResNet50 (cls only)       97.53 ± 0.29%                   98.79 ± 1.09%          91.87 ± 1.08%
            DenseNet121 (cls only)    97.65 ± 0.29%                   99.16 ± 0.76%          91.81 ± 1.23%
            MiT-B3 (cls only)         97.58 ± 0.86%                   99.45 ± 0.36%          91.43 ± 1.15%
EndoUNet    VGG19                     98.09 ± 0.30%                   99.58 ± 0.44%          93.13 ± 1.02%
EndoUNet    ResNet50                  98.00 ± 0.49%                   99.63 ± 0.26%          93.46 ± 0.83%
EndoUNet    ResNet50 (no pretrained)  95.86 ± 0.35%                   99.16 ± 0.41%          93.12 ± 0.71%
EndoUNet    DenseNet121               98.28 ± 0.50%                   99.44 ± 0.75%          93.19 ± 1.14%
SFM-based   MiT-B2                    98.30 ± 0.31%                   99.11 ± 0.76%          93.35 ± 0.79%
SFM-based   MiT-B3                    98.46 ± 0.41%                   99.54 ± 0.65%          93.29 ± 0.82%
SFM-based   MiT-B3 (no pretrained)    91.26 ± 0.41%                   99.00 ± 0.26%          93.21 ± 0.83%

The results of the segmentation task are shown in Table 4.3. In the majority of tests, the multi-tasking model outperforms its single-tasking counterpart, except for the SFMNet model on the duodenal ulcer and stomach cancer datasets. The combination of MiT-B3, CGNL, SE, and FaPN achieves the highest dice scores across all datasets, which demonstrates its effectiveness in the segmentation task. Figure 4.10 shows several examples of segmentation results.

Table 4.3: Dice Score comparison on the segmentation task

Method                Backbone                  Reflux esophagitis  Esophageal cancer  Duodenal ulcer  Gastritis      Stomach cancer
UNet (seg only)       ResNet50                  0.457 ± 0.011       0.807 ± 0.005      0.709 ± 0.021   0.444 ± 0.057  0.854 ± 0.021
SFM-based (seg only)  MiT-B3                    0.515 ± 0.002       0.839 ± 0.012      0.737 ± 0.008   0.477 ± 0.051  0.896 ± 0.009
EndoUNet              VGG19                     0.462 ± 0.014       0.807 ± 0.006      0.648 ± 0.024   0.419 ± 0.048  0.851 ± 0.009
EndoUNet              ResNet50                  0.464 ± 0.006       0.819 ± 0.009      0.676 ± 0.024   0.443 ± 0.065  0.860 ± 0.009
EndoUNet              ResNet50 (no pretrained)  0.318 ± 0.006       0.727 ± 0.009      0.488 ± 0.024   0.243 ± 0.065  0.798 ± 0.009
EndoUNet              DenseNet121               0.474 ± 0.008       0.824 ± 0.007      0.670 ± 0.014   0.457 ± 0.066  0.866 ± 0.014
SFM-based             MiT-B2                    0.493 ± 0.021       0.837 ± 0.012      0.704 ± 0.025   0.476 ± 0.074  0.885 ± 0.007
SFM-based             MiT-B3                    0.517 ± 0.007       0.847 ± 0.012      0.723 ± 0.009   0.502 ± 0.072  0.892 ± 0.008
SFM-based             MiT-B3 (no pretrained)    0.312 ± 0.033       0.695 ± 0.024      0.338 ± 0.019   0.193 ± 0.053  0.758 ± 0.009

Table 4.4 reports the number of parameters and the speed of the models. Despite having more parameters and a larger workload, the multi-tasking models attain speeds comparable to the single-tasking ones. SFMNet with the MiT-B3 backbone is the slowest of all the models at 21 FPS, indicating that the models can be enhanced to serve real-time applications.

Table 4.4: Number of parameters and speed of models

Method                Backbone                Number of parameters (million)  Speed (FPS)
                      VGG19 (cls only)        20.3                            36
                      ResNet50 (cls only)     26.6                            33
                      DenseNet121 (cls only)  7.5                             26
                      MiT-B3 (cls only)       44.8                            24
UNet (seg only)       ResNet50                38.5                            32
SFM-based (seg only)  MiT-B3                  45.6                            21
EndoUNet              VGG19                   26.2                            34
EndoUNet              ResNet50                41.2                            31
EndoUNet              DenseNet121             17.1                            24
SFM-based             MiT-B2                  26.4                            24
SFM-based             MiT-B3                  46.3                            21

Figures 4.7 and 4.8 show the confusion matrices of EndoUNet and SFMNet on one fold of the anatomical site classification task. In both models, the gastric body class has the lowest accuracy, at 88.07%, and it is easily confused with other similar stomach classes, including gastric fundus, gastric antrum, greater curvature, and lesser curvature. Figure 4.9 depicts the confusion matrices for the lesion classification task. On this dataset, the models perform quite well.

The models that do not use pretrained parameters reach relatively good results on the lesion classification and HP classification tasks, as shown in Table 4.2, but their accuracy on anatomical site classification is quite low compared to the other models. In the segmentation tasks, the performance drops significantly without pretrained parameters. Thus, we can see that transfer learning helps to train the models faster, thereby saving the resources needed.

Figure 4.7: EndoUNet - Confusion matrix on anatomical site classification task on a fold

Figure 4.8: SFMNet - Confusion matrix on anatomical site classification task on a fold

Figure 4.9: Confusion matrices on lesion classification task on a fold

Figure 4.10: Some examples of the lesion segmentation task

Chapter 5 Conclusion and future work

In this work, the author proposed two unified models to solve four tasks for EGD images: anatomical site classification, lesion classification, HP classification, and lesion segmentation. The proposed models are jointly trained on the mixed data derived from multiple
sources. Multi-task learning forces the models to learn a powerful unified representation across all the tasks and brings significant benefits. Overall, the proposed models achieve high accuracy in the classification tasks while still yielding competitive results compared to the single-task models trained separately. This work evaluated the effectiveness of popular backbones and contextual information aggregation modules. In addition, the proposed models are constructed in a modular way, making it simple to test the integration of information processing blocks.

Future work for this study could involve evaluating the models using several metrics and comparing them to other classification and segmentation models. It may also involve comparing the models' performance on several medical datasets. This study can be further improved by finding an effective way to combine the loss functions instead of simply adding them together.

References

[1] R. Zemouri, N. Zerhouni, and D. Racoceanu, “Deep learning in the biomedical applications: Recent and future status,” Applied Sciences, vol. 9, no. 8, p. 1526, 2019.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
[3] MathWorks, “What is a convolutional neural network?,” URL: https://www.mathworks.com/discovery/convolutional-neural-networkmatlab.html
[4] A. Dertat, “Applied deep learning - part 4: Convolutional neural networks,” URL: https://towardsdatascience.com/applied-deep-learning-part-4convolutional-neural-networks-584bc134c1e2, 2017.
[5] J. Shruti, “Introduction to different activation functions for deep learning,” URL: https://medium.com/@shrutijadon/survey-on-activation-functions-fordeep-learning-9689331ba092, 2018.
[6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
[7] M. Ferguson, R. Ak, Y.-T. T. Lee, and K. H. Law, “Automatic localization of casting defects with convolutional neural networks,” in 2017 IEEE international conference on big data (big data), pp. 1726–1735, IEEE, 2017.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on CVPR, pp. 770–778, 2016.
[9] PluralSight, “Introduction to densenet with tensorflow,” URL: https://www.pluralsight.com/guides/introduction-to-densenet-with-tensorflow
[10] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, pp. 234–241, Springer, 2015.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[13] K. Yue, M. Sun, Y. Yuan, F. Zhou, E. Ding, and F. Xu, “Compact generalized non-local network,” Advances in neural information processing systems, vol. 31, 2018.
[14] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018.
[15] S. Huang, Z. Lu, R. Cheng, and C. He, “Fapn: Feature-aligned pyramid network for dense image prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 864–873, 2021.
[16] P. M. Treuting, M. J. Arends, and S. M. Dintzis, “11 - upper gastrointestinal tract,” in Comparative Anatomy and Histology (Second Edition) (P. M. Treuting, S. M. Dintzis, and K. S. Montine, eds.), pp. 191–211, San Diego: Academic Press, second edition ed., 2018.
[17] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: a cancer journal for clinicians, vol. 71, no. 3, pp. 209–249, 2021.
[18] A. R. Pimenta-Melo, M. Monteiro-Soares, D. Libânio, and M. Dinis-Ribeiro, “Missing rate for gastric cancer during upper gastrointestinal endoscopy: a systematic review and meta-analysis,” European journal of gastroenterology & hepatology, vol. 28, no. 9, pp. 1041–1049, 2016.
[19] S. Menon and N. Trudgill, “How commonly is upper gastrointestinal cancer missed at endoscopy? a meta-analysis,” Endoscopy international open, vol. 2, no. 02, pp. E46–E50, 2014.
[20] Y. Shimodate, M. Mizuno, A. Doi, N. Nishimura, H. Mouri, K. Matsueda, and H. Yamamoto, “Gastric superficial neoplasia: high miss rate but slow progression,” Endoscopy International Open, vol. 5, no. 08, pp. E722–E726, 2017.
[21] J. McCarthy, “What is artificial intelligence,” URL: http://www-formal.stanford.edu/jmc/whatisai.html, 2004.
[22] C. E. IBM, “What is machine learning?,” URL: https://www.ibm.com/cloud/learn/machine-learning, 2020.
[23] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[24] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” The Journal of physiology, vol. 160, no. 1, p. 106, 1962.
[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[26] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR (Y. Bengio and Y. LeCun, eds.), 2015.
[29] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on CVPR, pp. 4700–4708, 2017.
[30] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
[31] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803, 2018.
[32] M. Yi-de, L. Qing, and Q. Zhi-Bai, “Automated image segmentation using improved pcnn model based on cross-entropy,” in Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, 2004., pp. 743–746, IEEE, 2004.
[33] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 240–248, Springer, 2017.

... build unified models to tackle multiple tasks relating to the upper gastrointestinal tract. The tasks include anatomical site classification, lesion classification, HP classification and lesion segmentation. ...

... sources for multiple tasks: anatomical site classification, lesion classification, HP classification, and segmentation. The output of the encoder is a high-level coarse feature map suitable for classification. ...

... anatomical site classification, lesion type classification, HP classification, and lesion segmentation tasks, respectively. Here $\mu_i^t$ = ... if the i-th sample is associated with the t-th task, and $\mu_i^t$ ...
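
To illustrate how the marker values $\mu_i^t$ gate the per-task losses in Equations (3.3) to (3.5) and how Equation (3.7) combines them, the following is a minimal Python sketch rather than the thesis implementation. The function names, the task keys `pos`, `le`, and `hp`, and the dictionary layout of a sample are hypothetical and chosen only for illustration. The cross-entropy terms are written as non-negative losses, the weights default to equal values, and the segmentation term is omitted for brevity.

```python
import math

TASKS = ("pos", "le", "hp")  # hypothetical task keys mirroring L_pos, L_le, L_hp


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Multi-class cross-entropy for one sample: -sum_j y(j) * log(y_hat(j))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for one sample with a scalar label and probability."""
    p = min(max(y_pred, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))


def total_loss(samples, weights=None):
    """Weighted sum over tasks of the marker-gated per-sample losses."""
    weights = weights or {t: 1.0 for t in TASKS}  # equal weights by default
    per_task = {t: 0.0 for t in TASKS}
    for s in samples:
        for t in TASKS:
            if not s["mu"].get(t, 0):
                continue  # marker is 0: this sample has no label for task t
            loss_fn = binary_cross_entropy if t == "hp" else cross_entropy
            per_task[t] += loss_fn(s["y"][t], s["p"][t])
    return sum(weights[t] * per_task[t] for t in TASKS)


samples = [
    # an anatomical-site sample: only the "pos" marker is set
    {"mu": {"pos": 1}, "y": {"pos": [0, 1, 0]}, "p": {"pos": [0.2, 0.5, 0.3]}},
    # an HP sample: only the "hp" marker is set
    {"mu": {"hp": 1}, "y": {"hp": 1.0}, "p": {"hp": 0.5}},
]
print(total_loss(samples))
```

In this toy batch each sample contributes only to the task whose marker is set, so a mixed batch drawn from the three sub-datasets can be trained with a single loss value. Unequal task weights can be tried by passing a different `weights` dictionary.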