Self-supervised Visual Feature Learning for Polyp Segmentation in Colonoscopy Images Using Image Reconstruction as Pretext Task

2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

Le Thi Thu Hong, Institute of Information Technology, AMST, Hanoi, Vietnam, lethithuhong1302@gmail.com
Nguyen Chi Thanh*, Institute of Information Technology, AMST, Hanoi, Vietnam, thanhnc80@gmail.com
Tran Quoc Long, University of Engineering and Technology, VNU, Hanoi, Vietnam, tqlong@gmail.com
*Corresponding author

Abstract— Automatic polyp detection and segmentation are desirable for colon screening because the polyp miss rate in clinical practice is relatively high. The deep learning-based approach to polyp segmentation has gained much attention in recent years due to its automatic feature extraction, which segments polyp regions with unprecedented precision. However, training these networks requires a large amount of manually annotated data, which is limited by the available resources of endoscopic doctors. We propose a self-supervised visual feature learning method for polyp segmentation to address this challenge. We adapt self-supervised visual feature learning with image reconstruction as a pretext task and polyp segmentation as a downstream task. UNet is used as the backbone architecture for both the pretext task and the downstream task. An unlabeled colonoscopy image dataset is used to train the pretext network. For polyp segmentation, we apply transfer learning on the pretext network: the polyp segmentation network is trained using a public benchmark dataset for polyp segmentation. Our experiments demonstrate that the proposed self-supervised learning method achieves better segmentation accuracy than a UNet trained from scratch. On the CVC-ColonDB polyp segmentation dataset, with only 300 annotated images, the proposed method improves the IoU metric from 76.87% to 81.99% and the Dice metric from 86.61% to 89.33%, compared to the baseline UNet.

Keywords: Polyp segmentation, medical image analysis, transfer learning, deep learning, self-supervised learning

I. INTRODUCTION

Colonoscopy is considered the gold-standard investigation for colorectal cancer screening. However, the polyp miss rate in clinical practice is relatively high due to different factors [1]. This presents an opportunity to use AI models to automatically detect and segment polyps, supporting clinicians in reducing the number of missed polyps. Recently, deep learning methods have been widely used to solve medical image segmentation problems, including polyp segmentation, due to their capacity to learn image features for the segmentation task [2]. Deep learning methods usually rely on a large amount of training data with manual labels. However, a polyp segmentation dataset may not always be available because annotation usually requires the expert knowledge of endoscopists. Thus, there is growing interest in developing methods that do not require a large number of annotations for learning features of colonoscopy images, avoiding time-consuming and expensive data annotation. Research directions include transfer learning, semi-supervised learning, unsupervised learning, and self-supervised learning.

Self-supervised visual feature learning allows visual features to be learned from large-scale unlabeled images. Generally, computer vision pipelines that employ self-supervised feature learning perform two tasks: a pretext task and a real (downstream) task [3]. The pretext task is the self-supervised learning task for learning visual representations. The learned representations or model weights obtained from the pretext task are used for the downstream task.
The real (downstream) task can be any recognition task, such as classification, detection, or segmentation, with insufficient annotated data samples. Self-supervised learning is a good method for exploiting unlabeled images to improve the performance of a deep model when only limited labeled data is available. This method not only helps to overcome the need for large amounts of annotated data but also helps to improve the robustness and uncertainty estimation of deep convolutional neural networks [4].

In this work, to address the challenge of limited labeled polyp data, we propose a novel method for training a polyp segmentation network, which formulates a self-supervised task for visual feature learning and decreases the cost of data annotation. Image reconstruction is proposed as the pretext task to improve the performance of the real polyp segmentation task. The visual features of colonoscopy images are learned by training a UNet on the image reconstruction pretext task. We use an unlabeled colonoscopy image dataset containing 8,500 images collected from Hospital 103 in Hanoi, Vietnam, to train the pretext network. Pixels and pixel channels (R, G, or B) in the input images are dropped in a random manner, and the original image serves as the label. After self-supervised pretext training is finished, the learned parameters serve as a pre-trained model and are transferred to the downstream task: polyp segmentation. The CVC-ColonDB [6] dataset, containing 300 labeled polyp segmentation images, was used for finetuning the polyp segmentation network. Our experiments show that the proposed method significantly improves the Dice metric for polyp segmentation compared to the baseline segmentation network.

In summary, the main contributions of our work are as follows: 1) We propose a self-supervised feature learning method for training a polyp segmentation network using image reconstruction as a pretext task. In the pretext task, pixels and/or pixel channels (R, G, or B) in the input images are dropped in a random manner, and the original image serves as the label. 2) The experimental results on a public polyp segmentation dataset show the efficacy of our method. In the experiments, we also study the effect of pretext task complexity and of the polyp segmentation network finetuning methods on the performance of the polyp segmentation task.

Fig. 1. Overview of the proposed self-supervised visual feature learning method for polyp segmentation.

The rest of the article is organized as follows: Section II reviews related research on deep learning for polyp segmentation and self-supervised learning for medical image analysis. Section III describes our proposed self-supervised visual feature learning method for polyp segmentation in colonoscopy images using image reconstruction as a pretext task in detail. Section IV outlines our experimental settings, results, and discussion. Finally, Section V summarizes and concludes this work.

II. RELATED WORK

A. Polyp segmentation using deep learning methods

A Computer-Aided Diagnosis (CADx) system for polyp segmentation in colonoscopy images can be an effective clinical tool that helps endoscopists achieve faster screening and higher accuracy [2]. However, precise polyp segmentation is still challenging due to variations of polyps in size, shape, texture, and color.
Like other medical imaging applications, the deep learning-based approach to polyp segmentation has gained much attention in recent years due to its automatic feature extraction, which segments polyp regions with unprecedented precision. In addition, public databases of polyp images have facilitated further research on deep learning models for polyp segmentation. Several publicly available benchmark datasets are used for training and evaluating the deep models. The CVC-ClinicDB [5] dataset consists of 612 images with corresponding ground truth masks of defined polyp regions. The Kvasir-SEG dataset [8] includes 1,000 polyp images with corresponding ground truth masks manually annotated by expert endoscopists. The ETIS-Larib [7] dataset contains 36 different types of polyps in 196 images, with ground truth masks annotated by experts. The CVC-ColonDB [6] dataset consists of 300 polyp images and their corresponding pixel-level annotated polyp masks. UNet [9], an encoder-decoder structure that uses skip connections to concatenate features from the encoding and decoding layers, is a popular strategy for solving medical image segmentation tasks, including polyp segmentation. Inspired by the success of UNet, several variants were proposed for polyp segmentation and yielded promising results, such as DoubleUNet [10], UNet++ [11], and ResUNet++ [12].

B. Self-supervised learning in the medical imaging domain

Self-supervised learning, which formulates a pretext task based on unannotated data for feature learning, has gained more and more popularity in recent years. Various types of pretext tasks have been proposed depending on the data type. For general image and video analysis problems, patch relative position prediction [13, 14], local context prediction [15], colorization [16], and image reconstruction [17] have been used in self-supervised learning. In the medical imaging domain, patients often have follow-up scans, and all unlabeled images are stored in PACS (Picture Archiving and Communication System) systems. At the same time, only limited labeled data is available because annotation usually requires expert knowledge. Thus, self-supervised learning is a good way of mining unlabeled images to improve deep neural network accuracy. In recent years, self-supervised learning has also been explored for medical imaging, though to a lesser extent. Jamaludin et al. [18] proposed a self-supervised learning method to predict the level of vertebral bodies: they used the re-identification of patients' MR scans (classifying whether two spinal MR images came from the same patient or not) as the pretext task and vertebral body level prediction as the real task. Tajbakhsh et al. [19] proposed a self-supervised learning method for lung lobe segmentation and nodule detection using rotation prediction as a pretext task. Ross et al. [20] proposed a method to exploit the potential of unlabeled endoscopic video data for surgical instrument segmentation: they defined the re-colorization of surgical videos as a pretext task and used the pre-trained features to initialize a surgical instrument segmentation network. In this article, different from previous works, we propose a novel self-supervised learning method that uses image reconstruction as the pretext task and polyp segmentation as the real task.

III. PROPOSED METHOD

A. The architecture of the proposed model

The proposed method, which adapts self-supervised visual feature learning for training a polyp segmentation network, is depicted in Fig. 1. Image reconstruction is used as the pretext task. We use UNet [9] as the backbone architecture for both the pretext and downstream tasks. The UNet architecture was developed for biomedical image segmentation and is increasingly used in medical image analysis applications. UNet consists of two paths: an encoder and a decoder. The encoder is a typical CNN, generally consisting of a set of convolutional and pooling layers, and is used to capture the context in the image. The decoder is the symmetric expanding path that enables precise localization using transposed convolutions. MobileNetV2 [24], which is designed to maximize accuracy while being mindful of the restricted resources of on-device or embedded applications, is used as the encoder. Although deeper backbones can give more accurate results, we choose MobileNetV2 as the feature extraction backbone because it is a lightweight model suitable for on-device or embedded applications with acceptable accuracy. The output layers of the UNet differ between the two tasks: a final convolution layer with three filters for the pretext network (one per RGB channel of the reconstructed image) and a final convolution layer with one filter for the downstream task (the binary polyp mask).

The pipeline of the proposed method is as follows. First, we use an unlabeled dataset to train a UNet-based image reconstruction network to learn visual features of colonoscopy images. The transformed images are the input of the reconstruction network, and the original images serve as ground truth labels. The pretext task, image reconstruction, challenges the network to learn visual features from automatically generated pseudo-labels. Then, we transfer the learned features to the polyp segmentation network. Since both the encoder and decoder are trained simultaneously in the pretext task, the results on the segmentation task can improve. A minimal code sketch of this shared architecture follows.
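As a concrete illustration, below is a minimal Keras sketch of a UNet built on a MobileNetV2 encoder, where only the output head changes between the two tasks. The skip-connection layer names are those of the Keras MobileNetV2 implementation; the input size, decoder filter counts, and sigmoid activations are our illustrative assumptions, not settings reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_unet(out_channels, input_shape=(256, 256, 3)):
    """UNet with a MobileNetV2 encoder. Only the head differs per task:
    out_channels=3 for RGB reconstruction (pretext),
    out_channels=1 for the binary polyp mask (downstream)."""
    encoder = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=None)
    # Skip connections: a standard layer choice for a MobileNetV2-UNet;
    # the names come from the Keras MobileNetV2 implementation.
    skip_names = ['block_1_expand_relu', 'block_3_expand_relu',
                  'block_6_expand_relu', 'block_13_expand_relu']
    skips = [encoder.get_layer(n).output for n in skip_names]
    x = encoder.get_layer('block_16_project').output  # bottleneck features

    # Decoder: upsample, concatenate the matching skip, refine.
    for skip, filters in zip(reversed(skips), [512, 256, 128, 64]):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding='same')(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)

    x = layers.Conv2DTranspose(32, 3, strides=2, padding='same')(x)  # back to input size
    out = layers.Conv2D(out_channels, 1, activation='sigmoid')(x)
    return tf.keras.Model(encoder.input, out)

pretext_net = build_unet(out_channels=3)   # image reconstruction head
segment_net = build_unet(out_channels=1)   # polyp segmentation head
```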
B. Self-supervised visual feature learning

We use colonoscopy image reconstruction as the pretext task for self-supervised learning. The pretext task makes the model learn semantic features of the colonoscopy images by creating ground truth labels from known input transformations. We propose two transformation methods to generate self-supervised labels: random pixel drop and random channel drop. With the random pixel drop method, all channels of a randomly selected pixel are dropped. With the random channel drop method, the red, green, or blue channel of a randomly selected pixel is dropped. An example of transformed images with pixel drop and channel drop is shown in Fig. 2. The drop scale, i.e., the percentage of pixel values dropped, represents the pretext task complexity. We conduct experiments with different drop scales to explore the impact of pretext task complexity on the downstream task; a sketch of the two transformations follows after the figure.

Fig. 2. Example of random pixel drop and random channel drop transformations: (a) original image, (b) pixel drop transformed image, (c) channel drop transformed image.
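The two label-generation transforms can be sketched in a few lines of NumPy. The choice of zero as the fill value for dropped pixels and channels is our assumption, since the paper does not state it.

```python
import numpy as np

def random_pixel_drop(img, drop_scale=0.5, rng=np.random.default_rng()):
    """Drop all three channels of a random fraction of pixels."""
    out = img.copy()
    mask = rng.random(img.shape[:2]) < drop_scale   # HxW boolean drop mask
    out[mask] = 0                                   # assumed fill value
    return out

def random_channel_drop(img, drop_scale=0.5, rng=np.random.default_rng()):
    """Drop one randomly chosen channel (R, G, or B) per affected pixel."""
    out = img.copy()
    h, w, _ = img.shape
    mask = rng.random((h, w)) < drop_scale
    channels = rng.integers(0, 3, size=(h, w))      # which channel to drop
    ys, xs = np.nonzero(mask)
    out[ys, xs, channels[ys, xs]] = 0
    return out

# The transformed image is the network input; the original is the label:
# x_train = random_pixel_drop(image); y_train = image
```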
The image reconstruction network is UNet-based. Fig. 3 shows the image reconstruction network used in this work. The input to the network is a transformed image with randomly dropped pixels, and the original image serves as the label.

Fig. 3. The proposed self-supervised learning model.

We use SSIM loss for training the image reconstruction network. SSIM (Structural Similarity Index Measure) [21] is a perceptual image quality measure between a distorted image and a reference image. The images are divided into multiple square windows, and SSIM is computed in each window as follows:

\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}   (1)

where x and y are two nonnegative image signals that have been aligned with each other (e.g., windows extracted from each image), \mu_x and \mu_y are the means of x and y, \sigma_x^2 and \sigma_y^2 are the variances of x and y, \sigma_{xy} is the covariance of x and y, and C_1, C_2 are constants added to avoid instability. The overall quality measure of the entire image is computed as:

\mathrm{MSSIM}(X, Y) = \frac{1}{M}\sum_{j=1}^{M} \mathrm{SSIM}(x_j, y_j)   (2)

where X and Y are the reference and distorted images, respectively, and x_j, y_j are the image contents in the j-th window. SSIM loss is used to reconstruct images in accordance with human perception and is calculated as:

\mathcal{L}_{\mathrm{SSIM}}(X, Y) = 1 - \mathrm{MSSIM}(X, Y)   (3)

where X and Y are the prediction and the ground truth, respectively. The unlabeled colonoscopy image dataset is used to train the image reconstruction network. The network learns general features that capture the salient characteristics of the colonoscopy image data. The learned features are then transferred to the downstream network for polyp segmentation. A short implementation sketch of the SSIM loss follows.
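A minimal TensorFlow implementation of this loss is shown below. It relies on tf.image.ssim, which computes the window-averaged SSIM of Eq. (2), and assumes images scaled to [0, 1].

```python
import tensorflow as tf

def ssim_loss(y_true, y_pred):
    # L_SSIM = 1 - MSSIM, Eq. (3). tf.image.ssim averages SSIM over
    # local windows, matching the MSSIM of Eq. (2).
    return 1.0 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))

# Pretext training setup (Adam with lr 1e-4, as in Section IV.B):
# pretext_net.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=ssim_loss)
```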
C. Polyp segmentation using knowledge transferred from the pretext task

After the network is self-trained on colonoscopy image reconstruction, it is transferred to the polyp segmentation task. By changing the last layer of the UNet to match the number of output classes, we repurpose the colonoscopy image reconstruction network for polyp segmentation. We investigate three different transfer learning methods for the downstream task. The first method is to freeze the weights learned at all network layers except the top layer and finetune only the top layer. The second is to freeze the weights learned at the encoder and finetune only the decoder. The third is to finetune all the weights, in both the encoder and the decoder. Fig. 4 illustrates these methods; the gray area denotes freezing of the weights learned from the pretext task.

Fig. 4. Network architectures for polyp segmentation and three different ways of transfer learning: (a) freezing all network layers except the top layer and finetuning only the top layer, (b) freezing the encoder and finetuning only the decoder, (c) finetuning the whole network.

To train the polyp segmentation network, we use a public polyp segmentation dataset consisting of colonoscopy images and their corresponding pixel-level polyp masks annotated by colonoscopists. The asymmetric similarity loss function [22] is used for training the networks to address the unbalanced data problem. It is defined as:

\mathcal{L}_{\mathrm{AsymCE}} = \alpha \cdot \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{Asym}}   (4)

where \mathcal{L}_{\mathrm{CE}} is the cross-entropy loss, \mathcal{L}_{\mathrm{Asym}} = 1 - F_\beta is the asymmetric similarity loss based on the F_\beta score, and the hyperparameter \alpha controls the contribution of the cross-entropy term to the loss function. The F_\beta score is defined as:

F_\beta = (1 + \beta^2) \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}   (5)

The F_\beta score with the hyperparameter \beta generalizes the Dice similarity coefficient and the Jaccard (IoU) index: when \beta = 1, the F_\beta score is the Dice score; \beta = 2 generates the F2 score; and \beta = 0 transforms the score to precision. A code sketch of this loss follows.
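A sketch of this loss for binary masks is shown below, using soft (probabilistic) counts of true positives, false positives, and false negatives; the alpha and beta values are placeholders, as the paper does not report the settings it used.

```python
import tensorflow as tf

def asym_ce_loss(alpha=0.5, beta=1.5, eps=1e-6):
    """L_AsymCE = alpha * L_CE + (1 - F_beta), Eqs. (4)-(5).
    alpha and beta here are illustrative, not the paper's settings."""
    bce = tf.keras.losses.BinaryCrossentropy()

    def loss(y_true, y_pred):
        # Soft confusion-matrix counts over the whole batch.
        tp = tf.reduce_sum(y_true * y_pred)
        fp = tf.reduce_sum((1.0 - y_true) * y_pred)
        fn = tf.reduce_sum(y_true * (1.0 - y_pred))
        precision = tp / (tp + fp + eps)
        recall = tp / (tp + fn + eps)
        f_beta = ((1.0 + beta**2) * precision * recall
                  / (beta**2 * precision + recall + eps))
        return alpha * bce(y_true, y_pred) + (1.0 - f_beta)

    return loss
```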
IV. EXPERIMENTS AND RESULTS

A. Dataset

For training the image reconstruction network, we constructed an unlabeled colonoscopy image dataset from several sources. First, unlabeled colonoscopy images were acquired from the PACS system of Hospital 103 in Hanoi, Vietnam. These images were extracted from colonoscopy videos of patients who may or may not have polyps in the colon. After collecting and standardizing the data, we obtained 8,500 colonoscopy images of size 384×288. We also added colonoscopy images from public datasets, without using their labels: CVC-ClinicDB with 612 images, Kvasir-SEG with 1,000 images, ETIS-Larib with 196 images, and CVC-ColonDB with 300 images. In total, this gives an unlabeled dataset of 10,608 colonoscopy images, which we use for training and validating the image reconstruction network. For finetuning the polyp segmentation model, the CVC-ColonDB dataset is used. In our experiments, both datasets are split 80/10/10 for training, validation, and testing.

B. Implementation

The proposed models are implemented using Keras with the TensorFlow backend. All algorithms were trained on a PC with a GeForce GTX 1080 Ti GPU. Both the pretext network and the downstream network are updated via the Adam optimizer with a learning rate of 1e-4. The training data is divided into mini-batches, with a mini-batch size of four during the training stage. Data augmentation was performed online for polyp segmentation network training, including vertical flipping, horizontal flipping, random rotation, random scaling, random shearing, random Gaussian blurring, random brightness, and random cropping and padding. The model generated at the epoch with the minimum loss value on the validation set is the final self-supervised learning model. For the downstream task, the model generated at the epoch with the maximum Dice score on the validation set is the final polyp segmentation network. The UNet baseline for polyp segmentation was trained with exactly the same settings but initialized with random weights.

C. Pretext task results

We implement the UNet for image reconstruction with the MobileNetV2 backbone as described in Section III.A. The unlabeled colonoscopy image dataset with 10,608 images was used for training and testing the network. We use both methods discussed in Section III.B, random pixel drop and random channel drop, to generate self-supervised labels with equal probability. To understand the impact of increased pretext task complexity on the polyp segmentation task, we experiment with randomly dropping X% of the pixels in an image, where X equals 20%, 30%, 40%, 50%, 60%, 70%, and 80%.

TABLE I. ACCURACY OF IMAGE RECONSTRUCTION WITH DIFFERENT DROP SCALES

Drop scale   20%      30%      40%      50%      60%      70%      80%
Accuracy     85.07%   86.18%   87.46%   88.20%   87.27%   86.81%   84.25%

Table I presents the accuracy of image reconstruction with different drop scales on the test set. It shows that when the drop scale equals 50%, image reconstruction accuracy is highest, at 88.20%. Moreover, Fig. 5 visualizes some examples of reconstruction network predictions with a drop scale of 50%. The reconstructed images have quite good quality, but there are changes in brightness and color shifts.

Fig. 5. Examples of reconstruction network predictions.

D. Polyp segmentation results

For polyp segmentation, we apply transfer learning from the pretext network. The polyp segmentation network is finetuned using a labeled polyp segmentation dataset. First, we conducted an experiment that transfers the UNet from pretext tasks with different drop scales, to evaluate the impact of pretext task complexity on the polyp segmentation task. We use the CVC-ColonDB dataset, which consists of 300 polyp images and their corresponding pixel-level annotated polyp masks, for finetuning the segmentation network. The dataset is split 80/10/10 for training, validation, and testing. To evaluate polyp segmentation performance, we use popular image segmentation metrics: the Dice score coefficient (Dice), Jaccard index (IoU), Recall (Re), and Precision (Prec) [22]. Table II presents the performance metrics for polyp segmentation on the test set; the performance is highest when the drop scale equals 50%.

TABLE II. PERFORMANCE OF POLYP SEGMENTATION FROM PRETEXT TASKS WITH DIFFERENT DROP SCALES

Drop scale   Dice (%)   IoU (%)   Re (%)   Prec (%)
20%          81.84      71.42     76.76    90.98
30%          83.45      74.70     83.62    87.86
40%          85.91      77.10     83.85    87.16
50%          89.33      81.99     86.05    94.65
60%          82.86      72.30     78.32    90.39
70%          81.16      71.83     84.48    82.45
80%          77.16      64.02     76.73    79.68

Next, we evaluated the performance of the polyp segmentation network trained using transfer learning from the pretext task with a drop scale of 50%. We trained a UNet from scratch and finetuned UNets with the three transfer learning methods for polyp segmentation. Table III compares the performance of polyp segmentation between the UNet trained from scratch and the self-supervised learning methods on the test set. The self-supervised learning method outperforms the UNet trained from scratch in both metrics, by 5.12% in IoU and 2.72% in Dice. Even if we freeze the encoder and finetune only the decoder, we achieve accuracy comparable to training a UNet from scratch. This indicates that self-supervised learning can learn good features at the encoder that are transferable to the segmentation task. In addition, Fig. 6 shows some examples of polyp segmentation predictions generated by the different transfer learning methods; the self-supervised learning method with finetuning of all layers of the segmentation network generates the best results (a code sketch of the three finetuning regimes follows after the figure).

TABLE III. PERFORMANCE OF POLYP SEGMENTATION WITH DIFFERENT TRANSFER LEARNING METHODS

Method                        Dice (%)   IoU (%)   Re (%)   Prec (%)
Training from scratch         86.61      76.87     86.12    87.71
Finetuning the top layer      79.15      70.63     82.19    86.43
Finetuning the decoder        86.87      77.45     79.17    92.70
Finetuning the whole network  89.33      81.99     86.05    94.65

Fig. 6. Examples of polyp segmentation predictions generated by different transfer learning methods.
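The three finetuning regimes of Fig. 4 and Table III reduce to setting trainable flags before compiling. The sketch below assumes the segment_net from Section III.A with the pretext weights already copied in, and identifies encoder layers by the MobileNetV2 'block_' name prefix; this heuristic is an assumption of this sketch, not the authors' exact implementation.

```python
def set_finetune_mode(model, mode):
    """Configure which weights are updated during finetuning.
    Recompile the model afterwards for the flags to take effect."""
    if mode == 'top_only':        # (a) train only the final 1x1 convolution
        for layer in model.layers:
            layer.trainable = False
        model.layers[-1].trainable = True
    elif mode == 'decoder_only':  # (b) freeze the MobileNetV2 encoder
        for layer in model.layers:
            # 'block_' prefix is a heuristic: it misses a few stem/head
            # encoder layers (e.g. 'Conv1'), which would stay trainable.
            layer.trainable = not layer.name.startswith('block_')
    elif mode == 'all':           # (c) finetune the whole network
        for layer in model.layers:
            layer.trainable = True

# set_finetune_mode(segment_net, 'decoder_only')
# segment_net.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
#                     loss=asym_ce_loss())
```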
E. Comparison with other methods

We evaluate our proposed segmentation method on two additional independent datasets, ETIS-Larib and CVC-ClinicDB, and compare the results with recent works that use the same training and testing data scenario: ColonDB for training and ETIS-Larib and ClinicDB for testing. Our results are presented in Table IV. The Dice score of the proposed method outperforms the previous methods on both test sets, with 65.63% on ETIS-Larib and 77.25% on CVC-ClinicDB.

TABLE IV. COMPARISON WITH OTHER METHODS (DICE SCORE, %)

Method             ETIS-Larib   CVC-ClinicDB
UNet [9]           52.7         72.5
DoubleUNet [10]    56.8         74.2
UNet++ [11]        50.1         69.8
ResUNet++ [12]     36.9         63.8
PolypSegNet [23]   63.7         76.1
Proposed           65.63        77.25

V. CONCLUSION

In this article, we presented a self-supervised visual feature learning method for polyp segmentation in colonoscopy images. We adapted self-supervised visual feature learning with image reconstruction as the pretext task and polyp segmentation as the downstream task. A UNet with a MobileNetV2 backbone was used for both tasks. A highlight is that the proposed self-supervised learning method allows us to train both the encoder and the decoder in tandem in the pretext task; thus, results on the polyp segmentation task can improve. We experimented with different pixel drop percentages and different transfer learning strategies for the downstream task. Our experimental results show that the proposed method achieves a polyp segmentation accuracy better than that of a UNet trained from scratch. Moreover, when comparing results in the same training and testing scenario (ColonDB for training, ETIS-Larib and ClinicDB for testing), the Dice score of the proposed method outperforms previous methods on both test sets. In this work, we used SSIM loss for training the image reconstruction network; it reconstructed images of quite good quality. However, there are changes in brightness and color shifts in the reconstructed images, because SSIM is not sensitive to uniform biases. In the future, we will study loss functions for training the image reconstruction network to improve the performance of the proposed method. In addition, it would be interesting to extend this work to image classification and polyp detection in colonoscopy images, since these are also important tasks in colonoscopy image analysis.

REFERENCES

[1] A. M. Leufkens, M. G. H. van Oijen, F. P. Vleggaar, and P. D. Siersema, "Factors influencing the miss rate of polyps in a back-to-back colonoscopy study," Endoscopy, 44(05): 470-475, 2012.
[2] D. Vázquez, J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, A. M. López, A. Romero, M. Drozdzal, and A. Courville, "A benchmark for endoluminal scene segmentation of colonoscopy images," Journal of Healthcare Engineering, vol. 2017, 2017.
[3] L. Jing and Y. Tian, "Self-supervised visual feature learning with deep neural networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[4] L. Chen et al., "Self-supervised learning for medical image analysis using image context restoration," Medical Image Analysis, 58 (2019): 101539.
[5] J. Bernal et al., "WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians," Computerized Medical Imaging and Graphics, 43 (2015): 99-111.
[6] J. Bernal, J. Sánchez, and F. Vilarino, "Towards automatic polyp detection with a polyp appearance model," Pattern Recognition, 45.9 (2012): 3166-3182.
[7] J. Silva et al., "Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer," International Journal of Computer Assisted Radiology and Surgery, 9.2 (2014): 283-293.
[8] D. Jha et al., "Kvasir-SEG: A segmented polyp dataset," in Proc. Int. Conf. Multimedia Model., 2020, pp. 451-462.
[9] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 2015.
[10] D. Jha et al., "DoubleU-Net: A deep convolutional neural network for medical image segmentation," in 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2020.
[11] Z. Zhou et al., "UNet++: A nested U-Net architecture for medical image segmentation," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, Cham, 2018, pp. 3-11.
[12] D. Jha et al., "ResUNet++: An advanced architecture for medical image segmentation," in 2019 IEEE International Symposium on Multimedia (ISM), IEEE, 2019.
[13] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[14] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in European Conference on Computer Vision, Springer, Cham, 2016.
[15] D. Pathak et al., "Context encoders: Feature learning by inpainting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[16] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in European Conference on Computer Vision, Springer, Cham, 2016.
[17] S. Karnam, Self-Supervised Learning for Segmentation Using Image Reconstruction, Rochester Institute of Technology, 2020.
[18] A. Jamaludin, T. Kadir, and A. Zisserman, "Self-supervised learning for spinal MRIs," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, Cham, 2017, pp. 294-302.
[19] N. Tajbakhsh et al., "Surrogate supervision for medical image analysis: Effective deep learning from limited quantities of labeled data," in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), IEEE, 2019.
[20] T. Ross et al., "Exploiting the potential of unlabeled endoscopic video data with self-supervised learning," International Journal of Computer Assisted Radiology and Surgery, 13.6 (2018): 925-933.
[21] Z. Wang et al., "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, 13.4 (2004): 600-612.
[22] L. T. Thu Hong, N. Chi Thanh, and T. Q. Long, "Polyp segmentation in colonoscopy images using ensembles of U-Nets with EfficientNet and asymmetric similarity loss function," in 2020 RIVF International Conference on Computing and Communication Technologies (RIVF), IEEE, 2020, pp. 1-6.
[23] T. Mahmud, B. Paul, and S. A. Fattah, "PolypSegNet: A modified encoder-decoder architecture for automated polyp segmentation from colonoscopy images," Computers in Biology and Medicine, vol. 128, p. 104119, 2021.
[24] M. Sandler et al., "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.