2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

Classification of anatomical landmarks from upper gastrointestinal endoscopic images*

Thanh-Hai Tran∗, Phuong-Thao Nguyen∗, Duc-Huy Tran∗, Xuan-Huy Manh∗, Danh-Huy Vu∗, Nguyen-Khang Ho∗, Khanh-Linh Do∗, Van-Tuan Nguyen∗, Long-Thuy Nguyen∗, Viet-Hang Dao†, Hai Vu∗
∗School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, Hanoi, Vietnam
†Institute of Gastroenterology and Hepatology, Hanoi Medical University Hospital, Hanoi, Vietnam

*This research is funded by Vietnam Ministry of Science and Technology under grant number KC-4.0-17/19-25.

Abstract—In this paper, we propose a framework that automatically classifies anatomical landmarks of Upper GastroIntestinal Endoscopy (UGIE). The framework aims to select the best deep neural network with respect to both classification performance and computational cost. We investigate two lightweight deep neural networks, ResNet-18 and MobileNet-V2, to learn hidden discriminant features for the multi-class classification task. In addition, because convolutional neural networks (CNNs) are data hungry, we examine various data augmentation (DA) techniques such as Brightness and Contrast Transformation (BaC), Geometric Transformation (GeoT), and Variational Auto-Encoder (VAE). The impact of these DA schemes is evaluated for both CNN models. The experiments are conducted on a self-collected dataset of 3700 endoscopic images covering 10 anatomical landmarks of UGIE. The results show outstanding performance of both models when DA techniques are applied, compared to using the original data alone. The best sensitivity is 97.43% and the best specificity is 99.71%, obtained with MobileNet-V2 and the Geometric Transformation based DA technique at a frame rate of 21 fps. These results highlight a model with significant potential for developing computer-aided esophagogastroduodenoscopy (EGD) diagnostic systems.

Index Terms—image classification, deep learning, data augmentation, endoscopic images

I. INTRODUCTION

Esophagogastroduodenoscopy (EGD) is the gold-standard procedure in the diagnosis of upper gastrointestinal (GI) diseases. Preliminary studies have shown that longer examination times may improve the detection of lesions if the captured images are taken at specific anatomical landmarks according to standardised photo-documentation guidelines. In an EGD procedure, it is important to accurately identify anatomical landmarks from the oral cavity to the duodenum in order to detect and diagnose lesions as well as to consult with colleagues. This requires an experienced specialist with several years of training. In practice, however, following the guidelines on checking all anatomical landmarks as a unified protocol is challenging due to time limitations and variation in endoscopists' skills.

Besides, with the advances of deep learning, many convolutional neural networks (CNNs) have been employed for different tasks such as lesion detection, segmentation and disease prediction in medical imaging. Most CNN models are data hungry and require enormous annotated datasets for training. However, data collection and annotation are very time consuming. As a result, automatic anatomical classification of endoscopic images can be beneficial not only in the clinical setting, making images easier for physicians to interpret, but also in reducing the burden of data cleansing and annotation. Therefore, in this study,
we aim to construct a CNN-based system that is able to correctly identify anatomical locations as a preliminary step towards the future development of a computer-aided EGD diagnostic system.

In the literature, only a few works have addressed organ classification and anatomical location recognition for the upper GI tract. On the one hand, most existing methods such as [1], [2] and [3] rely on large annotated datasets (ranging from 27,335 images [1] to 59,513 images [3]). Unfortunately, these datasets are not publicly available for transfer learning. On the other hand, these existing techniques utilize CNN models such as GoogleNet, AlexNet and SSD, which are unsuitable for real-time deployment on edge devices because they require large memory and many GFLOPs. With this in mind, in this paper we investigate two lightweight CNN models (i.e. ResNet-18 and MobileNet-V2) for the anatomical landmark classification task on upper GI endoscopic images. To overcome the lack of large datasets, we apply different data augmentation techniques, namely Brightness and Contrast Transformation (BaC), Geometric Transformation (GeoT), and Variational Auto-Encoder (VAE).

In summary, the contribution of this paper is three-fold. First, we deploy two light-weight CNN models for real-time classification of anatomical landmarks of the upper GI tract from endoscopic images. Second, we develop and study the impact of three data augmentation techniques. Finally, we evaluate the performance of these models on a self-collected dataset in terms of specificity, sensitivity and running time.

In the remainder of this paper, we present related works in section II. In section III, we describe our proposed framework, revisit ResNet-18 and MobileNet-V2, and present the data augmentation techniques. Experiments and conclusions are presented in sections IV and V, respectively.

II. RELATED WORKS

In the literature, a number of methods have been proposed to support or assist doctors in different tasks of examining endoscopic images. In particular, Convolutional Neural Networks (CNNs), one of the most successful deep learning techniques, have shown very impressive results on endoscopy and colonoscopy images, for example in polyp detection [4], bleeding detection in wireless capsule images [5], and finding early neoplasia in Barrett's esophagus [6]. However, only a few works address anatomical landmark classification of the upper GI tract from endoscopic images. Takiyama et al. [1] utilized the convolutional neural network GoogleNet [7] to classify parts of the upper gastrointestinal tract with a data source including 27,335 training images and 17,081 test images. The model achieved an accuracy of up to 97%, showing the potential of large-scale neural networks for the classification of regions of the gastrointestinal tract. In a similar report, Qi He et al. [8] investigated five convolutional neural networks, including ResNet-50, Inception-v3, VGG-11, VGG-16 and DenseNet, trained on 5,661 upper GI tract images with 10 divisions extending from the esophagus to the duodenum plus an unidentified label. Their results show that all networks achieved good accuracy (ranging from 80% to 90%) under different test conditions. In another study on the gastrointestinal tract, Yamada et al. [9] designed a system to assist physicians directly during endoscopy in detecting intestinal lesions using Faster R-CNN and
VGG-16 deep learning networks. They trained these networks on more than 4,000 lesion images and 134,000 non-lesion images. The results of the comparative test showed that their artificial intelligence system achieved a sensitivity and specificity of 97% and 99%, respectively, higher than the 87% and 96% achieved by a doctor. In addition, the system processes an image in 0.022 s, much faster than the 2.4 s per image needed by a doctor. In [10], the authors compared recent deep learning models (EfficientNet-B1 and SE-ResNet-50) with other deep learning models (ResNet-50 and DenseNet-121). The endoscopic imaging data used to train and test the models had been expertly classified into esophagus, stomach and duodenum. The purpose of that study was to propose a classification model that can be used in clinical practice. In [11], Vu et al. developed a diagnosis assistant system for labelling stomach anatomical locations in upper GI examinations. They first utilized a simplified version of AlexNet to coarsely classify the images into seven major anatomical locations, then a Graphical User Interface was developed to help doctors/endoscopists specify thirteen detailed locations. This work helps to significantly reduce the labelling time for trainee endoscopists. In [12], the authors investigated different hand-crafted features for lesion and non-lesion upper GI images.

All of the above works validated the performance of anatomical landmark classification image by image in an offline phase. In this work, we aim at developing an assisted system for upper GI image analysis that adapts to real-time computation. Consequently, we study more light-weight yet efficient neural networks and improve their performance with various data augmentation techniques.

III. PROPOSED METHODS

A. General framework

The proposed framework is illustrated in Fig. 1. It involves three main phases:
• Training phase: Firstly, we augment the annotated data with three techniques, Brightness and Contrast Transformation, Geometric Transformation and Variational Auto-Encoder, to deal with the paucity of data. Then, the two light-weight models ResNet-18 and MobileNet-V2 are trained with the augmented data.
• Testing phase: Endoscopic images passed through the trained CNN models are classified into one of ten anatomical positions. This step aims at evaluating the performance of the studied models in terms of sensitivity, specificity and computational time.
• Anatomical landmark classification from continuous video streams: The CNN model that satisfies the trade-off between precision, memory requirement and computational time is selected for real-time classification of anatomical landmarks from upper GI video streams. Some ablation studies of the results will be reported.

B. Deep learning based classification

a) ResNet-18: ResNet stands for Residual Network, a CNN architecture proposed in [13]. ResNet addresses the degradation problem, exposed when a network gets deeper and begins to converge: accuracy becomes saturated and then degrades quickly. It does so by adding residual blocks to the network. A residual block is a stack of layers arranged so that the output of a layer is taken and added to another layer deeper in the block. ResNet has been shown to outperform many other CNN models in image classification tasks. ResNet-18 is ResNet with 18 layers. It is the most lightweight architecture in the ResNet family and is often chosen when a real-time application is needed. The architecture of ResNet-18 consists of
five sequential layers, a fully connected layer and a final softmax layer. The first layer is a 7x7 convolutional layer. Each of the next four sequential layers contains two residual blocks, which means that ResNet-18 has eight residual blocks in total. A residual block comprises two 3x3 convolutional layers, each followed by a batch normalization layer.

Fig. 1. Our proposed framework for anatomical landmarks classification from upper GI images.

b) MobileNet-V2: MobileNet-V2, introduced in [14], is a deep neural network architecture based on the Inverted Residual Structure. It is one of the most common lightweight CNN architectures designed to perform well on mobile devices and in edge computing. The shortcut connections in MobileNet-V2 are arranged so that the number of channels (or depth) at the input and output of each residual block is constricted; that is why these are called bottleneck layers. The middle layers in a block perform the nonlinear transformation, so they need to be thicker to create more transformations. Shortcut connections between blocks are applied to the input and output bottlenecks, not to the intermediate layers. Therefore, the input and output bottleneck layers only need to record the result and do not need to perform the nonlinear transformation. Between the layers of an inverted residual block, depthwise separable convolutions are used to minimize the number of model parameters, which is also what keeps the MobileNet-V2 model small. MobileNet-V2 has 154 layers, in which 1x1 and 3x3 convolutions, bottleneck operations and average pooling are applied.

c) Training the CNN models: To classify the images, we take advantage of transfer learning. For ResNet-18, we use the PyTorch deep learning framework, while MobileNet-V2 is developed in TensorFlow. For both models, all layers were fine-tuned using Adam [15], a method for stochastic optimization, with an initial learning rate of 0.001 decayed by a factor of 0.5 when the validation loss stopped improving. We resized our images to 224x224 and 128x128 to ensure compatibility with ResNet-18 and MobileNet-V2, respectively.

C. Augmentation techniques

Data augmentation (DA) is an essential step to enrich the dataset for training data-hungry CNN models. Different DA techniques have been proposed in the literature. In this work, we deploy the two most common and simple DA techniques, which are Brightness and Contrast Transformation and Geometric Transformation. We also designed a Variational Auto-Encoder to generate new images. In the following, we describe our implementation of each DA technique.

a) Brightness and Contrast Transformation: Two important characteristics of an image are brightness and contrast (BaC). By changing these characteristics, we can obtain new images which highlight some local structures inside the image and improve its overall quality. In our work, we modify the brightness and the contrast of the images in our dataset to obtain more images to train our CNN models. The formula for modifying image brightness and contrast is: I_bc(x, y) = c * I_org(x, y) + b, where I_org(x, y) is the original image, I_bc(x, y) is the generated image, and c and b are two coefficients that change the contrast and the brightness of the original image, respectively. In our implementation, b ∈ {0.85, 1.15} * Ī_org and c ∈ {0.85, 1.15}, where Ī_org is the average brightness of the original image. Given the four pairs of values (b, c), we generate four new images from each original image, augmenting the number of original training samples four times. Fig. 2 illustrates an original image (a) and four generated images with modified brightness and contrast (b, c, d, e).
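A minimal sketch of this brightness and contrast augmentation is given below (assuming 8-bit images stored as NumPy arrays; the helper names and the clipping to the valid pixel range are illustrative assumptions rather than the exact implementation):

```python
import numpy as np

def brightness_contrast(img, c, b_factor):
    """Apply I_bc(x, y) = c * I_org(x, y) + b, with b = b_factor * mean(I_org)."""
    img = img.astype(np.float32)
    b = b_factor * img.mean()                      # b in {0.85, 1.15} * average brightness
    out = c * img + b                              # c in {0.85, 1.15}
    return np.clip(out, 0, 255).astype(np.uint8)   # clipping to [0, 255] is an added assumption

def augment_bac(img):
    """Return the four brightness/contrast variants generated from one image."""
    return [brightness_contrast(img, c, b) for c in (0.85, 1.15) for b in (0.85, 1.15)]
```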
Fig. 2. a) An endoscopy image and the generated images using brightness and contrast modification (b, c, d, e).

b) Geometric Transformation: Geometric Transformation (GeoT) means the geometry of an image is changed without altering its actual pixel values. In this work, we deploy different geometric transformations such as scaling, translating, flipping and shearing. Flipping means reversing the image pixels horizontally or vertically. Shearing an image means shifting some parts of the image in one direction and other parts in another direction. Scaling resizes an image to make it bigger or smaller in the x- and/or y-direction according to the formula: x' = s_x * x, y' = s_y * y. Translating shifts the object horizontally or vertically by some defined offset with the formula: x' = x + δ_x, y' = y + δ_y, where (x, y) and (x', y') are the coordinates in the original and the generated images, respectively. In our implementation, we set s_x, s_y ∈ {0.8, 1.2} and δ_x, δ_y ∈ {0.1, 0.3}. Fig. 3 illustrates an original image (a) and four newly generated images with vertical flipping (b), scaling (c), shearing (d) and translating (e).

Fig. 3. a) An endoscopy image and four generated images by vertical flipping (b), scaling (c), shearing (d) and translating (e).

c) Variational Auto-Encoder: A Variational Auto-Encoder (VAE) is an auto-encoder whose coding distribution is regularized during training to ensure that its latent space has good properties that allow us to generate new data. A VAE encodes the input as a distribution over the latent space instead of encoding it as a single point. The model is then trained in four steps. First, the input is encoded as a distribution over the latent space; then, a point is sampled from that distribution. Next, the sampled point is decoded and the reconstruction error is calculated; finally, the reconstruction error is propagated back through the network. The VAE architecture designed in our work is shown in Fig. 4. We randomly choose (x, y) from the latent space, and for each (x, y) coordinate we generate a new image. Fig. 5 illustrates an original image (a) and four new generated images with different (x, y) (b, c, d, e).

Fig. 4. VAE architecture implemented in our work.

Fig. 5. a) An endoscopy image and its generations using VAE.
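To make the geometric transformations described above concrete, a minimal sketch using OpenCV affine warps is given below; interpreting the offsets δ_x, δ_y as fractions of the image size and the particular shear amount are illustrative assumptions, not the exact implementation:

```python
import cv2
import numpy as np

def augment_geo(img, sx=1.2, sy=1.2, dx=0.1, dy=0.1, shear=0.1):
    """Generate geometric variants of one frame: vertical flip, scaling, shearing, translation.
    Scaling follows x' = s_x * x, y' = s_y * y; translation follows x' = x + dx, y' = y + dy."""
    h, w = img.shape[:2]
    flipped = cv2.flip(img, 0)                              # vertical flip
    scale = np.float32([[sx, 0, 0], [0, sy, 0]])            # s_x, s_y in {0.8, 1.2}
    scaled = cv2.warpAffine(img, scale, (w, h))
    shear_m = np.float32([[1, shear, 0], [0, 1, 0]])        # horizontal shear (illustrative amount)
    sheared = cv2.warpAffine(img, shear_m, (w, h))
    trans = np.float32([[1, 0, dx * w], [0, 1, dy * h]])    # offsets taken as fractions of image size
    translated = cv2.warpAffine(img, trans, (w, h))
    return [flipped, scaled, sheared, translated]
```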
IV. EXPERIMENTS

A. Datasets

The evaluation dataset was collected from patients in Vietnam's hospitals. It consists of 3700 images with a default resolution of 1280x1024, covering 10 anatomical locations along the upper GI tract during EGD: larynx, esophagus, cardia, gastric body, fundus, pylorus, great curvature, lesser curvature, duodenum bulb and duodenum. These images were categorized by doctors with many years of experience, resulting in 370 images per location. Fig. 6 shows the 10 categories and their positions.

Fig. 6. Illustration of ten anatomical landmarks of the upper GI tract and the corresponding images in our dataset.

We split the data into training and testing sets with a ratio of 4:1. As our dataset is still small, we used augmentation techniques to enrich the training set. Each original endoscopic image in the training dataset was augmented up to 12 times (4 times by Brightness and Contrast Transformation - BaC, 4 times by Geometric Transformation - GeoT and 4 times by Variational Auto-Encoder - VAE). In total, the number of training images increases to 44,400, while the test set contains 740 images.

B. Experimental results

The experiments were conducted by comparing the performance of the ResNet-18 and MobileNet-V2 models when trained with the original dataset versus the data generated by the above data augmentation techniques. We compare their performance in terms of sensitivity, specificity, accuracy and computational time. Table I shows the comparison of overall accuracy, sensitivity and specificity (the proportions of correct positive and correct negative predictions, respectively) between the original dataset and the different data augmentation techniques.

TABLE I. COMPARATIVE RESULTS OF CLASSIFICATION

                    ResNet-18                     MobileNet-V2
Data augmentation   Acc(%)  Sens(%)  Spec(%)      Acc(%)  Sens(%)  Spec(%)
without             93.65   93.65    99.29        96.89   96.89    99.65
GeoT                95.00   95.00    99.44        97.43   97.43    99.71
BaC                 95.95   95.95    99.55        96.35   96.35    99.59
VAE                 91.35   91.35    99.04        96.62   96.62    99.62
GeoT+BaC            95.95   95.95    99.55        97.43   97.43    99.71

We observe that with all data augmentation techniques except VAE, the results improve compared to using only the original dataset. The best ResNet-18 performance is obtained with Brightness and Contrast Transformation, with a sensitivity of 95.95% and a specificity of 99.55%. When no augmentation technique is applied, the sensitivity and specificity are only 93.65% and 99.29%, respectively; they rise to 95.00% and 99.44% with Geometric Transformation. When we combine the two techniques, the result is identical to Brightness and Contrast Transformation alone. Similar to ResNet-18, the performance of MobileNet-V2 is improved when using the BaC and GeoT data augmentation techniques: its sensitivity and specificity slightly increase from 96.89% to 97.43% and from 99.65% to 99.71%, respectively. GeoT gives better results because the translation, shearing and flipping operations create new images with more viewing angles, consistent with the contractions that cause distortion during gastroscopy.

For both models, the sensitivity and specificity with VAE are lower than with the original dataset. The reason is that the images generated by VAE have very low quality and resolution. Furthermore, as Fig. 5 shows, the VAE network could not reconstruct the detailed characteristics of each landmark, so the CNNs misclassified more images.

We notice that all of the larynx, esophagus and cardia images were classified correctly by the ResNet-18 model. It most often confuses the body and great curvature sites: six body images are classified as great curvature and seven images of the great curvature are predicted as body. For MobileNet-V2, all of the larynx, esophagus and duodenum images were classified correctly. The MobileNet-V2 model also often confuses the body and great curvature sites, because these two landmarks are close together and share the same tissue characteristics: five body images are classified as great curvature and three images of the great curvature are predicted as body. The misidentified examples are shown in Fig. 9.

Fig. 9. Some misidentified images by the two networks. a) Body misidentified as great curvature. b) Great curvature misidentified as body.
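For reference, the per-class sensitivity and specificity discussed below (Fig. 7 and Fig. 8) can be computed in a one-vs-rest manner; the following is a minimal sketch assuming ground-truth and predicted labels stored as integer arrays, not the exact evaluation code used in this work:

```python
import numpy as np

def per_class_sens_spec(y_true, y_pred, num_classes=10):
    """One-vs-rest sensitivity TP/(TP+FN) and specificity TN/(TN+FP) per landmark class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sens, spec = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fn = np.sum((y_pred != c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        tn = np.sum((y_pred != c) & (y_true != c))
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return sens, spec
```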
Fig. 7 and Fig. 8 show the detailed comparative sensitivity and specificity of each class, respectively. Overall, the results of MobileNet-V2 are better than those of ResNet-18.

Fig. 7. Comparative sensitivity of each class for ResNet-18 with Brightness and Contrast Transformation vs. MobileNet-V2 with Geometric Transformation.

Fig. 8. Comparative specificity of each class for ResNet-18 with Brightness and Contrast Transformation vs. MobileNet-V2 with Geometric Transformation.

Moreover, we examine the performance of the ResNet-18 and MobileNet-V2 models on a continuous video stream of the upper GI tract. When the endoscope moves quickly through the 10 anatomical locations, the identification results are not stable. However, when it stops at a specific position, we obtain accurate and stable identification results. The model outputs the landmark on each frame, as shown in Fig. 10. We used a Tesla P100 PCIE GPU with 16GB RAM to process the video. The comparative performance of the two models is given in Table II.

TABLE II. COMPARATIVE PERFORMANCE OF RESNET-18 AND MOBILENET-V2

                   ResNet-18   MobileNet-V2
Parameter memory   45MB        16MB
Feature memory     23MB        38MB
GFLOPS             1.83        0.2
FPS                10          21

Fig. 10. Landmark classification from a continuous video stream. We extract some frames from the video with overlaid labels. We observe the consistency of the classification results across consecutive frames.

V. CONCLUSIONS

We have presented a framework that consists of two main components: an offline data augmentation stage, including Geometric Transformation, Brightness and Contrast Transformation and a Variational Auto-Encoder, and CNN models to learn hidden features of the anatomical landmark classes. We have studied two light-weight models (ResNet-18 and MobileNet-V2) and compared their performance with the three data augmentation techniques. Experimental results show that the best performance is obtained with MobileNet-V2, with the highest sensitivity and specificity of 97.43% and 99.71%, respectively. Single or combined DA techniques help to improve both models. We also evaluated MobileNet-V2 for landmark classification from video streams and showed its potential for deployment in real-time applications (21 fps). In the future, we will address the confusion of some landmarks with a fine-grained classification based on texture and color, and deploy a computer-assisted upper GI analysis system on edge devices.

REFERENCES

[1] H. Takiyama, T. Ozawa, S. Ishihara, M. Fujishiro, S. Shichijo, S. Nomura, M. Miura, and T. Tada, "Automatic anatomical classification of esophagogastroduodenoscopy images using deep convolutional neural networks," Scientific Reports, vol. 8, no. 1, pp. 1–8, 2018.
[2] Z. Xu, Y. Tao, Z. Wenfang, L. Ne, H. Zhengxing, L. Jiquan, H. Weiling, D. Huilong, and S. Jianmin, "Upper gastrointestinal anatomy detection with multi-task convolutional neural networks," Healthcare Technology Letters, vol. 6, no. 6, pp. 176–180, 2019.
[3] S. Igarashi, Y. Sasaki, T. Mikami, H. Sakuraba, and S. Fukuda, "Anatomical classification of upper gastrointestinal organs under various image capture conditions using AlexNet," Computers in Biology and Medicine, vol. 124, p. 103950, 2020.
[4] X. Zhang, F. Chen, T. Yu, J. An, Z. Huang, J. Liu, W. Hu, L. Wang, H. Duan, and J. Si, "Real-time gastric polyp detection using convolutional neural networks," PLoS ONE, vol. 14, no. 3, p. e0214133, 2019.
[5] M. Sharif, M. Attique Khan, M. Rashid, M. Yasmin, F. Afza, and U. J. Tanik, "Deep CNN and geometric features-based gastrointestinal tract diseases detection and classification from wireless capsule endoscopy images," Journal of Experimental & Theoretical Artificial Intelligence, pp. 1–23, 2019.
[6] R. Hashimoto, J. Requa, D. Tyler, A. Ninh, E. Tran, D. Mai, M. Lugo, N. E.-H. Chehade, K. J. Chang, W. E. Karnes et al., "Artificial intelligence using convolutional
neural networks for real-time detection of early esophageal neoplasia in Barrett's esophagus (with video)," Gastrointestinal Endoscopy, 2020.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[8] Q. He, S. Bano, O. F. Ahmad, B. Yang, X. Chen, P. Valdastri, L. B. Lovat, D. Stoyanov, and S. Zuo, "Deep learning-based anatomical site classification for upper gastrointestinal endoscopy," International Journal of Computer Assisted Radiology and Surgery, vol. 15, no. 7, pp. 1085–1094, 2020.
[9] M. Yamada, Y. Saito, H. Imaoka, M. Saiko, S. Yamada, H. Kondo, H. Takamaru, T. Sakamoto, J. Sese, A. Kuchiba et al., "Development of a real-time endoscopic image diagnosis support system using deep learning technology in colonoscopy," Scientific Reports, vol. 9, no. 1, pp. 1–9, 2019.
[10] J.-W. Park, Y. Kim, W.-J. Kim, and S.-J. Nam, "Automatic anatomical classification model of esophagogastroduodenoscopy images using deep convolutional neural networks for guiding endoscopic photodocumentation," Journal of the Korea Society of Computer and Information, vol. 26, no. 3, pp. 19–28, 2021.
[11] H. Vu, X. H. Manh, B. Q. Duc, V. K. Ha, V. H. Dao, P. B. Nguyen, B. L. Hoang, and T. H. Vu, "Labelling stomach anatomical locations in upper gastrointestinal endoscopic images using a CNN," in Proceedings of the Tenth International Symposium on Information and Communication Technology, 2019, pp. 362–369.
[12] D.-H. Vu, L.-T. Nguyen, V.-T. Nguyen, T.-H. Tran, V.-H. Dao, and H. Vu, "Boundary delineation of reflux esophagitis lesions from endoscopic images using color and texture," in Proceedings of the Fourth International Conference on Multimedia Analysis and Pattern Recognition, IEEE, 2021.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[14] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[15] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.