2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

U-Net Semantic Segmentation of Digital Maps Using Google Satellite Images

Loi Nguyen-Khanh 1,2, Vy Nguyen-Ngoc-Yen 1,2, Hung Dinh-Quoc 1,2
1 Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
2 Vietnam National University Ho Chi Minh City, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
(nkloi, vy.nguyen2711, hung.dinh)@hcmut.edu.vn

Abstract—Satellite images form an enormous data warehouse and show, from fine detail to the overall view, what is happening on the earth’s surface. These images are essential for agricultural development research, urban planning, and surveying, and especially in telecommunications for evaluating the placement of broadcast stations and as input to coverage and signal-quality simulation. Analyzing large amounts of complex satellite imagery is challenging, and evolving semantic segmentation approaches based on convolutional neural networks (CNNs) can assist with it. In this paper, we introduce an approach for constructing digital maps from a dataset provided by Google. We use the Efficient U-Net architecture, which combines EfficientNet, namely EfficientNet-B0, as the encoder to extract geographic features with U-Net as the decoder to reconstruct the detailed feature map. We evaluate our models on Google satellite images and measure their performance in terms of Dice loss and categorical cross-entropy.

Index Terms—Satellite Images, Digital Maps, Image Segmentation, Semantic Segmentation, EfficientNet, U-Net

I. INTRODUCTION

Digital maps store information on different types of terrain and are used to analyze map elements for tasks such as road detection, forest and building mapping, forestry research, and urban planning [1]. The authors in [2] performed satellite image segmentation and classification using a convolutional neural network (CNN) with five labels: trees,
vacant land, roads, buildings, and water. They proposed two pixel-by-pixel CNN methods, single-CNN and multiple-CNN. In addition, the study incorporated an averaged classification method to improve accuracy. Using data taken from DeepGlobe, reference [3] proposed stacked U-Nets for road extraction, with a hybrid loss function to address class imbalance in the training data. Another approach [4] proposed the attention dilation-LinkNet (AD-LinkNet) neural network, which uses an encoder-decoder structure, combined parallel-serial convolution, channel-wise attention, and an encoder pretrained for semantic segmentation. Alternatively, Lim et al. [5] proposed an ensemble of CNNs with encoder-decoder architectures, namely the single short network (SSN), single long network (SLN), and double long network (DLN), which differentiate between ground and background in order to compare topographic changes between two images. Kuo et al. [6] proposed a deep aggregation network for land cover classification, which extracts and combines multi-layer features during the segmentation process and introduces soft and graph-based semantic techniques to improve segmentation performance.

978-1-6654-1001-4/21/$31.00 ©2021 IEEE

Although there are many approaches to satellite image analysis and their results are very positive, most of them identify only a single object class. Reference [2] identified five classes but did not combine the results into a complete digital map, and its classes are quite simple. Moreover, relying on available datasets, almost all of which are not updated, reduces the significance of the analysis results for practical applications. The concern is therefore to find a regularly updated source of high-quality imagery, together with processing methods that aggregate the analysis into a digital map accurate enough to meet the needs of applications.

Most recent studies use datasets provided by the Deep
Globe [1], [4], [6], [7], [12] for surface segmentation tasks such as road detection; building, ship, grass, and water segmentation; and detecting topographic changes over time. Audebert et al. [8] exploited data from OpenStreetMap and showed that this data source can be integrated effectively into deep learning models. Kaiser et al. [9] used OpenStreetMap as reference data for semantic segmentation, classifying buildings and roads with CNN architectures. Although OpenStreetMap provides free data, its updates are generally limited because they depend on user contributions [10]. DeepGlobe provides a satellite imagery dataset that can be used to study specific tasks such as road extraction, building detection, and land cover classification [1], [3], [4], [6], [7], but it is not frequently updated. These data sources are therefore poorly suited to building a digital map.

Fig. 1: Satellite images provided by Google with regular updates. (a) Fast-growing areas, imagery date: 6/28/2021. (b) Slow-growing areas, imagery date: 6/4/2020.

U-Net architectures are normally considered among the most powerful tools for image segmentation [14]. To further improve segmentation accuracy, Weng et al. [15] proposed a U-Net variant, NAS-UNet, which stacks downsampling and upsampling cells on a U-like backbone network. Many other approaches are inspired by the U-Net architecture. The authors in [16] designed the Res-UNet model based on ResNet’s ability to process complex images. The authors in [17] built a U-Net with a VGG11 encoder to segment images. Reference [14] compared the U-Net architecture with the following encoders: VGG11, VGG13, VGG16, VGG19, ResNet18, DenseNet121, InceptionV3, and InceptionResNetV2. Mingxing Tan and Quoc V. Le [18] studied a model scaled up from a ConvNet baseline, called EfficientNet, and determined that balancing the depth, width and
resolution would increase accuracy and improve performance compared to previous ConvNets in image classification. This network model has versions ranging from B0 to B7 with different scaling coefficients, and EfficientNet-B7 was shown to achieve the highest accuracy. Baheti et al. [19] proposed the Efficient U-Net architecture, which combines EfficientNet functioning as the encoder with a U-Net decoder to create a detailed segmentation map; EfficientNet-B7 achieved the highest accuracy on their test set.

Inspired by the above discussion, this paper aims to develop an efficient architecture for classifying satellite images. To this end, we first collect image data from Google and label the images, then propose an effective approach to segment satellite images, and finally build an initial digital map. To obtain high classification accuracy, we develop a model that classifies satellite images into 12 classes by combining the EfficientNet [18] and U-Net [23] segmentation architectures.

II. SYSTEM DEVELOPMENT

A. Data Collection and Manual Label

Due to the ever-changing nature of human activities and the laws of nature, satellite imagery changes constantly, so it makes little sense to build digital maps from existing datasets that are not regularly updated. Meanwhile, Google maintains an enormous and regularly updated collection of satellite orthoimagery. An example is shown in Fig. and Fig. Taking advantage of these data, this work focuses on segmenting satellite images from Google to build digital maps.

Fig. 2: Examples of tiler layouts and zoom coefficients

We download the satellite images in JPEG format from the Google server through the Google tiler. We manually label the satellite images with the following classes: street, tree, water, residential, urban, buildings, industrial and commercial, vacant land in urban, sparse forest park, grass, agricultural, sparse urban.

B. Description of Applied Architectures

In this section, we summarize the encoder-decoder architecture for semantic segmentation, with EfficientNet-B0 as the encoder and U-Net as the decoder.

1) Encoder-Decoder Architecture: The encoder-decoder architecture includes a CNN that extracts features from the input image; typical choices are modern neural networks such as ResNet [16] and VGG [17]. However, these networks reduce the width and height of the input image to produce the final feature map, so it is challenging to rebuild a segmentation map at the size of the original image. The decoder contains a set of layers that upsample the feature map of the encoder to restore spatial information. A simple encoder-decoder network for semantic segmentation is shown in Fig.

Fig. 3: Examples of tiler layouts and zoom coefficients

Fig. 4: Architecture of the Efficient U-Net

2) Feature Extraction: Convolutional neural networks are commonly developed under a fixed resource budget and then scaled up to improve model performance. Depth scaling is the most common way to capture richer and more complex features [20]. However, arbitrarily increasing the depth makes training more difficult and may fail to improve, or may even reduce, model performance [21]. The same holds for width and resolution. Tan et al. [18] proposed a new scaling method that uniformly scales all dimensions: depth, width, and resolution. They used neural architecture search to design a new baseline network and scaled it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. The family includes models B0 to B7, each with different scaling ratios and numbers of parameters. The basic building block of EfficientNet is the Mobile Inverted Bottleneck Convolution (MBConv) [22], shown in Fig. 5.

Fig. 5: Architecture of EfficientNet-B0 with MBConv as basic building blocks

Here, the architecture is divided into seven blocks based on filter size, stride, and number of channels. Different
EfficientNet models have different numbers of MBConv blocks. From EfficientNet-B0 to EfficientNet-B7, the depth, width, resolution, and overall model size increase, which raises the number of parameters and makes the model stronger, so accuracy also gradually improves [18]. However, because of limited tool support and limited computational capacity for handling a large number of parameters, the larger models take considerable work and time to process; we therefore restrict the encoder experiments to EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2.

3) Network Architecture: U-Net is one of the most powerful network architectures for fast and precise image segmentation, first published in 2015 for biomedical image segmentation [23]. It consists of an encoder and a decoder that together form a ‘U’ shape. The encoder, or contracting path, is a typical convolutional network with convolution, activation, and pooling layers that capture the features of the input image. Along the encoder, spatial information (height and width) decreases while feature information increases. The decoder, or expansion path, combines feature and spatial information through a series of convolutions and concatenations with the high-resolution features from the contracting path. In the original U-Net, the expansion path is almost symmetrical with the contracting path [23].

In our research, we propose to use EfficientNet as the encoder instead of a set of conventional convolution layers. The decoder module is similar to that of the original U-Net. Details of the proposed architecture are illustrated in Fig. 4. The input image size is 1024 × 1024. The detailed architecture of the blocks in the encoder can be found in Fig. 5. First, we bilinearly upsample the feature map of the last logits in the encoder by a factor of two, then concatenate it with the feature map from the encoder that has the same spatial resolution. This is followed by × convolution layers before the result is again upsampled by a factor of two. This
process is repeated until a segmentation map of the same size as the original input image is recovered. The proposed architecture is asymmetric, unlike the original U-Net: the contracting path is deeper than the expansion path. Using a powerful CNN such as EfficientNet as the encoder improves the overall performance of the algorithm [19].

C. Loss Functions

Loss functions play an essential role in determining model performance, and different loss functions can be used under various circumstances [13]. In this study, we select three loss functions suitable for the model:

1) Dice Loss: is a measure of overlap between corresponding pixel values of the prediction and the ground truth, and is widely used to assess segmentation performance [20]. The Dice loss is defined as:

$$L_{DL}(y, \hat{y}) = 1 - \frac{2\sum_{i=1}^{n} y_i \hat{y}_i + \epsilon}{\sum_{i=1}^{n} y_i + \sum_{i=1}^{n} \hat{y}_i + \epsilon} \tag{1}$$

Here $\hat{y}$ is the predicted set of pixels and $y$ is the ground truth. The term $\epsilon$ is added to the numerator and denominator so that the function is not undefined in edge cases such as $y = \hat{y} = 0$ [13].

2) Categorical Cross Entropy: is a measure of the difference between two probability distributions for a given random variable or set of events. It is widely used for classification, especially pixel-level classification [13]:

$$L_{CCE}(y, \hat{y}) = -\sum_{i=1}^{n} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic}) \tag{2}$$

where $C$ is the number of classes, $y_{ic}$ is 1 if and only if sample $i$ belongs to class $c$, and $\hat{y}_{ic}$ is the output probability that sample $i$ belongs to class $c$.

3) Average Loss: combines the two losses above with equal weights and is given by:

$$L = \frac{1}{2}\left(L_{DL} + L_{CCE}\right) \tag{3}$$

III. EXPERIMENTAL RESULTS

We tested the original U-Net decoder with different encoder backbones: VGG11 [17], ResNet18 [16], EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2. The results are shown in Table I. We use the loss functions outlined above to evaluate the models. EfficientNet-B0 gives the best results: a categorical cross-entropy loss of 1.110, a Dice loss of 0.731, and an average loss of 0.997. At the same time, EfficientNet-B0 has only 4M parameters, fewer than any of the other models, which simplifies computation and reduces effort and processing time.

To test the Efficient U-Net B0 network model, we use 1,317 images for training and 304 images for validation. The data annotation tool we use is CVAT, which is provided by the OpenVINO Toolkit. In the training process, we set the learning rate to 0.0001, as shown in Fig. The test result is shown in Figure.

Fig. 6: Test graph in Efficient U-Net B0 network model

TABLE I: Results for Comparison of Various Encoder Architectures with Loss Functions

U-Net with backbone              VGG11   ResNet18  EffNet-B0  EffNet-B1  EffNet-B2
Categorical cross-entropy loss   1.194   1.191     1.110      1.374      1.134
Dice loss                        0.770   0.770     0.731      0.806      0.747
Average loss                     1.066   1.065     0.997      1.204      1.018
Total params                     32M     18M       4M         6.5M       8M

Fig. 7: Results of semantic segmentation on the Google dataset with the proposed architecture. The first column shows the input images depicting different scenarios from unstructured environments. The second and third columns show the ground truth and the predicted segmentation map, respectively, where different colors signify different classes.

IV. CONCLUSION

Developing a semantic segmentation architecture to analyze the geographic structures in satellite imagery is a very challenging but meaningful task in real-world applications. This paper has conducted the segmentation of satellite images into 12 classes. We have considered a segmentation method, the Efficient U-Net architecture, which uses EfficientNet as the encoder to extract features and U-Net as the decoder to rebuild detailed feature maps. Although it has fewer parameters than the other structures, EfficientNet-B0 still gives the best results in Table I.

ACKNOWLEDGEMENTS

This research is funded by Ho Chi Minh City University of Technology - VNU-HCM under grant number T-ÐÐT-202045. We acknowledge the support of time and facilities from Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for this study.

REFERENCES

[1] I. Demir et al., “Deepglobe 2018: A challenge to parse the earth through satellite images,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), May 2018, pp. 172209.
[2] M. Längkvist, A. Kiselev, M. Alirezaie, and A. Loutfi, “Classification and segmentation of satellite orthoimagery using convolutional neural networks,” Remote Sensing, vol. 8, no. 4, p. 329, Apr. 2016.
[3] T. Sun, Z. Chen, W. Yang, and Y. Wang, “Stacked U-Nets with multioutput for road extraction,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 1871874.
[4] M. Wu, C. Zhang, J. Liu, L. Zhou, and X. Li, “Towards accurate high resolution satellite image semantic segmentation,” IEEE Access, vol. 7, pp. 55609-55619, 2019.
[5] K. Lim, D. Jin, and C. Kim, “Change detection in high resolution satellite images using an ensemble of convolutional neural networks,” in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 509-515.
[6] T. Kuo, K. Tseng, J. Yan, Y. Liu, and Y. F. Wang, “Deep aggregation net for land cover classification,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 2472474.
[7] S. Aich, W. van der Kamp, and I. Stavness, “Semantic binary segmentation using convolutional networks without decoders,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
[8] N. Audebert, B. Le Saux, and S. Lefevre, “Joint learning from earth observation and OpenStreetMap data to get faster better semantic maps,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
[9] P. Kaiser, J. D. Wegner, A. Lucchi, M. Jaggi, T. Hofmann, and K.
Schindler, “Learning aerial image segmentation from online maps,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 11, pp. 6054–6068, 2017.
[10] J.-F. Girres and G. Touya, “Quality assessment of the French OpenStreetMap dataset,” Trans. GIS, vol. 14, no. 4, pp. 435–459, 2010.
[11] N. Baghdadi, C. Mallet, and M. Zribi, QGIS and Generic Tools. London, England: ISTE, 2018.
[12] K. Zhao, J. Kang, J. Jung, and G. Sohn, “Building extraction from satellite images using mask R-CNN with building boundary regularization,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
[13] S. Jadon, “A survey of loss functions for semantic segmentation,” in 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2020.
[14] S. W. Chang and S. W. Liao, “KUnet: Microscopy image segmentation with deep unet based convolutional networks,” in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), 2019.
[15] Y. Weng, T. Zhou, Y. Li, and X. Qiu, “NAS-Unet: Neural Architecture Search for Medical Image Segmentation,” IEEE Access, vol. 7, pp. 44247–44257, 2019.
[16] Z. Chu, T. Tian, R. Feng, and L. Wang, “Sea-land segmentation with res-UNet and fully connected CRF,” in IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium, 2019.
[17] V. Iglovikov and A. Shvets, “TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation,” arXiv [cs.CV], 2018.
[18] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional Neural Networks,” arXiv [cs.LG], 2019.
[19] B. Baheti, S. Innani, S. Gajre, and S. Talbar, “Eff-UNet: A novel architecture for semantic segmentation in unstructured environment,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[21] S. Zagoruyko and N. Komodakis, “Wide Residual Networks,” in Proceedings of the British Machine Vision Conference 2016, 2016.
[22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[23] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Lecture Notes in Computer Science, Cham: Springer International Publishing, 2015, pp. 234–241.
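
As an illustration of the three loss functions described in Section II-C, the following NumPy sketch computes the Dice loss, the categorical cross-entropy, and their average for one-hot labels and predicted class probabilities. It is not the authors' training code. The function names, the smoothing default `eps=1.0`, the clipping constant `tiny`, and the equal weighting in `average_loss` are assumptions made for illustration.

```python
import numpy as np


def dice_loss(y_true, y_pred, eps=1.0):
    """Dice loss over all pixels; eps is an assumed smoothing term."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    intersection = np.sum(y_true * y_pred)
    # eps keeps the ratio defined when both inputs are all zeros.
    return 1.0 - (2.0 * intersection + eps) / (
        np.sum(y_true) + np.sum(y_pred) + eps
    )


def categorical_cross_entropy(y_true, y_pred, tiny=1e-12):
    """Sum over pixels i and classes c of -y_ic * log(yhat_ic)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from zero so the logarithm stays finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), tiny, 1.0)
    return -np.sum(y_true * np.log(y_pred))


def average_loss(y_true, y_pred):
    """Equal-weight mean of the two losses (assumed weighting)."""
    return 0.5 * (
        dice_loss(y_true, y_pred) + categorical_cross_entropy(y_true, y_pred)
    )


if __name__ == "__main__":
    # Two pixels, three classes: one-hot labels and predicted probabilities.
    labels = np.array([[1, 0, 0], [0, 1, 0]])
    probs = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]])
    print("Dice loss:", dice_loss(labels, probs))
    print("Categorical cross-entropy:", categorical_cross_entropy(labels, probs))
    print("Average loss:", average_loss(labels, probs))
```

In this sketch the cross-entropy is summed rather than averaged over pixels. Deep learning frameworks usually average it over the batch, so absolute values are not directly comparable with Table I.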