2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

A Lightweight Model for Remote Sensing Image Retrieval with Knowledge Distillation and Mining Interclass Characteristics

Khanh-An C. Quan∗, Vinh-Tiep Nguyen†, Minh-Triet Tran‡
∗† University of Information Technology, Ho Chi Minh City, Vietnam
∗‡ John von Neumann Institute, Ho Chi Minh City, Vietnam
∗‡ University of Science, Ho Chi Minh City, Vietnam
∗†‡ Vietnam National University, Ho Chi Minh City, Vietnam
Email: anqck@uit.edu.vn, tiepnv@uit.edu.vn, tmtriet@fit.hcmus.edu.vn

Abstract—Remote sensing image retrieval has a growing number of practical applications in a wide variety of areas, such as land-cover analysis, ecosystem monitoring, and agriculture. It is essential to have a solution for this problem with both high accuracy and efficiency, e.g., small models and low computational cost. This motivates us to propose a lightweight model for remote sensing image retrieval. We first employ interclass characteristic mining to train a cumbersome and robust model, aiming to boost the quality of retrieval results. Then, from this complex model, we apply knowledge distillation to significantly reduce the neural network's size. Our experiments conducted on the UC Merced Land Use dataset demonstrate the advantage of our method. Our lightweight model achieves an mAP of 0.9680 with only 3.8M parameters, a higher mAP and a lower number of parameters than the EDML method proposed by Cao et al.

Index Terms—remote sensing, image retrieval, deep metric learning, knowledge distillation

I. INTRODUCTION

High-resolution remote-sensing photos have become widely available due to the advancement of technologies and remote sensors. This has opened up new opportunities for exploiting data in a range of essential applications, such as land-cover analysis [1], ecosystem monitoring [2], and agriculture [3]. In fact, visual interpretation of remote-sensing scenes is still a challenging task because researchers need to deal with high intra-class and low inter-class variability [4].

Despite deep neural networks' excellent performance, the successful management of an extensive remote-sensing database is complicated by numerous issues caused by temporal differences, viewpoints, high resolution, and different contents [5]. The ability to retrieve vast amounts of remote-sensing images is a crucial first step toward adequately managing enormous volumes of remote-sensing data [6]. Deep metric learning methods, in particular, have demonstrated remarkable success in characterizing complex remote-sensing data [7].

It should be noticed that remote-sensing data has some common characteristics shared between classes. A triplet deep metric learning network achieves relatively high results, but when visualizing some failure cases, we can see that the returned results share quite similar characteristics with the query even though some of them belong to different classes, as shown in Figure 1. This shows that the ability to distinguish between two classes that share the same characteristics is not well solved by the triplet deep metric learning network for remote-sensing images. Thus, it is necessary to improve the capability to discriminate classes for remote-sensing images.

Fig. 1. Top nearest neighbors of a query image in a triplet deep metric learning network. In the first query, the returned results share quite similar characteristics with the query image even though some of them belong to different classes (green left border if the resulting image is in the same class as the query image; otherwise, red left border).
Furthermore, it is crucial to create a lightweight network that can achieve both high accuracy for image retrieval and low computational cost, in terms of reducing the number of parameters in the learned model.

In this paper, our objective is to propose a lightweight retrieval solution for remote sensing images with two criteria: achieving a high mAP while using only a low number of parameters compared to other existing methods. We inherit the EDML method proposed by Cao et al. [8] and propose enhancements for it. Because we have two criteria for our work, our approach consists of two stages. First, we learn the features shared between classes to improve the discriminative features for the remote-sensing image retrieval problem, adopting the idea from Roth et al. [9]. Then we reduce the model complexity by applying knowledge distillation [10].

We conduct extensive experiments on a popular dataset in remote-sensing image retrieval, the UCM dataset [11]. First, by mining the features shared between classes as in [9], the retrieval performance with ResNet-101 and the Margin loss [12] improves from 0.9750 to 0.9762 in mAP. These models have higher mAP results than the original EDML method [8] at 0.9663. Then, we reduce the complexity of the robust ResNet-101 model with 43.3M parameters by transferring the learned knowledge to the compact MobileNet v2 model with only 3.8M. It is worth mentioning that, although we significantly reduce the number of parameters from 43.3M to 3.8M (more than 11 times), our model only slightly decreases in retrieval performance (mAP drops from 0.9762 to 0.9680).

The content of our paper is as follows. In Section II, we briefly present remote-sensing image retrieval, metric learning, and knowledge distillation. Our proposed method is presented in Section III. We present experimental results and ablation studies in Sections IV and V, respectively. Finally, conclusions and future work are discussed in Section VI.

II. RELATED WORKS

1) Remote-sensing image retrieval: The main objective of content-based image retrieval is to find powerful discriminative features from images. Previously, remote-sensing image retrieval methods relied on handcrafted features, which necessitate specialist expertise and take time to design. Global features such as color, texture, and shape features, as well as local features such as bag of visual words (BoVW) [11], vector of locally aggregated descriptors (VLAD) [13], and Fisher vectors (FV) [14], are widely used as image representations in remote-sensing image retrieval works. The development of deep learning has significantly advanced content-based image retrieval. Based on the learning capacity of Convolutional Neural Networks (CNNs), semantic and robust feature representations have shown better performance in remote-sensing image retrieval than traditional handcrafted features [15, 16]. The development of large-scale remote-sensing image classification and retrieval datasets, such as UCM [11], AID [17], and PatternNet [16], has also boosted the development of content-based remote-sensing image retrieval.

2) Metric learning: Deep metric learning is a technique that combines deep learning and metric learning. A deep neural network aims to learn a mapping from the input image to a feature vector in the metric space. The metric loss function is one of the most important components in deep metric learning, and such losses can be roughly categorized into two types: contrastive loss and triplet loss. Given a triplet $(x_i, x_j, x_k)$, where $x_j$ is a sample similar to the reference $x_i$ and $x_k$ is a sample dissimilar to $x_i$, the triplet loss is defined as

$l = \max(d(x_i, x_j) - d(x_i, x_k) + \alpha, 0)$,

where $\alpha$ is a margin parameter and $d(x_i, x_j)$ is the Euclidean distance.
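To make this concrete, the following is a minimal PyTorch sketch of the triplet loss above; the batch layout and the default margin value are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """l = max(d(x_i, x_j) - d(x_i, x_k) + alpha, 0).

    anchor, positive, negative: (batch, dim) embedding tensors.
    alpha: margin separating positive from negative pairs (assumed value).
    """
    d_pos = F.pairwise_distance(anchor, positive)  # Euclidean d(x_i, x_j)
    d_neg = F.pairwise_distance(anchor, negative)  # Euclidean d(x_i, x_k)
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()
```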
Many studies have shown the effectiveness of applying deep metric learning (DML) to problems such as image retrieval, visual search, and image classification. Recently, Roth et al. and Lin et al. showed that learning the features shared between classes improves the discriminative ability of the model [9, 18]. In the remote-sensing field, DML has also proven effective in problems like classification [19] and image retrieval [20]. Cao et al. [8] showed that applying DML with a triplet loss improves results on remote-sensing retrieval tasks compared to traditional deep learning methods. Cao et al. [20] showed that combining DML and a GAN can achieve promising results on a small training dataset.

3) Knowledge distillation: Knowledge distillation is a process that distills the knowledge of a cumbersome model into a lightweight model without significant performance loss. The idea is that, instead of learning from labeled data in the traditional way, the student network (the lightweight model) tries to learn to predict like the teacher network (the powerful model with a heavy architecture, or an ensemble of models). Hinton et al. [10] propose a distillation method in which a student model is trained with the objective of matching the distribution of the softmax output of the teacher model in classification problems. Tang et al. [21] demonstrated the efficiency of using the Mean Square Error (MSE) loss between the student's logits and the teacher's logits for knowledge distillation. Prior research has demonstrated that knowledge distillation is effective for semi-supervised learning [22], domain adaptation [23], and many other applications.

III. APPROACH FOR REMOTE SENSING IMAGE RETRIEVAL

The overview of our method is shown in Figure 2. Our approach contains two stages: first, we train a robust teacher model that has good discriminative ability on remote-sensing retrieval problems; then, we distill the knowledge of the cumbersome teacher network into a compact student network to reduce the complexity of the final model. As a result, we obtain a compact network that can predict good discriminative features for the remote-sensing image retrieval problem.

Fig. 2. Overview of our approach compared to conventional deep metric learning.

1) Stage 1: Training a robust teacher network: In the first stage, our primary goal is to train a robust network capable of extracting highly discriminative features. We follow deep metric learning to achieve this goal, and we propose to replace the deep metric learning used in [8] with the mining interclass characteristics (MIC) method [9]. The overview of this stage's pipeline is shown in Figure 2 (left). We evaluate the advantage of this enhancement with different backbones in Table I. In this approach, there are three main components: a feature extractor, an encoder $E_\alpha$, and an encoder $E_\beta$. For an image $x \in \mathbb{R}^{\text{Height} \times \text{Width} \times 3}$, the feature extractor extracts the image representation $f(x) \in \mathbb{R}^d$. Then, the class-discriminative encoder $E_\alpha$ and the intra-class encoder $E_\beta$ learn from the shared features $f(x)$ with different purposes. These three components are trained jointly by the standard back-propagation algorithm.
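As a rough illustration of how the three components could be wired together (a sketch under our own assumptions, not the authors' code; the ResNet-50 backbone and 128-dimensional embedding follow the setup reported later in Section IV):

```python
import torch.nn as nn
import torchvision.models as models

class MICEmbedder(nn.Module):
    """Shared feature extractor f(x) with two encoder heads:
    E_alpha (class-discriminative) and E_beta (intra-class)."""

    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        feat_dim = backbone.fc.in_features      # 2048 for ResNet-50
        backbone.fc = nn.Identity()             # keep only f(x)
        self.features = backbone
        self.e_alpha = nn.Linear(feat_dim, embed_dim)  # class-discriminative head
        self.e_beta = nn.Linear(feat_dim, embed_dim)   # intra-class head

    def forward(self, x):
        f = self.features(x)                    # f(x), shape (batch, 2048)
        return self.e_alpha(f), self.e_beta(f)
```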
The class-discriminative encoder $E_\alpha$ aims to learn how to distinguish objects of different classes. It is a fully connected layer whose output representation is made to satisfy the properties of metric learning through the metric loss function $l_\alpha$. $E_\alpha$ can be trained on the ground-truth labels provided with the training dataset.

The intra-class encoder $E_\beta$ aims to learn the characteristics shared between classes. Samples of the same class usually share many common features, such as color, context, and shape. To remove the characteristics shared within classes, a normalization guided by the ground-truth classes is applied. For each class $y$ in the training set, this approach computes the mean $\mu_y$ and standard deviation $\sigma_y$ from the features $f(x_i), \forall x_i : y_i = y$. A new standardized image representation $Z = [z_1, \cdots, z_N]$ is then obtained with

$z_i = \dfrac{f(x_i) - \mu_{y_i}}{\sigma_{y_i}}$,

where the class influence is now reduced. Afterwards, the auxiliary encoder $E_\beta$ can be trained using the surrogate labels $[c_1, \cdots, c_N]$ produced by clustering the space $Z$ in the surrogate task. The intra-class encoder $E_\beta$ is also learned with a metric loss function $l_\beta$.

Many variations of metric learning losses have been proposed recently; among them, the Margin loss [12], which adds an additional boundary parameter $\beta$, has shown promising results and is used for both $l_\alpha$ and $l_\beta$ in our approach. The Margin loss is expressed as

$l_{\text{margin}}(x_i, x_j) = \big[\alpha + \mu_{ij}\big(d(E(f(x_i)), E(f(x_j))) - \beta\big)\big]_+$,

where $d(E(f(x_i)), E(f(x_j)))$ is the Euclidean distance between the embeddings of samples $x_i$ and $x_j$; $\alpha$ and $\beta$ are parameters that determine the separation margin and the boundary between positive and negative pairs, respectively; and $\mu_{ij}$ indicates whether the samples in the pair are similar ($\mu_{ij} = 1$) or different ($\mu_{ij} = -1$).

As the two encoders share the same input $f(x)$, they will learn some similar characteristics. To reduce the characteristics shared between them and restrict the discriminative and shared characteristics to their own encoding spaces, the mutual information loss is applied:

$l_d = -\left\| E_\alpha^r(f(x)) \odot R\big(E_\beta^r(f(x))\big) \right\|_2^2$,

where $R$ is a function learned by a two-layer fully connected neural network that maps between the encoding spaces of $E_\alpha$ and $E_\beta$, $\odot$ is the element-wise (Hadamard) product, and $r$ stands for a gradient reversal layer. The main objective of this loss is to transfer non-discriminative characteristics to the intra-class encoder $E_\beta$. Finally, the total loss function used to train all components in this method is

$L = l_\alpha + l_\beta + \gamma l_d$,

where $\gamma$ weights the contribution of the mutual information loss relative to the class metric loss $l_\alpha$ and the auxiliary metric loss $l_\beta$.
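Under our reading of the formulas above, the two stage-1 losses could be sketched as follows. The gradient reversal layer is omitted and the two-layer network R is a simplified stand-in, so this is illustrative rather than a faithful reproduction of the MIC implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def margin_loss(e_i, e_j, mu_ij, alpha=0.2, beta=1.2):
    """l_margin = [alpha + mu_ij * (d(E(f(x_i)), E(f(x_j))) - beta)]_+

    mu_ij: tensor of +1 (similar pair) or -1 (dissimilar pair).
    alpha, beta: margin and boundary (values taken from Section IV-B).
    """
    d = F.pairwise_distance(e_i, e_j)
    return torch.clamp(alpha + mu_ij * (d - beta), min=0).mean()

class MutualInfoLoss(nn.Module):
    """l_d = -|| E_alpha(f(x)) (*) R(E_beta(f(x))) ||_2^2, with R a
    two-layer fully connected network mapping between the encoding
    spaces. The gradient reversal layers are omitted for brevity."""

    def __init__(self, dim=128):
        super().__init__()
        self.R = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))

    def forward(self, e_alpha, e_beta):
        # Element-wise (Hadamard) product, squared L2 norm per sample.
        return -(e_alpha * self.R(e_beta)).pow(2).sum(dim=1).mean()

# Total stage-1 objective: L = l_alpha + l_beta + gamma * l_d
```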
2) Stage 2: Distilling knowledge to the student network: In this stage, our primary goal is to transfer the representation ability of the teacher network to the student network. The overview of this stage's pipeline is shown in Figure 2 (right). For the teacher network, we use the feature extractor and the class-discriminative encoder $E_\alpha$ adopted from the first stage. The student network is a combination of a lightweight network (e.g., ResNet-18 [24], MobileNet v2 [25]) and a fully connected layer generating an embedding vector for computing similarities. To train the student model, we use the Mean Square Error loss between the teacher prediction $E_\alpha$ and the student prediction $E_{\text{student}}$:

$L_{\text{distill}} = \left\| E_\alpha - E_{\text{student}} \right\|_2^2$.
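A minimal sketch of one distillation step, assuming the teacher returns its $E_\alpha$ embedding (as in the earlier sketch) and the student outputs an embedding of the same dimension; the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, optimizer):
    """One knowledge-distillation step: MSE between the teacher's
    E_alpha embedding and the student's embedding."""
    with torch.no_grad():
        target, _ = teacher(images)   # E_alpha(f(x)); teacher is frozen
    pred = student(images)            # student embedding, same dimension
    loss = F.mse_loss(pred, target)   # ||E_alpha - E_student||_2^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```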
IV. EXPERIMENTS

A. Datasets and Evaluation Metrics

For the evaluation, we use the UC Merced Land Use (UCM) dataset [11], which is the most widely used benchmark for remote-sensing image retrieval problems. The UCM dataset contains 21 classes; each class has 100 images. All images have a size of 256 x 256 pixels, and the spatial resolution of each pixel is 0.3 m. We follow the data split that yields the best performance in [26], which randomly selects 50% of the images of each class for training and the remaining 50% for performance evaluation.

For the similarity measurement, we use the Euclidean distance between the feature vectors corresponding to the images; it is one of the most effective and widely used similarity measures in image retrieval.

We use mean average precision (mAP) and precision at K (P@K) to evaluate retrieval performance, both of which are widely used for evaluating image retrieval models. The mAP is defined as follows:

$\text{mAP} = \dfrac{\sum_{q=1}^{Q} \text{AveP}(q)}{Q}$  (1)

The definition of AveP is:

$\text{AveP} = \dfrac{\sum_{k=1}^{n} \big(P(k) \times \text{rel}(k)\big)}{\text{number of relevant images}}$  (2)

where $Q$ is the number of all images in the dataset, $P(k)$ is the precision at cut-off $k$, and $\text{rel}(k)$ is an indicator function. The precision is calculated once for every returned image and multiplied by the coefficient $\text{rel}(k)$: if the current returned image is relevant, $\text{rel}(k)$ is 1; otherwise, it is 0.
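For concreteness, a simple, unoptimized NumPy sketch of these metrics, ranking the whole database by Euclidean distance as described above; this is our own illustration, not the authors' evaluation script.

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """AveP = sum_k P(k) * rel(k) / (number of relevant images)."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(features, labels):
    """mAP over all images used as queries, ranked by Euclidean distance.

    features: (N, d) array of embeddings; labels: (N,) class labels.
    """
    labels = np.asarray(labels)
    aps = []
    for q in range(len(features)):
        dists = np.linalg.norm(features - features[q], axis=1)
        order = np.argsort(dists)[1:]   # drop the query itself (distance 0)
        aps.append(average_precision(labels[order], labels[q]))
    return float(np.mean(aps))
```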
B. Experiment setup

For the experiment environment, we use Google Colab Pro with NVIDIA Tesla P100 and NVIDIA Tesla V100 GPUs for training and testing. The maximum number of training iterations is 100 epochs. For training the teacher model, we follow the MIC setup described in the original paper [9]. Specifically, we train the model using Adam with a learning rate of 1e−5 and decrease the learning rate to 3e−5 when the training epoch reaches 50. We set the triplet parameters following [9], initializing $\beta = 1.2$ for the Margin loss and $\alpha = 0.2$ as a fixed triplet margin. For $\gamma$, we utilize values in the range [250, 2000]. During training, we randomly crop images of size 224 × 224 after resizing them to 256 × 256, followed by random horizontal flips.

After class standardization, the clustering is performed via standard k-means using the faiss framework [27]; for efficiency, the clustering can be computed on GPU. The number of clusters is fixed before training to 30 for the UCM dataset [11], and we update the cluster labels every other epoch. The model is robust to both parameters, since many values give comparable results; in Section V, we study the effect of the number of clusters and the cluster label update frequency in more detail to motivate the chosen values. Finally, class assignments by clustering, especially in the initial training stages, become near arbitrary for samples further away from cluster centers. To ensure that we do not reinforce such a strong initial bias, we follow the MIC method and ease the cluster constraint by randomly switching samples with samples from different cluster classes (with probability $p \leq 0.2$).

For the results in Table II, $E_\alpha$ and $E_\beta$ have the same dimension, which varies over 128, 256, 512, and 1024. The backbone architecture is a ResNet-50 [24] model pretrained on ImageNet. For comparison, the features extracted by a conventional triplet deep metric learning network are used as the baseline. For the knowledge distillation stage, we use the Adam optimizer with a learning rate of 1e−4, and the maximum number of training iterations is 25 epochs. We compare both ResNet-18 [24] and MobileNet v2 [25] as the backbone of the student network.
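As an illustration of the surrogate-label clustering step described in this setup, a small sketch using the faiss framework [27]; the input is assumed to be the class-standardized features Z as a NumPy array, and all names are illustrative.

```python
import numpy as np
import faiss  # [27]

def surrogate_labels(z, n_clusters=30):
    """Cluster the class-standardized features Z = [z_1, ..., z_N]
    with k-means to obtain the surrogate labels [c_1, ..., c_N]."""
    z = np.ascontiguousarray(z.astype('float32'))
    kmeans = faiss.Kmeans(z.shape[1], n_clusters, niter=20)
    kmeans.train(z)
    _, assignments = kmeans.index.search(z, 1)  # nearest centroid per sample
    return assignments.ravel()
```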
C. Results and analysis

1) Training the teacher model: The overall results on the UCM dataset with different backbones are shown in Table I. For the baseline result of each backbone (denoted as DML in Table I), we use conventional deep metric learning with the same Margin loss [12]. Our enhanced method (denoted as +MIC) achieves better performance than the baseline with a conventional triplet deep metric learning network. As the network architecture becomes more complex, the results increase noticeably. Our approach with the ResNet-101 backbone achieves the best mAP of 0.9762.

TABLE I. Overall results on the UCM dataset with different feature extractors, with an embedding size of 128. The better result of each backbone is given by the +MIC row.

Backbone     | Method | mAP    | P@5    | P@10   | P@50
ResNet-18    | DML    | 0.9188 | 0.9558 | 0.9535 | 0.8802
ResNet-18    | +MIC   | 0.9226 | 0.9589 | 0.9815 | 0.9389
ResNet-34    | DML    | 0.9577 | 0.9714 | 0.9692 | 0.9214
ResNet-34    | +MIC   | 0.9586 | 0.9733 | 0.9716 | 0.9225
ResNet-50    | DML    | 0.9712 | 0.9819 | 0.9800 | 0.9381
ResNet-50    | +MIC   | 0.9720 | 0.9846 | 0.9815 | 0.9389
ResNet-101   | DML    | 0.9750 | 0.9811 | 0.9799 | 0.9444
ResNet-101   | +MIC   | 0.9762 | 0.9823 | 0.9815 | 0.9450
MobileNet v2 | DML    | 0.8883 | 0.9440 | 0.9344 | 0.8431
MobileNet v2 | +MIC   | 0.9186 | 0.9621 | 0.9541 | 0.8723

The overall results on the UCM dataset with different embedding dimensions and the ResNet-50 backbone are shown in Table II. For the baseline of each dimension, we use conventional deep metric learning with the same Margin loss [12] and ResNet-50 as the backbone. Compared to deep metric learning with the triplet loss by Cao et al. [20], our baseline with an enhanced version of the triplet loss (Margin loss [12]) achieves an mAP of 0.9712, higher than the original method using the same ResNet-50 architecture for feature extraction.

TABLE II. Overall results on the UCM dataset with different embedding dimensions and the ResNet-50 backbone. The best result of each dimension is given by the +MIC row.

Method                 | Dim  | mAP    | P@5    | P@10   | P@50
EDML (Cao et al. [20]) | –    | 0.9663 | 0.9775 | 0.9757 | 0.9320
DML                    | 128  | 0.9712 | 0.9819 | 0.9800 | 0.9381
+MIC                   | 128  | 0.9720 | 0.9846 | 0.9815 | 0.9389
DML                    | 256  | 0.9720 | 0.9827 | 0.9812 | 0.9389
+MIC                   | 256  | 0.9750 | 0.9848 | 0.9828 | 0.9422
DML                    | 512  | 0.9749 | 0.9840 | 0.9829 | 0.9428
+MIC                   | 512  | 0.9754 | 0.9844 | 0.9830 | 0.9440
DML                    | 1024 | 0.9769 | 0.9840 | 0.9829 | 0.9435
+MIC                   | 1024 | 0.9772 | 0.9848 | 0.9830 | 0.9459

In general, MIC-based features achieve the best performance at each dimension compared to the conventional triplet deep metric learning network. In addition, the higher the dimension, the better the image retrieval performance. There is a considerable difference in performance between the EDML and MIC methods at 256 dimensions, while at 512 dimensions there is only a slight difference. It can also be seen that MIC at 1024 dimensions achieves the best results, outperforming the others on all evaluation metrics.

Qualitative results are shown in Figure 3: the class encoder $E_\alpha$ retrieves images sharing class-specific characteristics, while the auxiliary encoder $E_\beta$ finds intrinsic, class-independent object properties (e.g., direction, context). The combination retrieves images with both characteristics.

Fig. 3. Qualitative nearest-neighbor evaluation on the UCM dataset based on the $E_\alpha$ and $E_\beta$ encodings and their combination. The results show that $E_\beta$ leverages class-independent information (direction, color), while $E_\alpha$ becomes independent of those features and focuses on class detection. The combination of the two reintroduces both.

To investigate in detail, qualitative results on several difficult query cases are presented in Figure 4, which shows the top-5 retrieved images for each query using features extracted by the conventional triplet deep metric learning network and by MIC, respectively. On these cases, MIC-based features noticeably improve performance compared to conventional triplet deep metric learning. The results indicate that learning the features shared between classes can enhance remote-sensing retrieval performance.

Fig. 4. Top-5 retrieval results on UCM. Each figure part consists of two rows, and the first image in each row is the query image; the first and second rows use features extracted by the conventional triplet deep metric learning network and by MIC, respectively. Green and red left borders indicate correct and false results, respectively.

2) Distilling knowledge to the student network: In this experiment, we distill the knowledge from the ResNet-101 backbone (our best model in terms of mAP at an embedding size of 128) to the student network. The overall results on the UCM dataset with different student backbones are shown in Table III. For the baseline of each backbone, we use the result obtained with that backbone in the first stage. The student models distilled from the ResNet-101 teacher outperform their baseline results. The MobileNet v2 backbone achieves the highest mAP of 0.9680, 5.38% higher than its baseline. It is worth noting that, with 11.39 times fewer parameters, the MobileNet v2 backbone gives results only 0.82% lower than the teacher model with the ResNet-101 backbone.

TABLE III. Overall results on the UCM dataset of our model after applying knowledge distillation, compared to other methods.

Method     | Backbone     | #Params | mAP    | P@5    | P@10   | P@50
Baseline   | ResNet-101   | 43.3M   | 0.9762 | 0.9823 | 0.9815 | 0.9450
Baseline   | ResNet-18    | 11.2M   | 0.9226 | 0.9589 | 0.9535 | 0.8802
Distilled  | ResNet-18    | 11.2M   | 0.9672 | 0.9751 | 0.9725 | 0.9340
Baseline   | MobileNet v2 | 3.8M    | 0.9186 | 0.9621 | 0.9541 | 0.8723
Distilled  | MobileNet v2 | 3.8M    | 0.9680 | 0.9751 | 0.9736 | 0.9365
GCN [28]   | –            | –       | 0.6481 | –      | –      | –
SGCN [28]  | –            | –       | 0.6989 | –      | –      | –
MiLaN [28] | –            | –       | 0.9040 | 0.8712 | 0.9363 | –
EDML [8]   | VGG-16       | –       | 0.9487 | 0.9841 | 0.9775 | 0.9057
EDML [8]   | ResNet-50    | –       | 0.9663 | 0.9687 | 0.9757 | 0.9320

V. ABLATION STUDIES

In this section, we investigate the properties of our model and evaluate its components. Specifically, we study the first-stage model on the UCM dataset, including the evaluation of $E_\alpha$ as a function of the $E_\beta$ capacity, the influence of the number of clusters, and the influence of the cluster label update frequency.

To examine the number-of-clusters hyper-parameter, Figure 5 compares the performance over a range of cluster numbers. The chart depicts how the number of clusters affects the final results, implying that the quality of the latent structure recovered by the auxiliary encoder $E_\beta$ is critical for improved classification. The MIC model performs best on the UCM dataset when the number of clusters is set to 30.

Fig. 5. Influence of the number of clusters on mAP. A fixed cluster label update period was used, with an equal learning rate and consistent scheduling.

Figure 6 illustrates how the update frequency of the auxiliary labels affects the retrieval result. Frequently updating the auxiliary labels of the auxiliary encoder $E_\beta$ gives good results.

Fig. 6. Influence of the cluster label update frequency on mAP.

VI. CONCLUSION

Content-based remote-sensing image retrieval is key to the effective use of the ever-growing volume of remote-sensing images. In this paper, we show that learning the features shared between classes can enhance the retrieval performance of remote-sensing images. We evaluate the MIC method on the UCM dataset and achieve promising results compared to the conventional triplet deep metric learning network. We also achieve the second objective of this work, significantly reducing the number of parameters in the learned models by applying the knowledge distillation approach. After training a robust teacher network with ResNet-101 as the backbone, which achieves an mAP of up to 0.9762, we train a lightweight student network with ResNet-18 or MobileNet v2 as the backbone. Our best model, with MobileNet v2, achieves an mAP of 0.9680 on the UCM dataset, even higher than the mAP of EDML [8], while the model has only 3.8M parameters.

However, our approach has some disadvantages, such as the number of hyper-parameters and the training cost. The number of hyper-parameters that need to be tuned is larger than for deep metric learning with a triplet loss: the number of clusters, the frequency of cluster updates, and the weight of the adversarial loss, and these parameters depend highly on the data. Some of these hyper-parameters have a high impact on the final result. Although the number of model parameters does not increase much, the training time increases due to the clustering process. For future work, evaluating other datasets (e.g., AID [17], PatternNet [16]), different network architectures (e.g., EfficientNet [29]), and other metric loss functions and sampling methods may give more comprehensive insight.

ACKNOWLEDGMENT

Khanh-An C. Quan was funded by Vingroup Joint Stock Company and supported by the Domestic Master/PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), code VINIF.2020.ThS.JVN.07.

REFERENCES

[1] J. Kang, D. Hong, J. Liu, G. Baier, N. Yokoya, and B. Demir, "Learning convolutional sparse coding on complex domain for interferometric phase restoration," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 826–840, 2020.
[2] R. Fernandez-Beltran, F. Pla, and A. Plaza, "Sentinel-2 and Sentinel-3 intersensor vegetation estimation via constrained topic modeling," IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 10, pp. 1531–1535, 2019.
[3] J. Segarra, M. L. Buchaillot, J. L. Araus, and S. C. Kefauver, "Remote sensing for precision agriculture: Sentinel-2 improved features and applications," Agronomy, vol. 10, no. 5, p. 641, 2020.
[4] Z. Gong, P. Zhong, W. Hu, and Y. Hua, "Joint learning of the center points and deep metrics for land-use classification in remote sensing," Remote Sensing, vol. 11, no. 1, 2019.
[5] Z. Gong, P. Zhong, Y. Yu, and W. Hu, "Diversity-promoting deep structural metric learning for remote sensing scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 1, pp. 371–390, 2017.
[6] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, "When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 5, pp. 2811–2821, 2018.
[7] J. Kang, R. Fernandez-Beltran, Z. Ye, X. Tong, P. Ghamisi, and A. Plaza, "Deep metric learning based on scalable neighborhood components for remote sensing scene characterization," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8905–8918, 2020.
[8] R. Cao, Q. Zhang, J. Zhu, Q. Li, Q. Li, B. Liu, and G. Qiu, "Enhancing remote sensing image retrieval using a triplet deep metric learning network," International Journal of Remote Sensing, vol. 41, no. 2, pp. 740–751, 2020.
[9] K. Roth, B. Brattoli, and B. Ommer, "MIC: Mining interclass characteristics for improved metric learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8000–8009.
[10] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning and Representation Learning Workshop, 2015.
[11] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010, pp. 270–279.
[12] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, "Sampling matters in deep embedding learning," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
[13] S. Özkan, T. Ateş, E. Tola, M. Soysal, and E. Esen, "Performance analysis of state-of-the-art representation methods for geographical image retrieval and categorization," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 11, pp. 1996–2000, 2014.
[14] P. Napoletano, "Visual descriptors for content-based retrieval of remote-sensing images," International Journal of Remote Sensing, vol. 39, no. 5, pp. 1343–1376, 2018.
[15] Y. Li, Y. Zhang, X. Huang, and A. L. Yuille, "Deep networks under scene-level supervision for multi-class geospatial object detection from remote sensing images," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 146, pp. 182–196, 2018.
[16] W. Zhou, S. Newsam, C. Li, and Z. Shao, "PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 145, pp. 197–209, 2018.
[17] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, "AID: A benchmark data set for performance evaluation of aerial scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981, 2017.
[18] X. Lin, Y. Duan, Q. Dong, J. Lu, and J. Zhou, "Deep variational metric learning," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 689–704.
[19] G. Cheng, Z. Li, J. Han, X. Yao, and L. Guo, "Exploring hierarchical convolutional features for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 11, pp. 6712–6722, 2018.
[20] Y. Cao, Y. Wang, J. Peng, L. Zhang, L. Xu, K. Yan, and L. Li, "DML-GANR: Deep metric learning with generative adversarial network regularization for high spatial resolution remote sensing image retrieval," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8888–8904, 2020.
[21] R. Tang, Y. Lu, L. Liu, L. Mou, O. Vechtomova, and J. Lin, "Distilling task-specific knowledge from BERT into simple neural networks," arXiv preprint arXiv:1903.12136, 2019.
[22] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," arXiv preprint arXiv:1703.01780, 2017.
[23] Z. Meng, J. Li, Y. Gong, and B.-H. Juang, "Adversarial teacher-student learning for unsupervised domain adaptation," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5949–5953.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[26] F. Ye, H. Xiao, X. Zhao, M. Dong, W. Luo, and W. Min, "Remote sensing image retrieval using convolutional neural network features and weighted distance," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 10, pp. 1535–1539, 2018.
[27] J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, 2019.
[28] U. Chaudhuri, B. Banerjee, and A. Bhattacharya, "Siamese graph convolutional network for content based remote sensing image retrieval," Computer Vision and Image Understanding, vol. 184, pp. 22–30, 2019.
[29] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.