An Application Improving the Accuracy of Image Classification

Pham Tuan Dat, Faculty of Information Technology, Vietnam Maritime University, Hai Phong, Vietnam, datpt@vimaru.edu.vn
Nguyen Kim Anh, Faculty of Information Technology, Vietnam Maritime University, Hai Phong, Vietnam, anhnk@vimaru.edu.vn
Abstract—There have been various research approaches to the problem of image classification so far. For image data containing objects in the wild, many machine learning algorithms give unreliable results. Deep learning networks, by contrast, are appropriate for big data and can deal with the problem effectively. Therefore, this paper builds an application that combines a ResNet model with image manipulation to improve the accuracy of classification. The classifier completes its training phases on CIFAR-10 in a feasible time and achieves around 93% accuracy on the test data. This result is better than those of some recently published studies.

Keywords—classification, convolutional, residual, augmentation, cutmix, normalization

I. INTRODUCTION

Social networks store and manage a massive volume of information on the Internet. To meet the needs of users, they have to provide useful applications. Given a keyword, a search service needs to find relevant information on the same subject quickly and accurately. Relevant information does not contain only text; it also includes images. A challenge for such applications is to develop an effective mechanism that classifies patterns into the same subject when they represent the same kind of object.

In fact, image classification is not a new problem, and machine learning algorithms have long been applied to it. For instance, K-Nearest Neighbors and Support Vector Machines solve handwritten digit classification on MNIST very well [12]. However, many conventional algorithms achieve poor performance on data sets such as CIFAR-10 and CIFAR-100, which contain objects in the wild [12, 13].

In recent years, deep learning networks have overcome the weaknesses of machine learning algorithms: they can be trained on big data and reach optimal training results. A remaining problem is that, as the number of layers increases, deep networks generate more training errors and the accuracy becomes saturated. ResNet is a typical deep learning network whose key component, the residual block, copes with this degradation problem [1, 2]. Residual blocks reduce the drawback and allow ResNet to achieve impressive accuracy even when layers are added.

On the other hand, performance does not depend only on the network architecture but also on the data. A lack of data diversity makes deep learning networks work inefficiently. By modifying the patterns of the training data, augmented images represent a more comprehensive data set [6]. Consequently, image augmentation minimizes the difference between the patterns of the training data and those of the validation and test data.

Therefore, the objective of this paper is to propose an application combining a ResNet model and image manipulation to improve the accuracy of classification on CIFAR-10. The estimated accuracy of the classifier is around 93% on the test set, which is better than the results of the CNN baseline and of Attentive CutMix ResNet-34.

II. THEORETICAL BACKGROUND

A. Image Augmentation

In some cases, deep learning networks give very high accuracy on the training data but produce unreliable results on the test data. Image augmentation is a solution to this situation: it generates new data from the original data while the new patterns keep the nature of the original ones. Thanks to the resulting data diversity, deep learning networks decrease validation errors and increase test accuracy.

There are two practical approaches to image augmentation: image manipulation and deep learning. The experiments in this paper, like the published studies [7, 8, 9], apply image manipulation to the problem of image classification, so only an overview of image manipulation is presented here. Image manipulation needs a small amount of memory to transform and store data, and it has a lower computational cost than the deep learning approach. Generally, image manipulation [6] includes geometric transformations, color jitter, mixing images, and several other techniques.

Typical geometric transformations are shifting, flipping, cropping, and rotation. When images are taken in the wild, they do not contain only informative regions of objects, so a classifier sometimes predicts the labels of patterns incorrectly. Cropping can reduce this source of confusion. The use of geometric transformations does not guarantee effectiveness for every data set, however. For a data set of letters and digits, rotation or flipping changes the shapes of the patterns, so their labels are classified incorrectly. For images of objects in the wild, by contrast, rotation or flipping does not lose the labels of the patterns. In Fig. 1, observers can see the same kind of object in the images after a series of transformations.

Fig. 1. A series of translations and rotations applied to a pattern.

Color jitter is another image manipulation technique. For the problem of letter classification, images of letters are relatively simple and are usually converted into binary images, so color jitter is not really necessary. By contrast, images of objects in the wild are much more sophisticated, and poor image quality reduces the effectiveness of classification. In this case, color jitter can bring noticeable benefits for data augmentation. Color jitter consists of brightness change and hue and saturation adjustment. Brightness change makes dark images brighter; over-saturated images look artificial, whereas many real images have impure colors. Hence, the brightness, saturation, and hue of such images need to be adjusted.
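As an illustration, such an augmentation function can be sketched with the torchvision transforms named above (Section III.A states that the application uses the vision library of PyTorch and rotates patterns by -5° to 5°). The crop padding and jitter strengths below are illustrative values, not the exact settings used in the experiments.

```python
# A minimal augmentation pipeline with torchvision. The +/-5 degree rotation
# follows the paper; padding and jitter strengths are illustrative assumptions.
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                              # horizontal flipping
    T.RandomCrop(32, padding=4),                                # random cropping (CIFAR-10 images are 32x32)
    T.RandomRotation(degrees=5),                                # random rotation in [-5, 5] degrees
    T.ColorJitter(brightness=0.2, saturation=0.2, hue=0.05),    # color jitter
    T.ToTensor(),
])

# Validation and test images are only converted to tensors; no augmentation is applied.
eval_transform = T.Compose([T.ToTensor()])
```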
Mixing images has been seen as a promising technique for data augmentation: it combines patterns into new training instances. CutMix [7] is a typical example of this technique. For each pair of images, it replaces a removed region of the first image with a patch from the second image, and the ground-truth labels are mixed in proportion to the area of the patches. The new training instances of CutMix do not lose their nature, in contrast to some regional dropout strategies [10, 11]. However, CutMix is unable to capture the most informative regions of images. Attentive CutMix [8] adjusts the strategy of CutMix: it takes a 7×7 grid map of the first image and picks the top N attentive patches (with N at most 15). These patches are pasted onto the second image at their respective original locations (the images have the same size).

B. Batch Normalization

Training neural networks might become ineffective when they encounter high learning rates or too small weights [14, 16] during back-propagation. The networks lose their learning ability, and their performance does not improve. An ordinary solution to the vanishing gradient problem is to use ReLU and choose small learning rates, but this is not good enough. Batch Normalization (BN) [5] is a better alternative: it normalizes the input data and speeds up the convergence of learning networks. In fact, BN stabilizes the growth of the parameters during the training phases, so networks can work with a broader range of learning rates without the risk of divergence.

In practice, BN does not normalize the entire training set at once. Instead, it splits the training set into mini-batches and calculates the mean and the variance over each mini-batch, as described in (1) and (2). Afterward, BN normalizes each activation as in (3), and each normalized activation becomes the input of the transformation in (4):

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i    (1)

\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2    (2)

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}    (3)

y_i = \gamma \hat{x}_i + \beta    (4)

For deep learning networks such as CNNs [3], BN operates as a layer, which usually goes with ReLU functions and convolutional layers. In such networks, a convolutional layer can receive BN(x) as its input instead of x.

There are opposing viewpoints about the link between BN and internal covariate shift (ICS) [17], and about the link between BN and the exploding gradient problem [14]. One viewpoint indicates that the use of BN improves the accuracy of networks but does not decrease ICS in several test cases. Another shows that adding BN layers may exacerbate the exploding gradient problem [14]. Nonetheless, the experiments in [17] do not deny a clear improvement in terms of gradient change and loss variation for VGG networks. Furthermore, BN allows a VGG network to achieve acceptable results on the test data under different learning rates.
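The conv-BN-ReLU pattern described above can be sketched as follows in PyTorch; the channel counts are illustrative, not the exact values of the paper's model.

```python
# A minimal conv -> BN -> ReLU stack, showing BN operating as a layer between a
# convolution and the ReLU activation. Channel counts are illustrative.
import torch
import torch.nn as nn

conv_bn_relu = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),    # per-channel mini-batch normalization, then gamma/beta as in (3)-(4)
    nn.ReLU(inplace=True),
)

x = torch.randn(8, 3, 32, 32)   # a mini-batch of eight 32x32 RGB images
y = conv_bn_relu(x)             # output shape: (8, 64, 32, 32)
```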
C. The Overview of ResNet

As mentioned above, the vanishing gradient problem in learning networks can be addressed by a solution such as BN. However, there are still difficulties in optimizing deep learning networks. The degradation problem is exposed when the depth of networks increases: the networks generate more training errors and the accuracy becomes saturated. In this situation, over-fitting [15] is not the reason.

ResNet is a deep learning network that overcomes the degradation problem. It shares the idea of LSTM and the components of CNN, but it does not have gates controlling the data flow in its units. ResNet builds residual blocks in which the activation of any deeper block is the sum of the activation of a shallower block and a residual function. Kaiming He and his partners investigate the benefits of identity shortcuts [1, 2], which make ResNet reach higher accuracy. ResNet consists of residual blocks, each of which has the overall structure illustrated in Fig. 2. In one residual block, ReLU activations and weight layers are placed alternately. To accelerate the convergence of ResNet, batch normalization may be inserted into each block. Moreover, ResNet also includes pooling layers.

Fig. 2. A residual block.

Let x_l and f(F(x_l) + h(x_l)) denote the input of the l-th residual block and the output of this block, respectively. F is a residual function, which includes two or three convolutional layers; if F contains only one layer, it brings fewer advantages. The identity mapping is h(x_l) = x_l, and f is a ReLU function. From these hypotheses, the authors show that the output of any deeper unit L is the summation of the output of a shallower unit l and the outputs of all intermediate residual functions. In an extremely deep network, when the identity mapping in the l-th unit is replaced with h(x_l) = \lambda_l x_l, the authors obtain the following equation:

x_L = \left(\prod_{i=l}^{L-1} \lambda_i\right) x_l + \sum_{i=l}^{L-1} \left(\prod_{j=i+1}^{L-1} \lambda_j\right) F(x_i, W_i)    (5)

The factor \prod_{i=l}^{L-1} \lambda_i in (5) can be exponentially large if all \lambda_i > 1, and exponentially small if all \lambda_i < 1 (for i from l to L-1). This causes exploding or vanishing gradients during back-propagation. On the contrary, if all \lambda_i = 1, the gradient does not vanish in any layer even when the weights are arbitrarily small [2]. Other shortcut techniques do not perform better than identity shortcuts: exclusive gating generates more test errors than identity shortcuts in ResNet-110, and 1×1 convolutional shortcuts also give poor performance for ResNet-110. Identity shortcuts do not take extra parameters and do not add much computational complexity. ResNets can be trained with standard optimization algorithms, and they are easy to implement with basic libraries without much modification.
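Such a residual block with an identity shortcut can be sketched as follows; the layer sizes are illustrative and the sketch is consistent with Fig. 2 rather than a reproduction of the paper's exact model.

```python
# A minimal residual block with an identity shortcut. Two 3x3 convolutions form
# the residual function F; BN follows each convolution, and the shortcut adds
# the block input back before the final ReLU. The channel count is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))   # first conv -> BN -> ReLU
        out = self.bn2(self.conv2(out))         # second conv -> BN
        return F.relu(out + x)                  # identity shortcut, then ReLU

y = ResidualBlock(64)(torch.randn(2, 64, 32, 32))   # output keeps shape (2, 64, 32, 32)
```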
III. EXPERIMENT AND COMPARISON

A. The Application and Network Model

To build the ResNet and the experimental application, this paper uses the Python language and the necessary libraries such as PyTorch, Keras, etc. As shown in Fig. 3a, the application consists of data augmentation, training, and classification functions. Before the training phase, the patterns in the training set are augmented to minimize the difference between the patterns of the training and validation data. After the training phase finishes, the classifier can predict the output for the test set.

Fig. 3. (a) The functions of the application; (b) the ResNet model.

Fig. 3b represents the ResNet, which contains six convolutional blocks and three residual blocks; it resembles an abridged version of ResNet-18. Although this ResNet has fewer residual blocks, the two models differ insignificantly in the number of filters of each convolutional layer. Besides, when training on CIFAR-10, the ResNet takes less time than ResNet-18. In the ResNet, each convolutional block includes one convolutional layer, while each residual block includes two convolutional layers and an identity shortcut. Every block has at least one BN layer and one ReLU activation, but only several blocks have pooling layers. In each block, convolutional layers and BN layers are placed alternately. The 3×3 convolutional layers have from 64 to 512 filters. The last layer of the model acts as a fully connected layer, which converts the data of the previous layers into one-dimensional data; from that, the classifier estimates the output labels.

Like other learning networks, the ResNet needs to integrate with an optimizer, which allows the training process to decrease the number of training errors and validation errors and thus increases the accuracy on the test set. In this model, the application chooses Adam [4]. The image manipulation function combines transformations including horizontal flipping, random cropping, random rotation, and color jitter. Geometric transformations change the directions and shapes of patterns, while color jitter adjusts the brightness, saturation, and hue of patterns. Patterns are randomly rotated by small angles in the range of -5° to 5°. The application uses the vision library of PyTorch to implement the image manipulations on the training data, and this function takes a short time to finish the task.

B. The Experiments and Comparison

The application experiments with the ResNet classifier on CIFAR-10, which contains 60000 samples divided into three sets (training, validation, and test data) in the ratio 4:1:1. The validation set is used for tuning hyperparameters in the training phases and improving the performance on the test set. The effectiveness of the classifier is evaluated by the loss and the accuracy on both the training set and the test set.

The position of the BN layers in the blocks leads to slightly different outcomes on the validation set. If BN is executed first in the blocks, the accuracy of the ResNet increases stably during the training phase; if BN is executed after the convolutional layers, the ResNet produces fluctuating accuracy in the middle epochs, as illustrated in Fig. 5. Nonetheless, the first choice does not give better overall results on the validation data, and the accuracy on the test data also decreases slightly.

Fig. 4. The loss of the ResNet during a training phase.

Fig. 5. The accuracy of the ResNet on the validation set.

As shown in Fig. 4, the training error reduces quickly, so after the phase finishes, the training loss is approximately 6%. In other words, the ResNet gives very high classification accuracy on the training set (over 98%). This result does not reflect the benefits of image manipulation, because the non-augmentation ResNet also gives an extremely low loss (below 0.5%). In Table I, the results show a small increase in the accuracy of the ResNet on the validation data. Furthermore, the loss of the ResNet is much lower than that of the non-augmentation ResNet on the validation set (0.26 compared with 0.45), and the accuracy of the ResNet increases by 3.4% on the test data (from 89.6% to 93.0%).

TABLE I. IMAGE MANIPULATION IMPROVES THE ACCURACY OF CLASSIFICATION
Classifier              | Loss on training data | Accuracy on validation data
Non-augmentation ResNet | 0.005                 | 0.895
ResNet                  | 0.060                 | 0.918

According to the confusion matrix in Fig. 6, the misclassification rates of the ten classes are low. The maximal correct classification rate belongs to the automobile class (over 0.96). In contrast, the minimal correct classification rate belongs to the cat class (over 0.84), because the ResNet confuses many cat objects with dog objects.

Fig. 6. The confusion matrix of the ResNet on the test data.

To make the comparison fair, the application compares the ResNet with a CNN, keeping the proportions of the three sets of CIFAR-10 unchanged. Both classifiers take the augmented images as the training data. The CNN has convolutional layers, max-pooling layers, a fully connected layer, and some BN layers. The 3×3 convolutional layers of this network also have from 64 to 512 filters.

In the experiment, the ResNet outperforms the CNN on both the training and test data, as shown in Table II. Although the CNN classifier converges quickly in the first half of the training phase, its accuracy on the validation data becomes saturated in the last epochs, and it finally achieves 87% accuracy on the test data. Meanwhile, after 30 epochs, the ResNet reaches around 93% accuracy.

TABLE II. COMPARING THE RESNET WITH THE CNN
Classifier | Loss on training data | Accuracy on test data
ResNet     | 0.06                  | 0.930
CNN        | 0.14                  | 0.870

Applying mixing images to ResNet also improves the accuracy of classification on CIFAR-10. From the reports of a recent study [8], the Mixup method has the least effective performance, but it still gains a 1.58% accuracy improvement over the baseline method. Attentive CutMix is able to capture the most informative regions of images, so its accuracy improvement exceeds that of CutMix (3.28% versus 1.63%). Consequently, CutMix ResNet-34 gains only 88.75% accuracy while Attentive CutMix ResNet-34 gains 90.40% accuracy (Table III). However, mixing images does not bring more advantages than geometric transformation and color jitter.

TABLE III. APPLYING MIXING IMAGES TO RESNET-34
Method           | Accuracy on test data
Attentive CutMix | 0.9040
CutMix           | 0.8875
Mixup            | 0.8870
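The experimental protocol described above can be sketched as follows. The 4:1:1 split, the Adam optimizer, and the 30-epoch budget come from the paper; the stand-in model, learning rate, and batch size are illustrative assumptions, and in practice the augmentation pipeline shown earlier would be attached to the training subset only.

```python
# A minimal sketch: pool the 60000 CIFAR-10 samples, re-split them 4:1:1, and
# train with Adam and cross-entropy loss. The tiny stand-in model is not the
# paper's ResNet; learning rate and batch size are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, random_split
import torchvision
import torchvision.transforms as T

full_data = ConcatDataset([
    torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=T.ToTensor()),
    torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=T.ToTensor()),
])
train_set, val_set, test_set = random_split(full_data, [40000, 10000, 10000])  # 4:1:1 ratio
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

model = nn.Sequential(                              # stand-in model, not the paper's ResNet
    nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam optimizer, as in the paper
criterion = nn.CrossEntropyLoss()

for epoch in range(30):                             # the paper reports ~93% after 30 epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```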
IV. CONCLUSION

This paper builds an application combining a ResNet model and image manipulation to improve the accuracy of classification on CIFAR-10. The experiments in the paper and the reports from recently published studies show that the use of geometric transformation and color jitter is a suitable alternative to mixing images. The ResNet achieves high image classification accuracy, around 93% on the test data. The classifier obtains an accuracy increase of 3.4% over the non-augmentation ResNet, and this growth outweighs that of Attentive CutMix ResNet-34.

REFERENCES
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition", IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Identity Mappings in Deep Residual Networks", European Conference on Computer Vision, September 2016.
[3] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Li Wang, Gang Wang, Jianfei Cai, and Tsuhan Chen, "Recent Advances in Convolutional Neural Networks", Elsevier, October 2017.
[4] Diederik P. Kingma and Jimmy Lei Ba, "Adam: A Method for Stochastic Optimization", ICLR, 2015.
[5] Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Proceedings of the 32nd International Conference on Machine Learning, vol. 37, July 2015.
[6] Connor Shorten and Taghi M. Khoshgoftaar, "A Survey on Image Data Augmentation for Deep Learning", Journal of Big Data, 2019.
[7] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo, "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features", International Conference on Computer Vision, August 2019.
[8] Devesh Walawalkar, Zhiqiang Shen, Zechun Liu, and Marios Savvides, "Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification", International Conference on Acoustics, Speech and Signal Processing, May 2020.
[9] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz, "Mixup: Beyond Empirical Risk Minimization", ICLR, April 2018.
[10] Terrance DeVries and Graham W. Taylor, "Improved Regularization of Convolutional Neural Networks with Cutout", arxiv.org/abs/1708.04552, November 2017.
[11] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang, "Random Erasing Data Augmentation", arxiv.org/abs/1708.04896, November 2017.
[12] Sonika Dahiya, Rohit Tyagi, and Nishchal Gaba, "Comparison of ML Classifiers for Image Data", https://easychair.org/publications/preprint_open/KnC4, July 2020.
[13] Karttikeya Mangalam and Vinay Prabhu, "Do Deep Neural Networks Learn Shallow Learnable Examples First?", Proceedings of the Workshop on Identifying and Understanding Deep Learning Phenomena at the 36th International Conference on Machine Learning, 2019.
[14] George Philipp, Dawn Song, and Jaime G. Carbonell, "Gradients Explode - Deep Networks are Shallow - ResNet Explained", International Conference on Learning Representations, 2018.
[15] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, 2014.
[16] Yoshua Bengio, Patrice Simard, and Paolo Frasconi, "Learning Long-Term Dependencies with Gradient Descent is Difficult", IEEE Transactions on Neural Networks, February 1994.
[17] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry, "How Does Batch Normalization Help Optimization?", 32nd Conference on Neural Information Processing Systems, 2018.