Kỹ thuật điều khiển & Điện tử (Control Engineering & Electronics)

A TRAFFIC SIGN RECOGNITION SYSTEM WITH CONVOLUTIONAL NEURAL NETWORK

Luong Cong Duan1,*, Nguyen Hong Kiem2, Nguyen Ngoc Minh1

Abstract: In this research, we apply a Convolutional Neural Network (CNN) [1][2] to the task of traffic sign recognition. This work is the foundation for our continuing research on self-driving vehicles. A CNN is a multistage architecture that can learn features automatically. We used the TensorFlow library and Python as the main tools for testing. After research and testing, the architecture reached 91.1% accuracy.

Keywords: Traffic Sign Recognition, Convolution Neural Network, CNN, Self-Driving

INTRODUCTION

Our long-term goal is self-driving vehicles, and traffic sign identification is one of our first studies toward it. Traffic sign identification applies to many areas of traffic, such as notifying drivers of changes in road signal information, warning about violations when joining traffic, and automated driving. Traffic signs usually differ clearly from one another, but the number of sign types is quite large. In addition, image quality is strongly affected by viewing angle, lighting, occlusion, color fading, and speed of movement. In this paper, our aim is to build a test identifier that ignores the most difficult conditions; these will be addressed in further research.

In this paper, we use a basic dataset called German Traffic Sign [3]. This dataset was used a few years ago in the GTSRB (German Traffic Sign Recognition Benchmark) competition. It provides more than 50,000 sample pictures in 43 different classes: speed limits, dangerous curves, slippery road… The best result in the competition, 99.46% of signs correctly recognized, was achieved by the IDSIA team using the Committee of CNNs method [3].

Traditional methods for traffic sign recognition generally consist of two tasks: detection
and classification. Detection is handled first with computationally inexpensive, hand-crafted algorithms. Classification is then performed on the detected candidates with more expensive but more accurate algorithms. Hand-crafted features, also called shallow features, are not discriminative enough as databases grow larger, and generic deep features should push recognition performance further. Classification has been approached with a number of popular methods such as Neural Networks [4] and Support Vector Machines [5]. In one such approach, global sign shapes are first detected with various heuristics and color thresholding, and the detected windows are then classified using a different Multi-Layer Neural Net for each type of outer shape. These neural nets take 32x32 inputs and have at most 30, 15 and 10 hidden units for each of their layers. While using a similar input size, the networks used in the present work have orders of magnitude more parameters.

Current popular algorithms mainly use convolutional neural networks to perform both feature extraction and classification [6]. Experiments have shown that CNNs have many advantages in recognition problems. A variety of CNN variants have been proposed for GTSRB. Pierre Sermanet and Yann LeCun [7] fed both the high-level and low-level features extracted by different convolutional layers to the fully-connected layers. This method combined global invariant features with local detailed ones and achieved an accuracy of 99.17%.

118 L C Duan, N H Kiem, N N Minh, “A traffic sign recognition system… neural network.” Nghiên cứu khoa học công nghệ

Based on this information, we chose the CNN as the basic method for the traffic sign recognition task. A CNN is a biologically-inspired, multilayer feed-forward architecture that can learn multiple stages of invariant features using a combination of supervised and unsupervised learning. Each stage is composed of a (convolutional) filter bank layer, a non-linear
transform layer, and a spatial feature pooling layer. The spatial pooling layers lower the spatial resolution of the representation, thereby making the representation robust to small shifts and geometric distortions, similarly to “complex cells” in standard models of the visual cortex [8]. CNNs are generally composed of one to three stages, capped by a classifier composed of one or two additional layers.

Figure. Typical CNN architecture (Wikipedia)

After building the architecture, we need a method to optimize the loss function. One of the most popular methods is Gradient Descent [1][9]. Gradient descent minimizes an objective function $J(\theta)$, parameterized by a model’s parameters $\theta \in \mathbb{R}^d$, by updating the parameters in the opposite direction of the gradient of the objective function $\nabla_\theta J(\theta)$ with respect to the parameters. The learning rate determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.

Currently, many libraries and programming languages support machine learning programming and training. Building on its machine learning background, Google has created an open source library called TensorFlow. It has a flexible architecture that allows users to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API [10]. We decided to use this library for our project.

Figure. Gradient descent on a series of level sets

NETWORK ARCHITECTURE

The architecture used in the present work departs from the traditional CNN [5] by the use of connections that skip layers, and by the use of pooling layers with different subsampling ratios for the connections that skip layers and for those that do not.

Tạp chí Nghiên cứu KH&CN quân sự, Số 53, 02 - 2018 119

We ran the test a number of times and, for now, have selected an architecture with the following stages:

Name
    Description
Inputs data
    [batch, 32, 32, 3] YUV data
1st stage
    Input = inputs data
    Conv1 + ReLU: kernel size = 5, layer width = 108
    Channel Y connects to 100 kernels
    Channels U and V connect to the remaining kernels
    Max pooling: kernel size =
    Output = “conv1”
2nd stage
    Input = “conv1”
    Conv2 + ReLU: kernel size = 3, layer width = 200
    Max pooling: kernel size =
    Output = “conv2”
3rd stage
    Combine “conv1(flatten)” with “conv2(flatten)”
    Input = concat “conv1(flatten)” and “conv2(flatten)”
    Fully connected layer + ReLU: layer width = 300
    Output = “fc1”
Output - 4th stage
    Input = “fc1”
    Out: layer width = 43

Figure. Network architecture

Figure. Diagram of network architecture

EXPERIMENT

A. Data Preparation

Currently, the GTSRB dataset has about 50,000 sample pictures in 43 classes. However, the number of images per class is uneven. The distribution of the dataset is shown below:

Figure. Number of inputs per class before balancing data

It can be seen that the classes differ considerably in size. We should therefore create some data to balance the number of inputs. We used a simple method to increase the number of images: rotating images by a few degrees. This is the distribution after this operation:

Figure. Number of inputs per class after balancing data

The data is more balanced, and each class has at least 500 images. This new dataset will help to train our network better. Additionally, all images are down-sampled or up-sampled to 32x32 (dataset sample sizes vary from 15x15 to 250x250) and converted to YUV space. The Y channel is then preprocessed with global and local contrast normalization, while the U and V channels are left unchanged.

B. Network optimization

After preparing the input data, we trained with Gradient Descent optimization on a simple dataset in order to optimize our network. We used 200 training epochs to test and calibrate it. During training, we
have tried changing the order of “Batch Normalization” and “Max Pooling” to compare differences in training speed (BP means “Conv → Batch Normalization → Max Pooling” and PB means “Conv → Max Pooling → Batch Normalization”). The results for the two arrangements are as follows:

Figure. Comparison between BP and PB

The chart clearly shows that the PB architecture is better than the BP architecture, so in this paper we use PB to design our architecture. After that, we examined how the network changes with different numbers of fully connected layers. We had assumed that a network with one more fully connected layer would be better, but the reality is the opposite.

Figure. Comparison of the number of fully connected layers

With our data, the network with one fully connected layer is better than the networks with none or two. This suggests that a more complex architecture does not necessarily give better results; we need to test to find a suitable architecture. After optimizing the network, we selected the network architecture described in section II.

C. Training and Result

After choosing the architecture and parameters, we trained on the dataset prepared above. The network was trained with 39,209 labeled samples and tested with 12,630 unlabeled samples. The final result is as follows:

>> Time to trainning: 4673.0710661411285s
>> Validation accuracy: 0.9854
>> Test accuracy: 0.9260
>> Time to process a picture: 0.253s

Figure. Loss and Accuracy of training process

The result shows that, after training and testing, the match rate on the validation data is 98.54% and the match rate on the test data is 92.6%.

The training process runs for nearly 40,000 steps, but the graph shows that from about the 10,000th step the loss and accuracy of the network change very slowly; this is the phase of
completion of the coefficients. Sometimes the loss increases and the accuracy decreases very quickly, then both return to their previous range. This is an anomaly, so during training the programmer should monitor these values and make sure they are stable before training stops, in order to get the best training results.

In this paper, we conducted experiments on a machine without a GPU. The results show that the processing time per image is about 0.253 s (3.95 fps). This is a good starting point for our next research: GPUs support parallel computing, so the current processing speed could be raised to real-time processing.

SUMMARY

In this paper, a simple architecture for traffic sign recognition is proposed. We conducted trials that changed the order of operations to find the best choice. With the same set of elements, the arrangement of those elements is very important for a CNN. In addition, more complexity is not always better; for each type of data, the network architecture must be adapted to be the most appropriate. Although the designed architecture is simple, it gives a good result. It has the following advantages: it is simple, easy to deploy in both high-level and low-level languages, uses few system resources, and has high processing speed. The accuracy of our architecture is 92.6%. This result is not especially high, but the architecture is much simpler than other architectures. It can be used on low-profile computers such as embedded computers or FPGAs. Before doing so, however, we will apply some filters and image processing tools as pre-processing to improve input quality.

In the next phase of research, we will rebuild our architecture in C/C++, which is better optimized for speed, continue to optimize the architecture, and address the next problems, such as sensor problems, case handling, automatic control… in order to build a model of self-driving vehicles. Finally, after solving the component problems, we will try to deploy it on some embedded computers and FPGAs to run a testing
device and evaluate performance.

REFERENCES

[1] Ian Goodfellow, Yoshua Bengio and Aaron Courville, “Deep Learning”, MIT Press, 2016.
[2] Jianxin Wu, LAMDA Group, National Key Lab for Novel Software Technology, “Introduction to Convolutional Neural Networks”, May 2017.
[3] http://benchmark.ini.rub.de/?section=gtsrb&subsection=news
[4] J. Torresen, J. W. Bakke and L. Sekanina, “Efficient recognition of speed limit signs”, Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems (IEEE Cat. No.04TH8749), 2004, pp. 652-656.
[5] De la Escalera, A., Moreno, L., Salichs, M., and Armingol, J., “Road traffic sign detection and classification”, IEEE Transactions on Industrial Electronics, pp. 848–859, 1997.
[6] R. Girshick, J. Donahue, T. Darrell and J. Malik, “Region-Based Convolutional Networks for Accurate Object Detection and Segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142-158, Jan. 2016.
[7] Sermanet, Pierre, and Yann LeCun, “Traffic sign recognition with multi-scale convolutional networks”, The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, 2011.
[8] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., “Gradient-based learning applied to document recognition”, Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[9] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, Greg Hullender, “Learning to Rank using Gradient Descent”, ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pp. 89–96, August 2005.
[10] https://www.tensorflow.org/

ABSTRACT

TRAFFIC SIGN RECOGNITION WITH CONVOLUTIONAL NEURAL NETWORK

In this research, we use a convolutional neural network (CNN) to build a traffic sign recognition program. This is the foundation for carrying out research on self-driving cars. A convolutional network is a neural network with a multi-layer architecture that additionally applies the convolution algorithm in its layers.
The network can automatically learn the features of objects. After building the network architecture, we used the TensorFlow library and the Python programming language as tools for testing. The test results show that the simple layered network architecture reaches an accuracy of 92.6%.

Keywords: CNN, Traffic sign recognition, Convolutional network, Self-driving car

Received date, 11th November, 2017
Revised manuscript, 10th December, 2017
Published, 26th February, 2018

Author affiliations:
Post and Telecommunication Institute of Technology, Km10, Nguyen Trai, Ha Đong, Ha Noi;
Telecommunication University, No.11 Mai Xuan Thuong, Nha Trang, Khanh Hoa
* Corresponding author: duanlc@ptit.edu.vn
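
As a supplementary illustration of the gradient descent procedure described in the introduction, the update $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$, with learning rate $\eta$, can be sketched in plain Python. This is a minimal sketch and not the TensorFlow training code used in the experiments: the quadratic objective, the starting point, and the names `gradient_descent`, `grad_j`, `learning_rate` and `steps` are assumptions chosen only for illustration.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Apply theta <- theta - learning_rate * grad(theta) for a fixed number of steps."""
    theta = list(theta)
    for _ in range(steps):
        gradient = grad(theta)
        # Move each parameter against its partial derivative.
        theta = [t - learning_rate * g for t, g in zip(theta, gradient)]
    return theta


def grad_j(theta):
    """Gradient of the hypothetical objective J(theta) = (theta0 - 3)^2 + (theta1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


# Start from the origin and walk downhill toward the minimizer (3, -1).
theta_min = gradient_descent(grad_j, [0.0, 0.0])
print(theta_min)
```

On this objective each step shrinks the distance to the minimizer by a constant factor, which mirrors the behaviour noted in the training section: fast progress at first, then very slow changes as the parameters settle.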