VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

GRADUATION THESIS

DESIGN AN EFFICIENT FPGA-BASED ACCELERATOR FOR REAL-TIME PARKING OCCUPANCY DETECTION

Major: COMPUTER ENGINEERING
Thesis committee: Computer Engineering
Supervisors: Assoc. Prof. Dr. Tran Ngoc Thinh, Mr. Huynh Phuc Nghi
Member secretary: Assoc. Prof. Dr. Pham Quoc Cuong
Student: Nguyen Vu Thanh Nguyen - 1652437

HO CHI MINH CITY, 01/2023

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY        SOCIALIST REPUBLIC OF VIETNAM
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY           Independence - Freedom - Happiness
Faculty: Computer Science and Engineering
Department: Computer Engineering

GRADUATION THESIS ASSIGNMENT

Note: students must attach this sheet to the first page of the thesis report.

FULL NAME: Nguyen Vu Thanh Nguyen        STUDENT ID: 1652437
MAJOR: Computer Engineering              CLASS:

1. Thesis title: Design an efficient FPGA-based accelerator for real-time parking occupancy detection
2. Tasks (content requirements and initial data):
   • Research on a BNN approach to the image classification task with the CNRPark dataset
   • Implementation of an encoder for input images and a set of weight parameters for parking solutions
   • Build and evaluate a hardware accelerator running on the Ultra96-V2 SoC with a pre-encoded 32x32 image set
3. Date of assignment: 19/09/2022
4. Date of completion: 09/01/2023
5. Supervisors: 1) Assoc. Prof. Dr. Trần Ngọc Thịnh   2) BEng. Huỳnh Phúc Nghị
Supervision scope: the content and requirements of the graduation thesis have been approved by the Department.

Date ..... / ..... / 2023
HEAD OF DEPARTMENT (signature and full name)        PRINCIPAL SUPERVISOR (signature and full name)

FOR THE FACULTY AND DEPARTMENT:
Reviewer (preliminary grading):        Unit:
Defense date:        Final grade:        Thesis archive location:

Commitment

I hereby declare that I worked on and am the sole author of this bachelor thesis, and that I have not used any sources other than those listed in the bibliography and identified as references. Other than that, the work presented is entirely my own.

Nguyen Vu Thanh Nguyen

Acknowledgements

This thesis reached its result thanks to my own continuous efforts and the support and encouragement of my lecturers, friends and family. I would like to express my sincere gratitude to those who have helped me throughout my study and research and while working on this thesis.

I would first like to thank my supervisors, Assoc. Professor Tran Ngoc Thinh and BEng. Huynh Phuc Nghi, who have consistently helped me by providing not only the necessary tools to complete the thesis but also a wealth of information and direction, so that I could move along the best route. They never stopped inspiring me and gave me the chance to take part in truly fascinating work that built on much of the knowledge I acquired during my time at university. They have always been kind, understanding teachers who encouraged me to make changes to the work and to this thesis as needed. Working with them and gaining expertise under their guidance was an honor for me.

In addition to my supervisors, I would also like to acknowledge the council members at the thesis defense for their wise criticism and suggestions that helped me improve my work. Once more, I want to express my gratitude to all of the instructors in the Faculty of Computer Science and Engineering, as well as to all of the instructors at Ho Chi Minh City University of Technology, for their commitment to teaching and helping me learn the fundamentals of engineering. I also want to thank every one of my friends for being my mentors and helpers during my time at the institution.
Finally, I would like to thank my parents for always providing a positive environment for my development and for supporting me when I faced difficulties in both my academic and personal life.

Nguyen Vu Thanh Nguyen

Abstract

This thesis proposes, studies and examines an approach to developing an edge-AI smart parking solution, including hardware and software components, by implementing the FracBNN+CNRPark model with hardware acceleration on the Ultra96-V2 board. This approach allows end users to monitor and detect busy and free parking spaces automatically via security cameras. The image classification model runs entirely on the edge, on the Ultra96-V2 board, without the help of a server workstation.

Contents

Commitment
Acknowledgements
Abstract
List of Figures 10
List of Tables 11
Terms 12
1 Introduction
  1.1 Purpose and Motivation
  1.2 Scope and Objectives
    1.2.1 Problem Statements
    1.2.2 Objectives
  1.3 Structure of Thesis
2 Background knowledge and Terminology
  2.1 Software - Artificial Intelligence
    2.1.1 The development of AI
    2.1.2 Neural Network (NN)
    2.1.3 Convolutional Neural Network (CNN)
    2.1.4 Binary Neural Network (BNN) 15
  2.2 Smart Parking concepts 16
    2.2.1 Smart Parking 16
    2.2.2 Edge AI 17
  2.3 Hardware and constraints 18
    2.3.1 FPGA and SoC 18
    2.3.2 GPU 21
  2.4 Tools and Frameworks 22
    2.4.1 Pytorch 22
    2.4.2 Vivado and Vivado HLS 23
    2.4.3 PYNQ and BNN-PYNQ 24
  2.5 Performance Criteria 25
    2.5.1 Recall, Precision & F1-score 25
    2.5.2 Average IoU & mean Average Precision (mAP) 26
    2.5.3 Power consumption 27
    2.5.3.1 FPS & Latency 27
3 Literature Review 29
  3.1 Smart Parking Related works 29
    3.1.1 International Solutions 29
      3.1.1.1 Moscow Parking (Source: https://parking.mos.ru/) 29
      3.1.1.2 SENSIT 30
      3.1.1.3 SFPark 30
      3.1.1.4 Cisco Smart+Connected City Parking 30
    3.1.2 Solutions in Vietnam 31
      3.1.2.1 My Parking 31
      3.1.2.2 IParking 31
    3.1.3 Smart Parking systems with Image Processing 31
  3.2 Previous Group's Thesis Result 33
  3.3 Vien's Thesis 34
    3.3.0.1 License Plate Dataset 35
    3.3.0.2 YOLOv3 Object Detection Model 35
    3.3.0.3 Implementing Vien's approach 38
    3.3.0.4 Comparing YOLOv3 on Ultra96-V2 and Jetson Nano 40
  3.4 The Original BNN Model 41
  3.5 Improved BNN Models 43
  3.6 FracBNN (Dec 2020) 45
4 Methodology 49
  4.1 The proposed solution - Previous thesis group 49
  4.2 FracBNN+CNRPark 49
    4.2.1 Model Architecture 50
    4.2.2 Hardware Accelerator Architecture 52
  4.3 Dataset 54
    4.3.1 CNRPark+EXT 54
5 Implementation 56
  5.1 FracBNN+CNRPark model on Pytorch 56
    5.1.1 Training the model 56
      5.1.1.1 Training dataset 56
      5.1.1.2 Training routine 58
  5.2 Hardware Acceleration on Ultra96-V2 60
    5.2.1 Weights and Bias processing 61
    5.2.2 Thermometer Encoding 63
    5.2.3 Building model on Vivado HLS and Vivado 64
    5.2.4 Inference on the Ultra96-V2 66
  5.3 Evaluation 66
    5.3.1 Training result 66
    5.3.2 Hardware acceleration result 67

5.2 Hardware Acceleration on Ultra96-V2

Figure 5.2: Generating an FPGA accelerator from trained FracBNN

– weights_fracnet_64.h: contains the weights and biases for each layer in layer.h, type FP32.
– conv_weights.h: contains the convolutional weights, fixed-point encoded with a 64-bit hex data type; smaller weights (32 or 16 bits) are padded with 0s toward the LSB.
• Makefile and TCL scripts

The first step is to generate an IP from Vivado HLS based on the above C files. Changes are to be made in bnn.h and bnn_tiled.cc for the fully connected layer, as the FracBNN+CNRPark model only has the two output classes 'free' and 'busy'.

5.2.1 Weights and Bias processing
In order to generate the IP from Vivado HLS, the weights and biases in weights_fracnet_64.h and conv_weights.h need to be processed, as these files carry the training result of the model. They need to be integrated into the model on the hardware for the inference task of detecting the 'busy' and 'free' classes.

The weights and biases in weights_fracnet_64.h are composed of lists of ... x 16 weights with datatype FP32. These weights are then fed into the layers in layer.h. To do this, the weights and biases need to be extracted from the '.pt' file. The sizes of the layers in the pretrained model also match those in weights_fracnet_64.h (... x 16, datatype FP32), so I only need to write a script that extracts the weights from the pretrained model.

The extraction is done by loading the saved state dictionary through torch.load(). As the checkpoint contains only the weights and biases plus other information such as 'epochs', 'optimizer' and 'scheduler', I have to load the said weights and biases into the model to recover the structural information, i.e. which layer holds which weights or biases. Running in eval() mode is then necessary so that the model does not resume training in PyTorch. After that, the weights and biases are extracted as (name, data) pairs from PyTorch's model.named_parameters() method, where name is the name of each layer and data holds the weight or bias values. These data are then appended to a '.h' file and converted to C syntax. All of these tasks are done by a hand-written Python script.

The weights in conv_weights.h are a bit different: the datatype in the state dictionary is FP32, whereas conv_weights.h uses the fixed-point, 64-bit hex encoding in which smaller weights (32 or 16 bits) are padded with 0s toward the LSB. The weights in conv_weights.h are also sized differently: 45 x 16 x 3 x 3, meaning that each 16 x 3 x 3 block of convolutional weights is looped 45 times in the code. After investigating the state dictionary, the convolutional weight shapes and their assigned loops are as follows:

name                   | data                        | loops
conv1.weight           | torch.Size([16, 96, 3, 3])  | 1 to 3
layer1.0.conv1.weight  | torch.Size([16, 16, 3, 3])  | 4
layer1.0.conv2.weight  | torch.Size([16, 16, 3, 3])  | 5
layer1.1.conv1.weight  | torch.Size([16, 16, 3, 3])  | 6
layer1.1.conv2.weight  | torch.Size([16, 16, 3, 3])  | 7
layer1.2.conv1.weight  | torch.Size([16, 16, 3, 3])  | 8
layer1.2.conv2.weight  | torch.Size([16, 16, 3, 3])  | 9
layer2.0.conv1.weight  | torch.Size([32, 16, 3, 3])  | 10 and 11
layer2.0.conv2.weight  | torch.Size([32, 32, 3, 3])  | 12 and 13
layer2.1.conv1.weight  | torch.Size([32, 32, 3, 3])  | 14 and 15
layer2.1.conv2.weight  | torch.Size([32, 32, 3, 3])  | 16 and 17
layer2.2.conv1.weight  | torch.Size([32, 32, 3, 3])  | 18 and 19
layer2.2.conv2.weight  | torch.Size([32, 32, 3, 3])  | 20 and 21
layer3.0.conv1.weight  | torch.Size([64, 32, 3, 3])  | 22 to 25
layer3.0.conv2.weight  | torch.Size([64, 64, 3, 3])  | 26 to 29
layer3.1.conv1.weight  | torch.Size([64, 64, 3, 3])  | 30 to 33
layer3.1.conv2.weight  | torch.Size([64, 64, 3, 3])  | 34 to 37
layer3.2.conv1.weight  | torch.Size([64, 64, 3, 3])  | 38 to 41
layer3.2.conv2.weight  | torch.Size([64, 64, 3, 3])  | 42 to 45

The element data type is determined by the second size attribute (16, 32 or 64 bits), except for the first convolution layer conv1.weight: it has size 16 x 96 x 3 x 3, which has to be transposed to 16 x 3 x 3 x 96 and is then divided into 3 blocks of size 16 x 3 x 3 x 32, meaning each element has 32 bits. The subsequent layers are also divided to fit the loops: layers with 32 output channels are divided into 2 loops and those with 64 into 4. The elements of loops 1 to 3 are 32 bits, loops 4 to 11 are 16 bits, loops 12 to 25 are 32 bits, and the rest are 64 bits. Each weight value is binarized with the sign function

$$x^{b} = \mathrm{Sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \quad (5.1)$$

After determining the size, the elements from the state dictionary (data type FP32) are quantized into 0 and 1 with a threshold of 0, following Equation (5.1). The weights are then converted from binary into a 64-bit hexadecimal representation; for elements with fewer bits, the remaining digits are padded with 0s. The result is then appended to conv_weights.h as input for the IP generation in Vivado HLS.
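A minimal sketch of this extraction-and-packing flow is shown below, only to illustrate the idea. The checkpoint file name, the model constructor build_fracbnn_cnrpark(), and the flat 64-at-a-time packing are my own assumptions; the actual script follows the per-layer word widths and loop layout listed in the table above.

```python
import torch

# Load the checkpoint saved during training; besides the state dictionary it
# may also carry 'epochs', 'optimizer' and 'scheduler' entries.
ckpt = torch.load("fracbnn_cnrpark.pt", map_location="cpu")   # assumed file name
state_dict = ckpt.get("state_dict", ckpt)

model = build_fracbnn_cnrpark()            # assumed constructor of the PyTorch model
model.load_state_dict(state_dict, strict=False)
model.eval()                               # make sure training is not resumed

# 1) FP32 weights and biases for weights_fracnet_64.h
with open("weights_fracnet_64.h", "a") as f:
    for name, data in model.named_parameters():
        values = ", ".join(f"{v:.8f}" for v in data.detach().flatten().tolist())
        f.write(f"const float {name.replace('.', '_')}[{data.numel()}] = {{{values}}};\n")

def pack_bits(bits):
    """Pack up to 64 binary values MSB-first into one 64-bit hex literal,
    padding the unused low-order bits with 0s."""
    word = 0
    for i, b in enumerate(bits):
        word |= int(b) << (63 - i)
    return f"0x{word:016x}"

# 2) Binarized convolution weights for conv_weights.h (Eq. 5.1: 1 if x > 0, else 0).
#    Here the bits are simply packed 64 at a time; the real header groups them
#    into 16/32/64-bit words per layer as described above.
with open("conv_weights.h", "a") as f:
    for name, data in model.named_parameters():
        if "conv" not in name or data.dim() != 4:
            continue
        bits = (data.detach().flatten() > 0).tolist()      # threshold 0
        words = ", ".join(pack_bits(bits[i:i + 64]) for i in range(0, len(bits), 64))
        f.write(f"// {name}\n{words},\n")
```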
5.2.2 Thermometer Encoding

The input features from the evaluation images also have to be thermometer encoded. This encoding is done via a custom-written Python function. In thermometer encoding the features are encoded as binary strings of 1s and 0s; the difference from fixed-point binary is that the 1s and 0s are split into a part of only 1s and a part of only 0s, and the divider between them is called the threshold. The features are then saved to an '.npy' file of size (n x 3 x 32 x 32) together with l, where n is the number of test images, 3 is the number of RGB color channels, 32 x 32 is height x width, and 'l' is the image label.

The first step is loading the images. Using Pytorch's ImageFolder and DataLoader, the images are resized to 32x32 and converted with toTensor(). Each image tensor is then scaled back to 0-255 pixel values for encoding.

Figure 5.3: Thermometer encoder workflow

A loop is run through each image to encode it via the thermometer encoding described in the FracBNN paper. A resolution is chosen based on how "accurate" the encoded images should be: with resolution=32 each tensor element is encoded into an 8-bit string, whereas with resolution=8 each element is encoded into a 32-bit thermometer code. The chosen setting was the 32-bit code, to keep consistency with the paper's results. Secondly, using a placeholder array running from 0 to round(255/resolution), in this case 32 levels, an array of shape (3 x 32 x 32 x 32) is created. Using numpy's einsum() function, the 1s are kept very efficiently and the 0s are omitted, since they are not needed. Afterwards, 0s are padded onto the encoded 1s up to a length of 64 bits, to stay consistent with the paper: the MSBs are 1s and the LSBs are 0s. As a result, arrays of size (n x 3 x 32 x 32) whose elements are 64-bit strings are formed. This method encodes 1000 images and writes the result to "input_images.npy"; the labels are also saved in binary form in "labels.bin". The input features are now ready to act as input for inference on the board.
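A minimal sketch of such an encoder is given below, assuming a 32-level code padded to 64 bits as described above. The dataset path, the exact threshold comparison and the uint64 packing are my own assumptions rather than the thesis' exact implementation.

```python
import numpy as np
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

RESOLUTION = 8                          # step size; round(255 / 8) = 32 thermometer levels
LEVELS = round(255 / RESOLUTION)        # 32

tfm = transforms.Compose([transforms.Resize((32, 32)), transforms.ToTensor()])
dataset = datasets.ImageFolder("cnrpark_test/", transform=tfm)   # assumed folder layout
loader = DataLoader(dataset, batch_size=1, shuffle=False)

# MSB-first bit weights: the 32 thermometer bits occupy bits 63..32 of a uint64,
# so the remaining LSBs are the required zero padding.
weights = np.array([1 << (63 - k) for k in range(LEVELS)], dtype=np.uint64)

encoded, labels = [], []
for img, label in loader:
    pixels = (img[0] * 255.0).round().numpy()                    # (3, 32, 32), values 0..255
    thresholds = np.arange(LEVELS) * RESOLUTION                  # (32,) levels 0, 8, 16, ...
    therm = (pixels[..., None] > thresholds).astype(np.uint64)   # (3, 32, 32, 32) thermometer bits
    words = np.einsum("chwk,k->chw", therm, weights)             # pack each 32-bit code into a uint64
    encoded.append(words)
    labels.append(int(label))

np.save("input_images.npy", np.stack(encoded))                   # (n, 3, 32, 32) uint64
np.asarray(labels, dtype=np.uint8).tofile("labels.bin")
```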
5.2.3 Building the model on Vivado HLS and Vivado

For version compatibility, I decided to use Vivado 2019.2 with Vivado HLS rather than the latest Vivado 2022.2 with Vitis HLS. The C files are imported into Vivado HLS for C synthesis and RTL IP export. When creating the project, the top function is chosen as 'FracNet_T', with a clock of 4 ns and a clock uncertainty of 12.5%. After C synthesis has run and the RTL has been exported, the resulting IP has the estimated hardware resource utilization shown in Figure 5.4.

Figure 5.4: Vivado HLS Utilization Estimate

The IP appeared to have a resource over-utilization problem, as it exceeded the LUTs and BRAM available on the Ultra96-V2. However, these turned out to be only estimated values, and the resulting build is far below them. The exported IP was then run through the script prepared by the author of the FracBNN solution, with only a minor configuration change in the C files: the fully connected layer was set to 2 output classes instead of the 10 of the CIFAR-10 acceleration solution. By running the FracNet.tcl script, the prepared IP was built into a bitstream for board deployment. The top module is called design_1_wrapper.v, with a timescale of 1 ps / 1 ps; it consists only of the IP prepared through Vivado HLS, without any extensions. The bitstream generation produced the bitstream file design_1_wrapper.bit and the hardware definition file design_1_wrapper.hwh for board deployment. The actual resource utilization was found in utilization_placed.rpt, as shown in Table 5.1.

Table 5.1: Resource utilization on Ultra96v2

Site type (available)   | FracBNN CNRPark (1/1.5) | FracBNN CIFAR10 (1/1.4) | FINN SVM (1/1) | FINN SVM (1/2) | FINN SVM (2/2)
LUTs (70560)            | 49324 (69.9%)           | 51475 (72.9%)           | 28229 (40.0%)  | 44067 (62.45%) | 34609 (49.1%)
LUTRAM (28800)          | 1553 (5.4%)             | 1564 (5.4%)             | 1184 (4.1%)    | 2080 (7.22%)   | 3496 (12.1%)
CLB Registers (141120)  | 38849 (27.5%)           | 39618 (28.1%)           | 40498 (28.7%)  | 50741 (35.96%) | 47794 (33.9%)
DSPs (360)              | 139 (38.6%)             | 39173 (27.7%)           | 24 (6.7%)      | 26 (7.22%)     | 32 (8.9%)
BRAM (216)              | 216 (100%)              | 216 (100%)              | 124 (57.41%)   | 131 (60.88%)   | 154 (71.5%)

Based on Table 5.1, the FracBNN+CNRPark accelerator has somewhat worse resource utilization than the previous group's work. This could be due to an ineffective synthesis procedure when pushing the C++ code through Vivado HLS: as the estimate showed, the required resources exceeded the available resources, and the bitstream generation process optimized them down to the point of 100% utilization.

5.2.4 Inference on the Ultra96-V2

The model deployment is done via PYNQ on the Ultra96v2. After flashing PYNQ onto the SD card, the Ultra96v2 FPGA is ready for model deployment; the board is connected to the PC via a USB-microUSB cable. The bitstream, the hardware definition file and the input features are copied to the board for inference, and a deployment script is written in Python for this purpose. The allocate library of PYNQ is used for the next steps. The register "image_V" is used for loading the encoded images: it is set to the physical address of the allocated encoded-image array, of shape 5000 x 3 x 32 x 32 (5000 images are used for evaluation). Similarly, the "output_r" register is set to the physical address of the allocated result array, in order to compare the predictions with the labels.
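A condensed sketch of such a deployment script is shown below. Overlay and allocate are the standard PYNQ APIs; the IP handle name (FracNet_T_0), the register-map field spellings and the single-run processing of the whole image array are assumptions on my part, guided by the register names mentioned above.

```python
import time
import numpy as np
from pynq import Overlay, allocate

overlay = Overlay("design_1_wrapper.bit")     # design_1_wrapper.hwh must sit next to it
ip = overlay.FracNet_T_0                      # assumed handle name of the HLS IP

images = np.load("input_images.npy")          # (5000, 3, 32, 32) uint64, thermometer encoded
labels = np.fromfile("labels.bin", dtype=np.uint8)
num_tests = images.shape[0]

in_buf = allocate(shape=images.shape, dtype=np.uint64)
out_buf = allocate(shape=(num_tests,), dtype=np.uint32)
in_buf[:] = images
in_buf.flush()

# Pass the physical addresses of the buffers to the accelerator registers
# (field names assumed; 64-bit pointer arguments may be split into _1/_2 words).
ip.register_map.image_V = in_buf.physical_address
ip.register_map.output_r = out_buf.physical_address

start = time.time()
ip.register_map.CTRL.AP_START = 1
while not ip.register_map.CTRL.AP_IDLE:       # wait until the accelerator is idle again
    pass
t = time.time() - start
out_buf.invalidate()

correct = int((out_buf[:num_tests] == labels[:num_tests]).sum())
print(f"Latency:    {t / num_tests * 1000:.4f} ms")       # t / num_tests x 1000
print(f"Throughput: {1 / (t / num_tests):.2f} fps")       # 1 / (t / num_tests)
print(f"Accuracy:   {correct / num_tests * 100:.2f} %")   # correct / num_tests x 100
```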
5.3 Evaluation

5.3.1 Training result

The initial training of the model used 20 epochs and a batch size of 32, along with the default values for the rest of the arguments. The model achieved an accuracy of 99.1% on the validation dataset. The trained model is suspected to be overfitted, so the learning rate and the loss curve need to be investigated further to confirm whether this is the case.

Another metric unique to this FracBNN+CNRPark model is the sparsity rate. This rate is essential to the fractional convolution activations, as it represents the fractional ratio between 2-bit and 1-bit activations: the fractional convolution is basically a binary convolution if the sparsity is 100%, whereas if the sparsity is 0% it degenerates to a 2-bit convolution. In the trained model the observed sparsity is 50.2%, meaning about half of the activations are binary and the other half are 2-bit. This is the best scenario for the FracBNN model, as it keeps the accuracy potential of the 2-bit activations while still being optimized by the 1-bit activations.

Table 5.2: Accuracy comparison between models with input size 32x32

Model            | Input precision (size 32x32) | Bits (W/A) | Training dataset | Testing dataset | Accuracy
CNV              | FP16                         | 1/1        | CNRPark          | CNRPark         | 83.35%
CNV              | FP16                         | 1/2        | CNRPark          | CNRPark         | 78.7%
CNV              | FP16                         | 2/2        | CNRPark          | CNRPark         | 79.37%
FracBNN          | Thermometer encoded uint64   | 1/1.4      | CIFAR-10         | CIFAR-10        | 89.1%
FracBNN (on GPU) | FP32                         | 1/1.5      | CNRPark          | CNRPark         | 99.1%
FracBNN          | Thermometer encoded uint64   | 1/1.4      | CNRPark          | CNRPark         | 73.5%

5.3.2 Hardware acceleration result

The inference ran successfully on the Ultra96v2 board and produced prediction results. The performance metrics used for evaluation were latency, throughput and accuracy. Timing of the on-board inference was done with the Python time library, using the CTRL.AP_START and CTRL.AP_IDLE registers to mark the start of inference and the idle period after it. Latency was calculated as t / num_tests x 1000 in milliseconds, and Throughput = 1 / (t / num_tests). A prediction variable was checked in each iteration against the labels array, giving Accuracy = correct predictions / num_tests x 100 as a percentage. The inference results are depicted in Table 5.3.

Table 5.3: Inference results of prospective models on Ultra96v2

Model (input 32x32)                              | Bits (W/A) | Latency (ms) | Throughput (fps)
CNV                                              | 1/1        | 19.41        | 51.53
CNV                                              | 1/2        | 18.02        | 55.49
CNV                                              | 2/2        | 21.38        | 46.76
Inception-v1 [28]                                | FP32       | 25.14        | 39.77
Resnet50 [5]                                     | FP32       | 47.71        | 20.96
YOLOv3 [19]                                      | FP32       | 354.74       | 2.82
FracBNN+CIFAR10 (without SD card loading time)   | 1/1.4      | 0.3563       | 2806.90
FracBNN+CIFAR10                                  | 1/1.4      | 1.6567       | 603.62
FracBNN+CNRPark (without SD card loading time)   | 1/1.5      | 0.3100       | 3225.32
FracBNN+CNRPark (on GPU)                         | 1/1.5      | 1.0120       | 988.14
FracBNN+CNRPark                                  | 1/1.5      | 0.4689       | 2132.75

Table 5.4: Power consumption on Ultra96v2 of different models

Model           | Bits (W/A) | Input images               | Power (W)
CNV             | 1/1        | FP16                       | 2.954
CNV             | 1/2        | FP16                       | 4.216
CNV             | 2/2        | FP16                       | 3.017
FracBNN+CIFAR10 | 1/1.4      | UINT64 thermometer encoded | 4.1

6 Conclusion

6.1 Summary

In this thesis I have further improved the Edge layer of the Edge-AI solution proposed by the previous thesis group. They built a very plausible Smart Parking system with layers such as Edge-AI computing on SoC devices, a cloud-based server for parking services, and an Android mobile application with a management website. To improve the Edge layer on the SoC, I utilized the ideas of the FracBNN paper and achieved results on both the software and the hardware implementation. My contributions are as follows:

6.1.1 Comparing to Vien's thesis

Vien's thesis implemented a different solution from mine: a license plate detection model for Smart Parking purposes. The solution was built on the now-obsolete YOLOv3 object detection model and achieved acceptable accuracy and real-time capability. Vien's approach has drawbacks, as license plates can be hidden when a car park is viewed by a security camera, which makes it not very viable for a Smart Parking solution. It could work if a camera were installed at the entrance to monitor and count each car going in or out of an indoor parking lot, compared against a number of free spaces known beforehand; such a setup, however, could easily be replaced by an electronic ticketing system. It also would not work for outdoor parking spaces (on the roadside), since there is no control over whether people drive in and out of the parking lot. The only way to make it work is to use an array of cameras, all pointing at fixed locations in a controlled way, so as to always see the car license plates.
This, I think, is highly implausible in a real-world Smart Parking scenario.

6.1.2 Trained and implemented the FracBNN+CNRPark model on Pytorch, run on GPU

The FracBNN paper proposed an experimental solution with its own architecture/topology, FracBNN. The model was trained on both a small dataset (CIFAR-10) and a big dataset (ImageNet). The FracBNN+CIFAR10 model achieved a very good result of 89.1% for a BNN. This solution is very suitable for my custom dataset, CNRPark, and my custom purpose of classifying the classes busy and free. I successfully configured and trained the FracBNN+CNRPark model; the configuration was done mainly on the fully connected layer, changing the 10 output classes to 2 output classes. The result achieved when implementing on Pytorch was 99.1% accuracy when testing and validating on the same CNRPark dataset. Latency and throughput were also measured at 1.01 ms and 988 fps when running on the GPU.

6.1.3 Implemented hardware acceleration via Vivado HLS on the Ultra96v2

With the pretrained model, I successfully implemented the hardware acceleration on the Ultra96v2. The hardware accelerator architecture is the one proposed in the FracBNN paper; I only configured the fully connected layer to 2 outputs, suitable for classifying free and busy parking spaces. Using the C++ files provided by the FracBNN paper, I used Vivado HLS to create an IP for the model on hardware. This process requires the float-precision weights and biases to be extracted from the pretrained model on the software side, a task completed using parameter-inspection functions in Pytorch. With Vivado, a bitstream file is created from the IP generated by Vivado HLS, ready to go through inference on the Ultra96v2.

6.1.4 Achieved inference results using the built thermometer encoder on the Ultra96v2

As instructed in the FracBNN paper, 64-bit thermometer-encoded inputs and labels are needed for inference testing on the Ultra96v2. To satisfy this need, I built a parallel thermometer encoder using the numpy and Pytorch libraries. The encoder converts floating-point pixel values into 64-bit thermometer-encoded values. The binarized input, as described and tested in the paper, greatly reduced the latency and hugely increased the throughput of the inference process: the results are 0.4689 ms and 2132.75 fps, respectively. This is a 2x performance increase over running on the GPU, and almost a 42x performance increase over the CNV 1-bit/1-bit accelerator of the previous thesis group, without much accuracy drawback.

6.2 Future Improvements

6.2.1 Accuracy degradation

A degradation in accuracy was found after moving from the software implementation to hardware acceleration, from 99.1% to 73.5%. This could be the result of the weights and bias handling and of the thermometer encoding. Especially in the thermometer encoding step, the input features were encoded from float32 to binary32 and then zero-padded to binary64, which discards a lot of precision in the features used for inference. This, however, is an inherent problem of BNNs in general, also observed when binarizing inputs via fixed-point encoding. With thermometer encoding the weights are also preserved, so this is still a better result than a fixed-point encoding would have given. The same accuracy degradation is observed in a controlled testing environment: I tried my thermometer encoder on the CIFAR-10 dataset along with the pretrained FracBNN+CIFAR10 model. The expected accuracy was 89.1%, as reported in the paper; however, when running inference on the Ultra96v2, the result only reached 87% accuracy. Over 2% was lost, so further inspection of my thermometer encoder is needed.

6.2.2 Real-time application
The hardware acceleration model has shown a very low latency of 0.3067 ms and a very high throughput of 3260 fps. This, however, is due to the images being only 32x32 in this image classification problem. For real-world usage, an object detection solution must be devised for parking spot detection. This would involve the use of sliding windows on each frame of the security camera images. Even for a 720p security camera, with padding=1 and stride=1, a sliding window would need about 900 iterations for each frame of a video, so the frame rate of the solution would be reduced to around 36 fps. Further acceleration techniques for the object detection problem are needed in this case. Furthermore, the image encoding step was done outside of the FPGA; in a real-time application each image, i.e. each camera frame, would have to be encoded sequentially, which would further slow down the application. The power consumption of the system has also not yet been measured for this customized 2-class purpose.
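For reference, one way to arrive at the roughly 900-window figure (my own reading, since the exact window stride is not spelled out above) is to tile a 1280x720 frame with non-overlapping 32x32 windows:

$$\frac{1280}{32} \times \frac{720}{32} = 40 \times 22.5 \approx 900 \ \text{windows per frame.}$$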
Bibliography

[1] L. Pettersson, "Convolutional neural networks on FPGA and GPU on the edge: A comparison," Ph.D. dissertation, 2020. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-414608
[2] M. Courbariaux and Y. Bengio, "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016. [Online]. Available: http://arxiv.org/abs/1602.02830
[3] Y. Zhang, J. Pan, X. Liu, H. Chen, D. Chen, and Z. Zhang, "FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations," in The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2021.
[4] T. N. Thinh, L. Tan Le, N. H. Long, H. Le Thuc Quyen, N. Q. Thu, N. La Thong, and H. P. Nghi, "An edge-AI heterogeneous solution for real-time parking occupancy detection," in 2021 International Conference on Advanced Technologies for Communications (ATC), 2021, pp. 51–56.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[6] Z. Liu, Z. Shen, M. Savvides, and K.-T. Cheng, "ReActNet: Towards precise binary neural network with generalized activation functions," in European Conference on Computer Vision. Springer, 2020, pp. 143–159.
[7] K. P. Shung (2018). Accuracy, precision, recall or F1? [Online]. Available: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
[8] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia, "BNN+: Improved binary network training," 2019. [Online]. Available: https://openreview.net/forum?id=SJfHg2A5tQ
[9] T. N. Thinh, L. Tan Le, N. H. Long, H. Le Thuc Quyen, N. Q. Thu, N. La Thong, and H. P. Nghi, "An edge-AI heterogeneous solution for real-time parking occupancy detection," pp. 51–56, 2021.
[10] S. Saha (2018). A comprehensive guide to convolutional neural networks. [Online]. Available: https://towardsdatascience.com/acomprehensive-guide-to-convolutional-neuralnetworks-the-eli5-way-3bd2b1164a53
[11] T. Yeung (2022). What is edge AI. [Online]. Available: https://blogs.nvidia.com/blog/2022/02/17/what-is-edge-ai/
[12] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. H. W. Leong, M. Jahre, and K. A. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," CoRR, vol. abs/1612.07119, 2016. [Online]. Available: http://arxiv.org/abs/1612.07119
[13] "F-score," May 2022. [Online]. Available: https://en.wikipedia.org/wiki/F-score
[14] E. Hofesmann (2020). IoU: a better detection evaluation metric. [Online]. Available: https://towardsdatascience.com/iou-a-better-detection-evaluation-metric-45a511185be1
[15] R. Bonghi. [Online]. Available: https://github.com/rbonghi/jetson_stats
[16] NVIDIA Corporation (2019). NVIDIA DRIVE OS 5.1 documentation. [Online]. Available: https://docs.nvidia.com/drive/drive_os_5.1.6.1L/nvvib_docs/index.html#page/DRIVE_OS_Linux_SDK_Development_Guide/Utilities/util_tegrastats.html
[17] N. T. Vien, "An edge hardware acceleration solution for smart parking," 2022.
[18] T. H. Quan. [Online]. Available: https://github.com/winter2897/Real-time-Auto-License-Plate-Recognition-with-Jetson-Nano
[19] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv, 2018.
[20] J. Redmon, "Darknet: Open source neural networks in C," http://pjreddie.com/darknet/, 2013–2016.
[21] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020.
[22] M. Courbariaux, Y. Bengio, and J. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," CoRR, vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/abs/1511.00363
[23] T. Simons and D.-J. Lee, "A review of binarized neural networks," Electronics, vol. 8, no. 6, 2019. [Online]. Available: https://www.mdpi.com/2079-9292/8/6/661
[24] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," CoRR, vol. abs/1606.06160, 2016. [Online]. Available: http://arxiv.org/abs/1606.06160
[25] Y. Liu, H. Chen, C. Shen, T. He, L. Jin, and L. Wang, "ABCNet: Real-time scene text spotting with adaptive Bezier-curve network," CoRR, vol. abs/2002.10200, 2020. [Online]. Available: https://arxiv.org/abs/2002.10200
[26] G. Amato, F. Carrara, F. Falchi, C. Gennaro, C. Meghini, and C. Vairo, "Deep learning for decentralized parking lot occupancy detection," Expert Systems with Applications, vol. 72, pp. 327–334, 2017.
[27] A. Krizhevsky, V. Nair, and G. Hinton, "CIFAR-10 (Canadian Institute for Advanced Research)." [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[28] S. Yadav, R. Rathod, S. Pawar, V. Pawar, and S. More, "Application of deep convolutional neural network in medical image classification," April 2021.