A Novel Hardware Architecture for Human Detection using HOG-SVM Co-Optimization

Ngo-Doanh Nguyen, Duy-Hieu Bui, Xuan-Tu Tran
SISLAB, VNU University of Engineering and Technology, 144 Xuan Thuy road, Cau Giay, Hanoi, Vietnam
Corresponding author's email: tutx@vnu.edu.vn

Abstract: Histogram of Oriented Gradient (HOG) in combination with Support Vector Machine (SVM) has been used as an efficient method for object detection in general and human detection in particular. Human detection using HOG-SVM in hardware shows a high classification rate at higher throughput when compared with deep learning methods. However, data dependencies and complicated arithmetic in HOG feature generation and SVM classification limit the maximum throughput of these applications. In this paper, we propose a novel high-throughput hardware architecture for human detection by co-optimizing HOG feature generation and SVM classification. The throughput is improved by using a fast, highly parallel and low-cost HOG feature generation in combination with a modified datapath for parallel computation of SVM and HOG feature normalization. The proposed architecture has been implemented in TSMC 65nm technology with a maximum operating frequency of 500MHz and a throughput of 139fps for Full-HD resolution. The hardware area cost is about 145kGEs along with 242kb of SRAM.

Index Terms: Artificial Intelligence, Histogram of Oriented Gradient, Support Vector Machine, HOG, SVM

978-1-7281-2940-2/19/$31.00 ©2019 IEEE, APCCAS2019

I. INTRODUCTION

Human detection is a crucial part of surveillance and automotive systems such as smart cars and security systems. To detect a human, each frame from the camera is analyzed to decide whether it contains a person. These embedded-vision applications are often implemented in hardware to maximize the throughput and meet the real-time requirement. However, the current hardware implementations of human detection such as HOG-SVM and deep Convolutional Neural Networks (CNN) are limited in throughput because of the data dependencies and the large number of operations required. Between the two methods, HOG-SVM has an advantage over CNN because it requires fewer operations and has fewer data dependencies [1]. In contrast, CNN provides more robust solutions with increasing accuracy [1]. Therefore, HOG-SVM is more suitable for embedded-vision applications than CNN, since it offers high throughput and low power consumption in hardware.

Human detection using HOG-SVM first describes the detection windows using HOG descriptors [2], then applies SVM classification to decide whether a window contains a person. The original HOG feature generation process contains many complicated arithmetic functions such as inverse tangent, square, square root, and division. SVM classification can only start when the HOG features are available. This data dependency further limits the throughput of the system and increases the memory needed to store and reuse HOG features [3].

Many research works have focused on optimizing HOG feature generation and SVM for hardware implementation of human detection. To increase the throughput of HOG feature generation, the trigonometric functions can be approximated using CORDIC [4], [5] or other approximate algorithms such as those in [5]–[8]. Approximate computation increases the throughput and reduces hardware complexity, but it reduces the accuracy of the HOG features. Throughput can also be increased by reusing the generated HOG features of the overlapped cells of the detection windows [4], [5]. Finally, a multi-core setup along with parallel SVM computation can be used to increase the throughput further [4], [9]. However, in these works, HOG feature generation and SVM are still calculated separately.

In this paper, we propose a novel architecture for calculating HOG features and performing detection using SVM by co-optimizing the two processes to increase the throughput. The throughput is improved by using a fast, highly parallel and low-cost HOG feature generation in combination with a modified datapath for parallel computation of SVM and HOG feature normalization with data reuse. The SVM calculation for a block is performed on the unnormalized HOG features in parallel with the HOG normalization. Seven SVM modules are utilized for fast classification on the reused data of the overlapped cells of two windows. In addition, the area is optimized by using a Sequential Multiply-and-Accumulate (SMAC) unit for SVM and HOG normalization. The proposed architecture has been implemented using TSMC 65nm technology with a maximum operating frequency of 500MHz and a maximum throughput of 139fps for Full-HD video, with a hardware cost of 145kGEs and about 242kb of SRAM.

The rest of this paper is organized as follows. Section II reviews the state of the art of HOG-SVM hardware implementation. Our proposed hardware architecture to improve the throughput of HOG-SVM is described in Section III. Section IV presents our hardware implementation results. Finally, conclusions and perspectives are given in Section V.

II. STATE OF THE ART

A. Overview of HOG-SVM algorithm

Human detection using HOG-SVM was first proposed by Dalal and Triggs in [2]. Their processing flow is presented in Fig. 1. The algorithm works on a detection window of 64 × 128 pixels. After pre-processing, the detection window is divided into multiple cells with a size of 8 × 8 pixels. A cell histogram is generated based on the gradient of each pixel, where a gradient consists of an angle and a magnitude. The magnitudes of the gradients in each cell are accumulated into nine bins, selected by their angles, to construct the cell histogram. Four adjacent cells (2 × 2) are grouped into one block. Cell histograms are normalized based on block data.

Fig. 1. HOG-SVM algorithm for human detection [9]. (Stages: pre-processing at image level; gradient vector generation and cell histogram generation at cell level, with 9 bins per 8 × 8-pixel cell; histogram normalization at block level, with 4 cells per block; SVM classification at window level on the 3780 HOG features of a 64 × 128-pixel detection window.)

One of the most complicated parts of implementing the cell histogram is the gradient computation, which uses inverse tangent, square and square root as in equations (1) and (2):

$$m(x, y) = \sqrt{f_x(x, y)^2 + f_y(x, y)^2} \tag{1}$$

$$\theta(x, y) = \arctan\frac{f_y(x, y)}{f_x(x, y)} \tag{2}$$

where $f_x(x, y)$ and $f_y(x, y)$ are the horizontal and vertical gradient components, and $m(x, y)$ and $\theta(x, y)$ are the magnitude and angle, respectively.

Block normalization is done for four adjacent cells. L2 normalization is applied to the 36 bins to create the block histogram as described in (3):

$$v_i^n = \frac{v_i}{\sqrt{\lVert v \rVert_2^2 + \epsilon^2}} \tag{3}$$

where $v_i$ is a bin in a block histogram and $\epsilon$ is a small constant used to avoid division by zero. Finally, an SVM classifier is used to decide the presence of a person in a detection window using the block histograms.

B. Related works

Hardware implementation of the complicated arithmetic functions used in calculating HOG features requires a large area and high power consumption. Many researchers have tried to optimize HOG generation for this purpose. For example, the inverse tangent can be calculated using CORDIC as proposed in [4], [5]. Other works [6]–[8] avoid the inverse tangent by using an approximation function and converting the calculation into a comparison of angles. Ho et al. [6] show a very fast and low-cost approximation method to calculate the angles and the corresponding magnitudes with an area of 3.5 kGEs. For block normalization, the L2-norm function can be avoided by using the L1-norm [9], which is much simpler than the L2-norm, as described in (4):

$$v_i^n = \frac{v_i}{\lVert v \rVert_1 + \epsilon} \tag{4}$$

For L2-norm implementation, Chen et al. [7] use the fast inverse square root based on the IEEE-754 number format. The square root function can also be implemented using the Newton-Raphson method [5]. In general, these approximations help reduce hardware area and increase throughput, but lower the accuracy of the generated HOG features.

In addition, throughput has been increased by reusing cell histograms that overlap in multiple detection windows. Works such as Mizuno et al. [5] and Takagi et al. [4] use data-reuse methods to improve the throughput of the system.

Finally, parallelism can be used to push the throughput further. For example, Takagi et al. used two HOG-SVM cores [4]. Suleiman et al. [9] utilized HOG-SVM detectors for multi-scale support. Their work also uses multiple MAC modules in parallel for SVM to improve the throughput. They also presented pre-processing using a gradient function to reduce the bit width before calculating the histogram.

The aforementioned works have investigated various aspects of hardware implementations of HOG-SVM. However, HOG feature generation and SVM classification are optimized separately. In this work, we propose a novel architecture with the co-optimization of HOG feature generation and SVM classification. After the block histogram is generated, SVM classification is executed directly on unnormalized data and in parallel with the L2-norm function. Then, the SVM results are divided by the normalization coefficients. The hardware area is minimized for L2-norm and SVM by using SMAC. The proposed architecture is presented in the next section.

III. PROPOSED HARDWARE ARCHITECTURE

As stated in Section II, HOG-SVM contains many complicated arithmetic operations in cell histogram generation and block histogram normalization. Conventionally, SVM classification is performed on the normalized block data. To increase the throughput of HOG-SVM for human detection, we propose a novel hardware architecture which uses the co-optimization of HOG feature normalization and SVM classification.

The proposed architecture is summarized in Fig. 2. The pixels of each window are stored in an SRAM. In each clock cycle, several pixels are read to generate gradient vectors. These vectors are accumulated to create one cell histogram at a fixed interval of clock cycles. The cell histogram buffer is used to form the block histograms. In our architecture, SVM and block normalization are executed in parallel. Block data containing the 36 histogram values of four cells are sent to both SVM and L2-norm. The SVM results are normalized by the square root values. The final results are accumulated into the window accumulation registers to decide whether there is a person in the search window.

Fig. 2. Proposed block diagram of object detection. (Blocks: input image, window SRAM (64 × 128), parallel gradient generators producing Dx and Dy, bin accumulator, cell histogram buffer, sequential MACs 0–6 for SVM with SVM SRAM, block normalization with square-and-accumulate (SAC) units, SQRT units and SQRT buffer, divider, window accumulation register, and pipeline registers between stages.)

A. Cell Histogram Generation

The first step in HOG feature generation is to calculate the angles and the magnitudes of the gradient vectors. In this work, we use the method in [6] to generate the gradient vectors. This low-cost and high-throughput method enables the parallel processing of multiple pixels in one clock cycle. The gradient vectors are then used by the bin accumulator to generate cell histograms.

Our architecture generates cell histograms at a fixed interval of clock cycles. The outputs are then stored in the cell histogram buffer. The cell histogram buffer is a circular buffer which stores the 128 cells of a window for reuse. It provides a block histogram for SVM calculation.

B. Parallel computation of SVM and block normalization

In conventional HOG-SVM classification, SVM is done on the normalized HOG features of the search window as described in equation (5):

$$D = \sum_{i=1}^{n} \left( \omega_i \times v_i^n \right) + b \tag{5}$$

where $\omega_i$ are the weights obtained from the training process, $v_i^n$ are the normalized HOG features, and $b$ is the bias.

However, the normalization process, especially with the L2-norm (equation (3)), is time-consuming because it needs to collect all the information of a block and requires square and square root operations. In this work, we propose to parallelize SVM and HOG block normalization. SVM is performed on unnormalized data and the result is then divided by the block normalization coefficient. Equation (6) is the new SVM equation on unnormalized block data. By using this equation, the normalization process can be done in parallel with SVM classification.

$$D = \frac{\sum_{i=1}^{n} \omega_i \times v_i}{\sqrt{\lVert v \rVert_2^2 + \epsilon^2}} + b \tag{6}$$

To optimize the hardware area and speed further, we propose to use SMAC to calculate both the SVM and the sum of squared values (SAC). The SMAC with highly parallel inputs enables the computation of the SAC of a cell histogram in twelve cycles, while the SVM performed on a block takes a fixed number of cycles. The details of this MAC architecture are presented in the next subsection.

C. SMAC architecture for SAC and SVM

Instead of using a normal MAC module, this work uses SMACs to reduce the hardware cost and to increase the operating frequency. The architecture of the SMAC is described in Fig. 3. Our design is based on bit-serial multipliers and a parallel accumulator, which have been used in convolutional neural networks [10]. The number of multiplicand pairs can be changed at design time. This architecture loops through each bit of $w_i$ using a shift register. If the current bit is '1', the second multiplicand ($A_i$) is shifted, then accumulated into the result. An adder tree is used to add all $A_i$ whose current bit of $w_i$ is equal to '1'. A barrel shifter shifts the final result instead of each $A_i$ separately to save hardware area. The throughput of the SMAC depends on the number of input pairs and the number of bits used to represent $w_i$. The SVM modules use a SMAC with 36 input pairs, which correspond to a block histogram. To calculate the SAC, both inputs of each pair are given the same value.

Fig. 3. The architecture of the Sequential MAC (SMAC). (Inputs: p pairs of n-bit $w_i$ and m-bit $A_i$; the $w_i$ are held in shift registers, the selected $A_i$ are summed by an adder tree, and the sum of products passes through a barrel shifter into an accumulate register of m + n + log(p) bits; a bit counter raises the done signal.)

D. Data reuse and pipeline

For high-speed designs, data reuse is an important factor. In this work, we reuse the generated cell histograms by storing 128 × 9 bin values in the cell histogram buffer. After the generation of the first ten cell histograms of the first window, block histograms are generated at a fixed interval of clock cycles. The square-root values for block normalization are reused from the SQRT buffer. At the frame level, after the first window is processed, the second window is processed by calculating only the cell histograms of the non-overlapped cells. When the window reaches the frame boundary, it is moved down and then moves to the left. This means that SVM has to process the overlapped windows again. To solve this problem, we use additional SMACs to calculate the SVM of the overlapped area, and the normalization is done by reusing the SQRT values in the SQRT buffers and the pipelined divider.

In our proposed architecture, data processing is performed at different levels. For example, the cell histogram generation processes several gradient vectors per clock cycle, and a cell histogram is generated at a fixed interval of clock cycles. In contrast, SAC works on a cell histogram and needs 12 clock cycles to finish. Therefore, to increase the throughput and the data utilization of the design, we double the units where necessary. For instance, the SAC and SQRT modules are doubled because a single unit cannot process a cell histogram and the square root of a block SAC fast enough to keep pace with cell histogram generation. The two units work alternately to meet the timing of the system. The timing and activation of each unit are described in Fig. 4.

Fig. 4. Data pipeline of the proposed architecture for the first window. (Rows: bin accumulation per cell, alternating SAC units at 12 cycles per cell, alternating SQRT units per block, block SAC sum, SVM per block, and division per block.)

With the proposed method, our design needs 1.12K cycles to compute HOG-SVM for the first window. For other windows, the data reuse scheme is activated. In the worst case, only 15 new blocks need HOG feature generation. In this case, the first SVM SMAC is assigned to these blocks. The other SVM SMACs are used to recalculate the SVM classification for the overlapped blocks. With data reuse and the data pipeline, a new window takes 128 cycles in total. For a Full-HD image, our design requires about 3.58M cycles to finish, which leads to a peak throughput of 139fps at the frequency of 500MHz.

IV. HARDWARE IMPLEMENTATION RESULTS

The proposed architecture has been implemented in Matlab with the training and test dataset from [2]. The proposed hardware architecture has been modeled in VHDL, then simulated and synthesized using Synopsys VCS and Design Compiler. The hardware model has minor differences in accuracy when compared with the software model in Matlab using the INRIA person dataset. The final RTL has been synthesized by Design Compiler using the TSMC 65nm standard cell library and an SRAM model from ARM.

The implementation results are summarized in Table I. With our optimizations, this design can run at 500MHz with a hardware area of 145kGEs. At this frequency, the proposed design can process Full-HD images at 139fps. The total size of the SRAMs in this work is about 242kb. Our design achieves the highest operating frequency and the highest frame rate when compared with the previous works in [4], [8] and [9]. Our design also uses less SRAM for data reuse.

TABLE I
THE COMPARISON OF HARDWARE EFFICIENCIES

              [4]        [8]            [9]         This work
Technology    65nm       65nm SOTB      45nm SOI    65nm
Gate count    502kGEs    500kGEs*       490kGEs     145kGEs
Memory        1.22Mbit   0.602Mbit      0.538Mbit   0.242Mbit
Frequency     42.9MHz    200MHz         270MHz      500MHz
Frame rate    30fps      30fps          60fps       139fps
Resolution    Full HD    1024 × 1616    Full HD     Full HD

* The HOG core area is calculated from the original paper, to the best of our knowledge.

V. CONCLUSIONS

Human detection has a wide range of applications such as robotics, automobiles, and video surveillance. One of the efficient methods to perform human detection is HOG-SVM. However, the complexity and data dependency of HOG-SVM limit its throughput in hardware implementation. In this paper, we proposed a novel hardware architecture with the co-optimization of HOG normalization and SVM classification. Along with the data reuse strategy, the proposed hardware architecture can run at a maximum frequency of 500MHz in TSMC 65nm technology with a throughput of 139fps for Full-HD resolution. This hardware design can be used for very-high-speed human detection systems.

ACKNOWLEDGEMENT

This research is partly supported by the Ministry of Science and Technology (MoST) of Vietnam under grant number 28/2018/TL.CN-CNC.

REFERENCES

[1] A. Suleiman, Y.-H. Chen, J. Emer, and V. Sze, "Towards closing the energy gap between HOG and CNN features for embedded vision," in ISCAS, May 2017, pp. 1–4.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE-CVPR, vol. 1, June 2005, pp. 886–893.
[3] M. Hiromoto and R. Miyamoto, "Hardware architecture for high-accuracy real-time pedestrian detection with CoHOG features," in IEEE-ICCV Workshops, Sep. 2009, pp. 894–899.
[4] K. Takagi, K. Mizuno, S. Izumi, H. Kawaguchi, and M. Yoshimoto, "A sub-100-milliwatt dual-core HOG accelerator VLSI for real-time multiple object detection," in IEEE-ICASSP, May 2013, pp. 2533–2537.
[5] K. Mizuno, Y. Terachi, K. Takagi, S. Izumi, H. Kawaguchi, and M. Yoshimoto, "Architectural study of HOG feature extraction processor for real-time object detection," in IEEE SiPS, 2012, pp. 197–202.
[6] H.-H. Ho, N.-S. Nguyen, D.-H. Bui, and X.-T. Tran, "Accurate and low complex cell histogram generation by bypass the gradient of pixel computation," in IEEE-NICS, Nov. 2017, pp. 201–206.
[7] P.-Y. Chen, C.-C. Huang, C.-Y. Lien, and Y.-H. Tsai, "An efficient hardware implementation of HOG feature extraction for human detection," IEEE-TITS, vol. 15, no. 2, pp. 656–662, April 2014.
[8] F. An, X. Zhang, A. Luo, L. Chen, and H. J. Mattausch, "A hardware architecture for cell-based feature-extraction and classification using dual-feature space," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 3086–3098, Oct. 2018.
[9] A. Suleiman and V. Sze, "Energy-efficient HOG-based object detection at 1080HD 60 fps with multi-scale support," in IEEE-SiPS, Oct. 2014, pp. 1–6.
[10] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in IEEE/ACM MICRO, Oct. 2016, pp. 1–12.
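The reordering behind Equation (6) in Section III-B can be checked numerically. The following is a minimal software sketch, not the hardware datapath: it assumes floating-point arithmetic, an illustrative value for epsilon, and hypothetical function names. It compares the conventional order (normalize each bin, then multiply-accumulate) with the proposed order (multiply-accumulate on raw bins, then one division per block).

```python
import math

EPS = 1e-3  # assumed value of the small constant epsilon; the paper does not give one


def l2_coeff(block, eps=EPS):
    """Block normalization coefficient sqrt(||v||_2^2 + eps^2), as in Eq. (3)."""
    return math.sqrt(sum(v * v for v in block) + eps * eps)


def svm_normalize_first(blocks, weights, bias, eps=EPS):
    """Conventional order, Eq. (5): normalize every bin, then multiply-accumulate."""
    total = 0.0
    for block, w in zip(blocks, weights):
        coeff = l2_coeff(block, eps)
        total += sum(wi * (vi / coeff) for wi, vi in zip(w, block))
    return total + bias


def svm_divide_last(blocks, weights, bias, eps=EPS):
    """Reordered, Eq. (6): dot product on raw bins, then one division per block.

    The dot product and the coefficient do not depend on each other, which is
    what allows the two to be computed in parallel.
    """
    total = 0.0
    for block, w in zip(blocks, weights):
        partial = sum(wi * vi for wi, vi in zip(w, block))  # SVM on unnormalized data
        coeff = l2_coeff(block, eps)                        # normalization path
        total += partial / coeff
    return total + bias
```

Both functions return the same decision value up to floating-point rounding, because the normalization coefficient is constant within a block and can be factored out of the sum. A fixed-point hardware implementation may still differ slightly from a software model because of quantization.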
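The bit-serial behaviour of the SMAC described in Section III-C can be modelled in a few lines of software. This is a hedged behavioural sketch only: it assumes unsigned integer weights (the handling of signed weights is not described in the text), and the names `smac` and `sac` are hypothetical. Each loop iteration stands for one cycle: the adder tree sums the inputs whose current weight bit is '1', and a single shift is applied to that sum.

```python
def smac(weights, inputs, n_bits):
    """Behavioural model of a sequential MAC over unsigned n_bits-wide weights.

    Assumption: weights are non-negative integers below 2**n_bits.
    """
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    if any(w < 0 or w >= (1 << n_bits) for w in weights):
        raise ValueError("weights must fit in n_bits unsigned bits")

    acc = 0
    for bit in range(n_bits):  # one iteration per weight bit
        # Adder tree: add every input whose current weight bit is 1.
        tree_sum = sum(a for w, a in zip(weights, inputs) if (w >> bit) & 1)
        # One shift of the tree output instead of shifting each input.
        acc += tree_sum << bit
    return acc


def sac(values, n_bits):
    """Sum of squared values: feed the same value to both inputs of each pair."""
    return smac(values, values, n_bits)
```

For example, `smac([3, 5, 0, 7], [10, 20, 30, 40], 3)` gives the same result as the ordinary dot product of the two lists, and the loop count equals the weight bit width, which matches the statement that the SMAC throughput depends on the number of bits used to represent the weights.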
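The cycle count and frame rate quoted for Full-HD can be reproduced with simple arithmetic. The sketch below rests on assumptions that are not stated explicitly in the text: a single detection scale, a window stride of 8 pixels (one cell) in both directions, and 128 cycles charged to every window, ignoring the longer start-up of the first window. The function names are hypothetical.

```python
def windows_per_frame(width, height, win_w=64, win_h=128, stride=8):
    """Number of detection windows, assuming a sliding window with the given stride."""
    cols = (width - win_w) // stride + 1
    rows = (height - win_h) // stride + 1
    return cols * rows


def cycles_per_frame(width, height, cycles_per_window=128):
    """Total cycles, assuming every window costs the reuse-case 128 cycles."""
    return windows_per_frame(width, height) * cycles_per_window


def frames_per_second(width, height, clock_hz=500e6):
    """Peak frame rate at the given clock frequency."""
    return clock_hz / cycles_per_frame(width, height)


if __name__ == "__main__":
    print(windows_per_frame(1920, 1080))
    print(cycles_per_frame(1920, 1080))
    print(frames_per_second(1920, 1080))
```

Under these assumptions a 1920 × 1080 frame contains 233 × 120 = 27,960 windows, or 3,578,880 cycles, which agrees with the reported figure of about 3.58M cycles and a peak throughput of 139fps at 500MHz.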