INTRODUCTION
Overview
Image processing involves performing specific operations on images to generate a new dataset based on input features. Its primary goal is to capture image characteristics, such as edge detection and segmentation, or to enhance image quality through improved brightness and resolution. As a rapidly evolving technology, image processing enables us to obtain precise image data for various applications. This thesis introduces an edge detection method, where changes in amplitude within an image reveal important subject information. Local discontinuities in brightness are identified as edges, while global discontinuities are referred to as boundaries. By isolating these boundaries from the background, we can reduce unnecessary data storage and enhance the efficiency of subsequent image processing steps.
The rise of computer vision has significantly advanced image processing, with applications gaining traction in fields like telecommunications, automatic control, intelligent transportation, and biomedical engineering. Many developers rely on software to implement image processing algorithms (such as the Gaussian filter, Fourier transform, edge detection, and wavelet equations) using open libraries like OpenCV, Scikit-image, Pillow, NumPy, and Mahotas, primarily to verify and simulate algorithm efficiency. However, this software-based approach is often inadequate for real-time image processing due to challenges like sequential execution and limited operational memory, which hinder the speed of reading and writing image frames.
The implementation of image processing algorithms is significantly enhanced by hardware, as it allows for the parallel execution of subprograms within the main program. Hardware accelerators, such as field-programmable gate arrays (FPGAs), play a crucial role in reducing computational time and boosting processing speed. The choice of hardware platform is determined by the specific algorithm in use, with various tools available for implementing sequential algorithms, including Xilinx's Integrated Synthesis Environment (ISE), PlanAhead, and the Software Development Kit (SDK). In this thesis, the Vivado Design Suite, which targets 7-series FPGA families such as Virtex-7, Kintex-7, Artix-7, and ZYNQ, was utilized, together with Vivado HLS for C-based development. The synergy of software coding and hardware execution, known as FPGA co-design, results in a powerful image processing system.
Related work
This thesis argues that co-design enhances the efficiency of image processing algorithms compared to traditional software execution. Previous projects, particularly in edge detection, support this claim. For instance, the authors in [3] developed a License Plate Automatic Recognition model employing a template matching algorithm, utilizing the Canny filter for edge detection to minimize data volume. Implemented in MATLAB, this system achieved an accuracy of 84.28% during testing.
The automatic insect detection system in [4], developed to assist farmers in enhancing their yields, effectively processes 70 images and operates reliably in low-resolution and low-contrast conditions, as well as in various weather scenarios like murky, sunny, or rainy environments. Utilizing the Sobel filter for image segmentation, the system improves the identification of insects' geometric shapes, which can vary from circles to ovals and rectangles. Implemented on MATLAB 2015b, this approach has yielded impressive results.
The challenges in optimizing algorithm performance in the aforementioned projects stem from their sequential software execution, leading to high latency and unstable system accuracy under heavy workloads. To explore alternatives, various image processing projects have been developed on hardware. In 2007, Abbasi [5] introduced a Sobel filter model on the FPGA platform, which ultimately proved ineffective due to its low complexity and prolonged execution time. Subsequently, Halder and his team [6] proposed a similar model with a more compact architecture and improved execution speed; however, it still suffered from excessive pixel storage and inefficient use of system resources.
The integration of software and hardware platforms significantly enhances the development and optimization of image processing algorithms using hardware description languages like Verilog, VHDL, or C/C++ in High-Level Synthesis (HLS). This approach ensures high performance and low power consumption while conserving system resources. Consequently, many recent image processing projects are now focusing on implementing solutions that leverage this powerful co-design system.
Objective
This thesis aims to develop an edge detection system utilizing the Sobel Filter, first on a software platform and subsequently on a co-design platform. The project will analyze the results to demonstrate the advantages of implementing an image processing algorithm within a co-design framework. Additionally, it will propose enhancements to address identified shortcomings.
Work content
From the content of the objective, we can make a list of all the things to do below:
➢ Study on the image, image processing, Sobel algorithm, an overview of FPGA, HLS, IP, and support toolchains such as Vivado, Vivado HLS, and ZYNQ Board
➢ Implement Sobel Filter on OpenCV - Python, summarize all the results: output image, values of MSE, and PSNR
➢ Create IP core of Sobel Filter in C++ language using Vivado HLS and OpenCV libraries, simulate the test program to validate the Sobel operation
➢ Generate the block design of the Sobel Filter using Vivado and implement it on a ZYNQ Board, then summarize all the results: output image, values of MSE and PSNR, throughput, and power consumption
➢ Evaluate and compare the result indicators on software and co-design platforms Conclude and propose future work.
Outline
This first chapter gives an overview of the entire project, the related work, the objective, and the work content.
Chapter 2 includes the foundational knowledge about image definition, image processing algorithms, the FPGA and SoC ZYNQ-7000 platform, standard AXI protocols, and the support toolchains.
Chapter 3 outlines the comprehensive process of system development, focusing on the creation of the Sobel IP core and the design of the block diagram, as detailed in the manuals provided by Xilinx and Digilent, for both the software and co-design platforms.
Chapter 4 presents the output results of both the software and co-design approaches, detailing the resource indicators utilized in the project, including the number of logic gates, FIFOs, LUTs, and BRAMs. It also compares the processing speed and PSNR across the two platforms.
Eventually, Chapter 5 presents the overall conclusions about this project and proposes future work.
LITERATURE REVIEW
An overview of image
An image serves as a visual representation of people, animals, or objects in our environment; in the realm of information technology, it refers to a picture stored electronically. Images play a vital role in our lives, enhancing our perception of the world and facilitating communication and data storage. They can be categorized by two main definitions: image type and image format. Various types of images include photographs, illustrations, drawings, and graphics, while common image formats consist of JPEG, GIF, PNG, SVG, and TIFF.
A pixel is defined as an element located at the (x, y) coordinates of a digital image, possessing a specific grayscale or color value. Consequently, a digital image can be viewed as a matrix composed of numerous pixels. Digital images are widely used across various electronic devices, including phones, cameras, and computers. In academic terms, a digital image is represented by a two-dimensional function z = f(x, y), where x and y denote spatial coordinates, and z indicates the amplitude of f at (x, y), which corresponds to the intensity or gray level of the image at those coordinates.
Color space is a mathematical model that numerically represents actual colors, with RGB and YUV being the most commonly used in digital image processing. Other notable color spaces include CIE LAB, HSV, and CMYK.
The RGB color space is a fundamental model used in computers and digital devices, based on three primary colors: red (R), green (G), and blue (B). Each pixel in this color space is defined by three parameters (r, g, b), with values ranging from 0 to 255, following the 24 bits per pixel (bpp) standard. This 24-bit model allows for a vast spectrum of colors, totaling 2^24 = 16,777,216 possible combinations.
The YUV color space is defined by the YUV color model, comprising one luminance value and two chrominance values. It is primarily utilized in the PAL standard television broadcasting system, which is widely adopted in many countries. This color space is designed to create a color range that aligns more closely with human perception than the RGB model commonly used in computer graphics. In YUV, the Y value represents luminance, allowing for more efficient color representation in broadcasting.
In image processing, the luminance component is represented by Y, while the color components are denoted by U and V. The values of red (R), green (G), and blue (B) are combined using specific weights to generate the luminance signal Y for each pixel. U is then derived by subtracting Y from the blue component (B) and applying a scaling factor, and the V signal is calculated by subtracting Y from the red component (R) and multiplying by another pre-defined scaling factor. The following formulas are used to determine Y, U, and V from R, G, and B [8]:
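$$Y = 0.299R + 0.587G + 0.114B$$

$$U = 0.492(B - Y), \qquad V = 0.877(R - Y)$$

(the standard analog YUV relations; the U and V scaling factors shown are the commonly cited values, consistent with the BT.601 luminance weights used in section 2.2.2)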
Based on the values used to represent pixels, we commonly distinguish three main types of images:
➢ Color image: each pixel is represented by three values corresponding to red, green, and blue (RGB); the brightness level of each color can vary, with each value ranging from 0 to 255
➢ Gray image: each pixel has an intensity that varies from black to white and ranges from 0 to 255
➢ Binary image: consists solely of two colors, black and white, with pixel values represented as either 0 or 1; in digital processing, a grayscale image is often utilized to depict binary images, where the two values correspond to 0 for black and 255 for white
Image processing
Image processing plays a crucial role in computer vision by executing mathematical operations on input images to produce outputs that meet system requirements. The fundamental steps involved in image processing are illustrated in Figure 2.1.
Figure 2.1 Basic steps in an image processing process
Step 1: Image fetching. Images are collected by color (or black and white) cameras that integrate an image sensor. The output image in this step is analog.
Step 2: Pre-processing. After step 1, the output image may contain noise or low contrast due to the quality of the sensor and the environmental conditions. We need this step to improve the image quality; the main functions of pre-processing are to filter out noise and to increase or decrease contrast.
Step 3: Image segmentation. This step separates the input image into different regions for image representation, analysis, and recognition.
Step 4: Image representation. All the regions after the segmentation step are sequences of pixels, so we need this step because we are interested in the internal characteristics of the image, such as curves and shapes. In brief, image representation means transforming the available image data into an appropriate and necessary format for processing by computer.
Step 5: Identification and interpretation. This step classifies objects based on their descriptive details by comparing the image with a pre-stored template.
The image processing steps outlined must be executed under supervision, emphasizing the importance of foundational knowledge in the field.
To effectively apply an algorithm to an image, it is essential to consider not just the target pixel but also its surrounding pixels, commonly referred to as the "neighborhood." There are three primary types of neighboring pixel configurations: 4-neighborhood, 8-neighborhood, and cross-neighborhood, as illustrated in Figure 2.2.
Figure 2.2 (a) 4-neighborhood, (b) 8-neighborhood, and (c) cross-neighborhood
2.2.2 Conversion of image format
In image processing, a key transformation method involves converting between color, binary, and multi-grayscale images to achieve the desired format for specific applications. This thesis focuses on utilizing grayscale or binary images as input to minimize the data processing requirements.
Monochromatic images, commonly known as grayscale images, can be represented using different numbers of gray levels. For instance, an 8 gray-level image assigns pixel values ranging from 0 to 7, while a 256 gray-level image uses values from 0 to 255. Several methods, including the mean and weighted approaches, are commonly used to convert RGB images into grayscale images.
The mean method takes the mean value of R, G, and B as the grayscale value [8]:
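$$\text{Gray} = \frac{R + G + B}{3}$$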
This straightforward method may not perform as anticipated due to the varying levels of human perception of the three primary colors: red, green, and blue (RGB). To enhance effectiveness, it is essential to assign different weights to these colors, leading to the development of the weighted method.
The weighted method employs gamma correction by adjusting the distribution of the red, green, and blue colors based on their wavelengths. Among various conversion standards, the ITU-R BT.601 standard is selected, which defines specific weights for these colors.
The weight distribution of the colors in the human visual system is K_R = 0.299, K_G = 0.587, and K_B = 0.114. This distribution reflects the sensitivity of the cone cells in human eyes, which are most responsive to green light, followed by red, with blue being the least sensitive. Therefore, at one RGB pixel, we have this formula to calculate the light intensity of a gray image [10]:
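$$Y = K_R R + K_G G + K_B B = 0.299R + 0.587G + 0.114B$$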
We can also see that these coefficients satisfy 0.299 + 0.587 + 0.114 = 1. Figure 2.3 is an example of RGB to grayscale image conversion.
Figure 2.3 (a) color image and (b) grayscale image
2.2.3 Convolution operation
To apply a filter to an image, convolution is performed between the original image matrix and the filter matrix. This technique is crucial in image processing and is widely used in various applications, including calculating image derivatives, smoothing images, and detecting edges.
In mathematics, convolution is a linear operation that combines two functions, f and g, to produce a new function. This process is commonly used in image processing, where the convolution of a filter matrix with an original image matrix yields, for example, a denoised or blurred image. Specifically, for an image f(x, y) and a filter k(x, y), the convolution can be expressed as follows:
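$$g(x, y) = (f * k)(x, y) = \sum_{i}\sum_{j} f(x - i,\, y - j)\,k(i, j)$$

(the standard discrete form, with the sums running over the kernel coordinates)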
In convolution, the filter, or kernel matrix, is a crucial element, with its anchor point typically positioned at the center. This anchor point defines the specific area of the matrix used for the convolution. Each value in the kernel matrix serves as a coefficient applied to the gray level of the pixel it covers within the image area. The convolution process involves sliding the kernel matrix across the original image, starting from the upper left corner, and aligning the anchor point with each pixel. As the kernel moves, it computes a new value for the current pixel, replacing the previous one with each shift.
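As an illustration, the following C++ sketch (a minimal reference implementation written for this text, not the thesis code; all names are our own) performs this sliding-window operation on a row-major grayscale image, clamping coordinates at the border. Note that it applies the kernel directly (cross-correlation), which coincides with convolution once the kernel is flipped:

#include <algorithm>
#include <cstdint>
#include <vector>

// Apply a square kernel (anchor at its center) to a width x height
// grayscale image stored in row-major order.
std::vector<uint8_t> apply_kernel(const std::vector<uint8_t>& img,
                                  int width, int height,
                                  const std::vector<float>& kernel, int ksize) {
    std::vector<uint8_t> out(img.size());
    const int r = ksize / 2;  // offset of the anchor from the kernel border
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int j = -r; j <= r; ++j) {
                for (int i = -r; i <= r; ++i) {
                    // Clamp neighbor coordinates at the image border
                    const int yy = std::min(std::max(y + j, 0), height - 1);
                    const int xx = std::min(std::max(x + i, 0), width - 1);
                    acc += img[yy * width + xx] * kernel[(j + r) * ksize + (i + r)];
                }
            }
            // Saturate the accumulated value to the 8-bit range
            out[y * width + x] =
                static_cast<uint8_t>(std::min(std::max(acc, 0.0f), 255.0f));
        }
    }
    return out;
}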
2.2.4 Thresholding
To extract bright-spot objects from a dark background in an image, we utilize the intensity histogram of the function f(x, y). By selecting a threshold value T, we can generate different binary images, with each threshold revealing a distinct object-background separation.
Any point (x, y) in the image with a value f(x, y) > T is then marked as a bright-spot object; otherwise, it belongs to the dark background. In other words, the segmented image g(x, y) is given by [7]:
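$$g(x, y) = \begin{cases} 1, & f(x, y) > T \\ 0, & f(x, y) \le T \end{cases}$$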
In summary, applying this thresholding method to a grayscale image results in a binary image. Figure 2.5 is an example.
Figure 2.5 (a) grayscale image and (b) binary image
2.2.5 Edge detection algorithm
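Although the design chapters treat the Sobel Filter in detail, its standard formulation is worth recalling here: the grayscale image f is convolved with two 3×3 kernels that approximate the horizontal and vertical derivatives, and the two responses are combined into a gradient magnitude that is then thresholded to mark edge pixels:

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * f, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * f, \qquad |G| = \sqrt{G_x^2 + G_y^2} \approx |G_x| + |G_y|$$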
Developing toolchain
This section provides an overview of FPGAs and their role in image processing systems through the use of IP cores. It introduces the Vivado HLS support tool for developing soft IP cores and highlights the key HLS pragmas that optimize algorithms. Additionally, the section presents the Python tool used for implementing the Sobel Filter.
Field-Programmable Gate Arrays (FPGAs) are microchips featuring an array of programmable logic blocks, with a history spanning over 35 years since their introduction by Ross Freeman, co-founder of Xilinx, in 1984. FPGAs can accommodate from hundreds of thousands to billions of logic gates, enhancing programming versatility. These chips allow users to reconfigure them for specific functions; however, compared with Application-Specific Integrated Circuit (ASIC) technology, FPGAs have limitations in optimizing designs and handling complex tasks. They require approximately 30 times more hardware area than ASICs, operate 3 times slower, and consume 10 times more power.
Despite some drawbacks, FPGAs offer significant advantages, including reduced manufacturing costs and the ability to be reconfigured after silicon fabrication. This flexibility allows various functionalities to be implemented on the FPGA chip, making it particularly well-suited for research and development purposes compared to ASICs.
FPGAs excel at executing parallel operations, making them ideal for demonstrating the capabilities of co-design systems. Typically, FPGA chips are configured using Hardware Description Languages (HDLs) like VHDL or Verilog; however, High-Level Synthesis (HLS) tools allow soft IP cores to be created using C/C++ or SystemC. The following sections delve deeper into HLS and its applications.
Before discussing the ZYNQ Board in section 2.4, we mention some fundamental components of FPGAs, which are the main building blocks used in the Xilinx 7-series FPGAs.
The DSP48E1 components are specialized Digital Signal Processing (DSP) slices designed for various arithmetic operations, including multiplication and addition, as well as logic functions such as NOT, AND, and OR. These DSP slices can be cascaded to implement complex functions, making them essential for real-time systems where their performance can be a critical bottleneck.
Block Random Access Memory (BRAM) is essential in FPGA designs that handle large data volumes, offering customizable widths and depths of 4, 8, 16, or 32 Kb. The amount of BRAM available increases with the size and cost of the FPGA, making it suitable for a wide range of applications.
FPGAs require significant data processing capabilities, necessitating components to store and define input and output values. Instead of relying on multiple logic gates to create logic tables, a single Look-Up Table (LUT) can be utilized, functioning similarly to a small RAM.
Flip-flops (FFs) are binary registers that maintain a logical state between clock cycles, holding one of two states: 0 or 1. In FPGAs, flip-flops serve as vital components for both synchronous and asynchronous data storage. Each slice of the Xilinx 7 series has 8 flip-flops, 4 for synchronous storage and 4 for asynchronous storage. When the 4 synchronous flip-flops are configured, the 4 others are unusable, and vice versa [17].
Modern FPGA applications often involve complex system designs, but they benefit from numerous standard blocks available on the circuit board. By utilizing Intellectual Property (IP) cores, developers can significantly reduce the development time of FPGA systems. IP cores are reusable design blocks that facilitate fundamental operations and can be integrated by any vendor into their chip designs. Numerous IP cores are available, built around standardized protocol components such as Ethernet, SPI, DRAM controllers, PLLs, and DDR cores.
IP cores are categorized into three types: hard, firm, and soft. Hard IP cores are pre-designed silicon layouts that offer high performance and power efficiency but lack customization options. Firm IP cores, while also resembling a final layout with fixed logic blocks, allow for some configurability across different applications. In contrast, soft IP cores are the most adaptable, created in a Hardware Description Language (HDL) and existing as gate-level netlists, enabling users to tailor them for specific FPGA designs. While utilizing IP cores can simplify design processes, a notable drawback is that purchasing multiple IP cores increases the cost.
2.3.3 Vivado and High-Level Synthesis
2.3.3.1 An introduction to High-Level Synthesis
While hard IP cores and reconfigurable firm IP cores are well known in the field, many users lack expertise in HDL, the specialized computer language used to create soft IP cores. To address this gap, the Vivado HLS tool has emerged as a valuable resource, simplifying the construction process for amateur users.
In Vivado HLS, IP cores are synthesized from C/C++ down to the Register-Transfer Level (RTL), simplifying FPGA project implementation for beginners and minimizing the required coding effort. Vivado HLS provides HLS pragmas, which serve four primary functions: configuring hardware functions, optimizing loops, setting up hardware interfaces, and managing hardware memory implementation. It is crucial to place pragmas correctly, before loop blocks, at the start of function definitions, or before variable declarations, to avoid errors. Utilizing HLS pragmas in top-function code reduces latency, improves throughput, and decreases area and device resource utilization.
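As a minimal sketch (the function below is hypothetical, not the thesis's top function), interface pragmas sit at the start of the function body while loop-optimization pragmas sit just inside the loop they act on:

#define N 1024

// Hypothetical Vivado HLS top function illustrating pragma placement.
void scale_pixels(const unsigned char in[N], unsigned char out[N]) {
    // Hardware-interface pragmas: placed at the start of the function body.
#pragma HLS INTERFACE ap_fifo port=in
#pragma HLS INTERFACE ap_fifo port=out

    // Loop-optimization pragma: placed just inside the loop it applies to.
scale_loop:
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1  // accept one new pixel every clock cycle
        out[i] = in[i] >> 1;  // halve each pixel's intensity
    }
}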
Vivado HLS also provides an HLS Test Bench, which is essential for validating hardware implementations by simulating their inputs and outputs. In this thesis, we utilize the HLS Test Bench to verify the performance of the Sobel IP core, ensuring its proper functionality before hardware execution. However, it is important to acknowledge the limitations of HLS tools: Vivado HLS struggles with reading and writing large data volumes through physical I/O pins due to the limited on-chip memory, and instead works by transferring data among the IP cores; if additional external memory space is needed, HLS uses the malloc function from the C/C++ programming language library.
Released in April 2012, the Vivado Design Suite primarily serves as a synthesis and analysis tool for HDL designs. A key feature of Vivado is its integration with High-Level Synthesis (HLS), which eliminates the need for manual RTL creation. After the IP core is developed, it is synthesized into RTL and packaged.
Vivado can access the IP repository to retrieve essential information about the various IP cores. The subsequent step involves connecting these related IP cores to create a block diagram that outlines the system's primary architecture. Finally, Vivado offers a validation tool for the block diagram, allowing users to synthesize, implement, and generate the project's bitstream file, which configures the Programmable Logic (PL) in an FPGA co-design system. For further information on the PL and PS definitions, please refer to section 2.4.1.
ZYNQ-7000 platform
The advancement of technology has led to the emergence of System-on-Chip (SoC) designs, which integrate complex system components into a single chip and have become a significant trend in electronics. SoCs typically include hardware processor units, programmable logic, I/O interfaces, and specialized features. The introduction of the ZYNQ architecture marks a pivotal development in Xilinx's SoC product line, combining a dual-core ARM Cortex-A9 processor with Artix-7 FPGA logic. This architecture facilitates efficient communication between its components through the standard AXI interfaces, ensuring high bandwidth and low latency. The ZYNQ AP SoC consists of two main parts, the Processing System (PS) and the Programmable Logic (PL); note that the ZYNQ-7010 model lacks the PCIe Gen2 controller and the Multi-Gigabit data transceivers.
Figure 2.8 Overview of ZYNQ AP SoC Architecture
The Programmable Logic (PL) closely resembles the Xilinx Artix-7 FPGA series, featuring additional ports and buses that enhance connectivity with the Processing System (PS). Unlike traditional 7-series FPGAs, the PL must be configured directly through the processor or via the JTAG port. The PS comprises various components, including a dual-core Cortex-A9 Application Processing Unit (APU), an Advanced Microcontroller Bus Architecture (AMBA) interconnect, a DDR3 memory controller, and peripheral controllers with 54 multiplexed I/O (MIO) pins. If a peripheral controller lacks MIO connections, it can alternatively interface with I/O through the Extended MIO (EMIO). Peripheral controllers operate in slave mode through the AMBA interconnect, with control registers for read/write functions addressable in the processor's memory space. The PL also connects as a slave to the interconnect interface, allowing multiple cores within a single FPGA structure, each with addressable control registers. Furthermore, cores implemented in the PL can trigger interrupts to the processor and perform Direct Memory Access (DMA) to the DDR3 memory.
To conduct an in-depth analysis of ZYNQ's Processing System (PS), it is essential to highlight the APU, which features a dual-core ARM Cortex-A9 processor. Each core is enhanced with a NEON multimedia co-processing unit, a Floating-Point Unit (FPU), and a Memory Management Unit (MMU).
The APU features a Memory Management Unit (MMU) and L1 cache memory for efficient data and command processing. Additionally, it incorporates L2 cache memory and On-Chip Memory (OCM) that support both processor cores, as illustrated in Figure 2.9.
The communication between the PS and peripheral devices is facilitated through the 54-pin MIO block or, alternatively, via the PL block when using EMIO. The I/O interfaces feature two SPI (Serial Peripheral Interface) ports for 4-wire serial communication, two I2C (Inter-Integrated Circuit) ports for 2-wire communication, and two CAN (Controller Area Network) interfaces, commonly utilized in the automotive sector. Additionally, there are two UART (Universal Asynchronous Receiver-Transmitter) interfaces for low-level serial communication, along with four 32-bit parallel GPIO (General-Purpose Input/Output) ports. The system also includes two SD interfaces for SD card connectivity, two USB 2.0 compatible ports, and two Ethernet interfaces that support speeds of 10 Mbps, 100 Mbps, and up to 1 Gbps.
In the context of the programmable logic (PL), the Configurable Logic Block (CLB) is a crucial element. Each CLB comprises two slices, which serve as hardware resources for implementing both sequential and combinational logic circuits. Within the ZYNQ architecture, each slice includes four Look-Up Tables (LUTs), eight flip-flops, and additional logic resources. Furthermore, a switch matrix routes each CLB, offering flexible routing capabilities to establish connections between the elements within the CLB and other resources in the PL. The ZYNQ architecture also features two types of functional hardware blocks, Block RAMs (BRAMs) and DSP48E1 slices. Figure 2.10 shows the structure of a CLB [24].
Figure 2.10 CLB's structure
2.4.2 Advanced Extensible Interface protocol
To create a complex system that connects the Processing System (PS) with dedicated IP cores on the Programmable Logic (PL), a communication interface is essential, with AXI being the primary factor. AXI is part of the AMBA (Advanced Microcontroller Bus Architecture) family of controlled buses developed by ARM and has evolved through various versions over the years. The original AXI was introduced in 2003 as part of the AMBA 3.0 standard, followed by the announcement of the AMBA 4.0 standard in 2010, which included AXI4. AXI4 comprises three types: AXI4 for high-performance memory-mapped requests, AXI4-Lite for simple, low-bandwidth memory-mapped transactions, and AXI4-Stream, which facilitates high-speed data transmission.
The AXI protocol facilitates communication between an AXI master and an AXI slave, enabling the exchange of information between IP cores. This interaction relies on a memory-mapped connection structured through an Interconnect block; for instance, the Xilinx AXI Interconnect defines the communication standards between AXI masters and slaves, ensuring efficient routing during their interactions [25].
Both the AXI4 and AXI4-Lite protocols have 5 channels: AR (Read Address), R (Read Data), AW (Write Address), W (Write Data), and B (Write Response). They facilitate simultaneous data transfer in both directions between master and slave, accommodating varying data sizes. In contrast to AXI4-Lite, which permits the exchange of only one data unit per transaction, AXI4 can handle bursts of up to 256 data units in a single transaction.
Figure 2.11 shows one period of an AXI4 read transaction, which uses the AR channel and the R channel.
Figure 2.11 Read data period of AXI4 protocol
Figure 2.12 shows one period of an AXI4 write transaction, which uses the AW channel, the W channel, and the B channel.
Figure 2.12 Write data period of AXI4 protocol
The AXI4 and AXI4-Lite protocols require the AXI master to transmit a read or write address along with the data to the AXI slave. In contrast, the AXI4-Stream interface eliminates the need for a read/write address, allowing unidirectional data transmission. Essentially, AXI4-Stream is specifically designed for efficient transmission of data streams in one direction.
Figure 2.13 shows how data is transferred from master to slave over the AXI4-Stream interface.
Figure 2.13 AXI4-Stream data transfer
In AXI4-Stream, TDATA carries the data transferred per clock cycle; a transfer takes place when the sender asserts TVALID and the receiver responds with TREADY. The TUSER signal marks the transmission of the first byte of the data frame, known as the Start of Frame (SOF), while the TLAST signal marks the last byte of each line of the data stream, referred to as the End of Line (EOL). Additionally, AXI4-Stream includes optional TKEEP and TSTRB signals for conveying the position of valid data on the TDATA stream, and TID and TDEST signals for stream routing, corresponding to stream identifiers and destination identifiers [25].
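In Vivado HLS, these side-channel signals surface as fields of the ap_axiu struct from ap_axi_sdata.h; the short sketch below (function name ours) forwards one 24-bit RGB pixel while preserving the frame markers:

#include "ap_axi_sdata.h"
#include "hls_stream.h"

// 24-bit TDATA with 1-bit TUSER (SOF), TID, and TDEST side channels.
typedef ap_axiu<24, 1, 1, 1> pixel_t;
typedef hls::stream<pixel_t> AXI_STREAM;

// Forward one pixel from src to dst, keeping the frame markers intact.
void passthrough(AXI_STREAM& src, AXI_STREAM& dst) {
    pixel_t px = src.read();  // blocking read: honors the TVALID/TREADY handshake
    // px.data = TDATA, px.user = TUSER (SOF), px.last = TLAST (EOL);
    // px.keep/px.strb mark the valid byte lanes, px.id/px.dest route the stream.
    dst.write(px);
}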
Figure 2.14 shows the data transmission period of AXI4-Stream with the pulse status chart of the ACLK, TDATA, TVALID, TREADY, SOF, and EOL signals.
Figure 2.14 Pulse chart of AXI4-Stream data transmission period
SYSTEM DESIGN
Implementation of Sobel Filter based on software application
In this section, we implement the edge detection algorithm in Python, utilizing the OpenCV library. The algorithm follows several key steps: reading data from the input image, converting the color image to grayscale, applying the Sobel filter, performing thresholding, calculating the output results, and exporting the final image. The OpenCV library facilitates this process with functions such as cv2.cvtColor for color-to-gray conversion and cv2.Sobel for applying the Sobel filter.
The "math.h" library is utilized to compute essential values, including output amplitude, Mean Squared Error (MSE), and Peak Signal-to-Noise Ratio (PSNR) Additionally, the functions cv2.imread and cv2.imshow are employed for data import and export The implementation flowchart is illustrated in Figure 3.1, with the results detailed in Chapter 4.
Figure 3.1 Flowchart over the Sobel Filter on OpenCV - Python
Implementation of Sobel Filter based on co-design platform
This section implements the Sobel algorithm on a co-design platform, aiming to achieve superior edge detection compared to the traditional software application. The system is designed to deliver the clearest image edges while meeting essential criteria, including a low Mean Squared Error (MSE), a high Peak Signal-to-Noise Ratio (PSNR), low power consumption, and optimal hardware resource utilization. Additionally, it targets mid-range applications with potential for future enhancements.
Vivado tools, including Vivado HLS and the SDK, are essential for implementing the Sobel algorithm. As described in chapter 2, the HLS tool generates the Sobel IP core, synthesizing the project from a C/C++ algorithm to RTL. Additionally, Vivado is tasked with creating the hardware block diagram, which forms the PL of the entire co-design. Finally, the SDK is launched for the software platform configuration. All steps are shown in Figure 3.2.
Figure 3.2 Design process using Vivado tools
We can outline the step-by-step process as follows.
Step 1 – Open Vivado HLS, create a new project, then select the hardware platform and clock speed. After that, write the code of the edge detection algorithm and its test program, and run the simulation. Once the validation is successful, synthesize and pack the Sobel IP core into an IP repository which can be read by Vivado.
Steps 2 and 3 – In step 2, open Vivado, create a new project, then select the hardware platform, and add all necessary IP cores to the "IP Catalog" of Vivado. In step 3, use the IP repository to pick up the IPs and connect them. Note that we can connect them manually or use the automatic connection tool.
Step 4 – Add the associated "Constraints" file (.XDC), which contains the hardware connection declarations and clock initialization definitions. A design can include multiple Constraints files.
Step 5 – After adding the Constraints to the design, validate the block connections, then synthesize and implement the system.
Step 6 – Generate a Bitstream (.bit) file to configure the PL part. This file contains the binary strings that describe the entire design, including clock values, I/O connections, and so on.
Steps 7, 8, and 9 – Export the hardware design to the SDK tool and create the Board Support Package (BSP). Then write the code and compile it to generate the binary file (.elf) for configuring the PS part.
Step 10 – Load and execute the program on the ZYNQ board.
This program is designed to retrieve input data through a High-Definition Multimedia Interface (HDMI) port and output it via a Video Graphics Array (VGA) port on the hardware. While both ports are available, their corresponding blocks must be developed for this functionality; however, the system currently lacks the capability to display the output.
3.2.2.1 Flowchart of the edge detection algorithm
The flowchart in Figure 3.3 outlines the edge detection algorithm, which relies on two open-source libraries, OpenCV and HLS Video. The process begins with converting the input image from the AXI stream into matrix format, followed by a transformation from RGB to grayscale. Next, the Sobel Filter is applied within the Programmable Logic (PL) of the co-design system to extract the image's boundaries. Finally, the processed image is converted back to RGB and to AXI format for the output display.
Figure 3.3 Flowchart over the entire edge detection system
3.2.2.2 Overview of PL and PS parts cooperation
The edge detection algorithm operates across both the Processing System (PS) and the Programmable Logic (PL), as illustrated in Figure 3.4. This flowchart highlights the transfer of image data between the various hardware components, including the ZYNQ board, the DDR memory card, and the HDMI ports, represented by blue blocks. The gray blocks depict the algorithm's execution in the PS part, while the orange blocks indicate its operation in the PL part.
Figure 3.4 Flowchart illustration for the overview of the PL and PS parts cooperation
In Figure 3.5, the same color coding is used on the hardware illustration for the overview of the PL and PS parts cooperation.
Figure 3.5 Hardware illustration for the overview of the PL and PS parts cooperation
3.2.2.3 Flowchart over the PL part of IP core Sobel
Figure 3.6 details the PL part of the Sobel IP core, beginning with the conversion of the input image data into a matrix using the hls::AXIvideo2Mat instruction, essential for performing the convolution operations. The RGB image is then converted to grayscale via the hls::CvtColor function. Because the HLS Video library stores images as pixel streams, the image must be duplicated to preserve a copy of the grayscale image, referred to as golden_image, for the MSE and PSNR calculations. Finally, the processed image is converted back to the RGB color model before being transformed back into the AXI format.
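A condensed sketch of such a top function is shown below; it follows the pattern of Xilinx's OpenCV-acceleration application note [21], with the image-duplication branch for golden_image omitted for brevity (dimensions and names are illustrative, not the exact thesis code):

#include "hls_video.h"
#include "ap_axi_sdata.h"

#define MAX_HEIGHT 720
#define MAX_WIDTH  1280

typedef ap_axiu<24, 1, 1, 1> interface_t;
typedef hls::stream<interface_t> AXI_STREAM;
typedef hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC3> RGB_IMAGE;
typedef hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC1> GRAY_IMAGE;

void edge_detect(AXI_STREAM& input, AXI_STREAM& output, int rows, int cols) {
#pragma HLS INTERFACE axis port=input
#pragma HLS INTERFACE axis port=output
#pragma HLS DATAFLOW  // run the stages below as a streaming pipeline

    RGB_IMAGE  img_in(rows, cols), img_out(rows, cols);
    GRAY_IMAGE gray(rows, cols), edges(rows, cols);

    hls::AXIvideo2Mat(input, img_in);             // AXI4-Stream -> hls::Mat
    hls::CvtColor<HLS_RGB2GRAY>(img_in, gray);    // RGB -> grayscale
    hls::Sobel<1, 0, 3>(gray, edges);             // 3x3 Sobel, x-derivative
    hls::CvtColor<HLS_GRAY2RGB>(edges, img_out);  // back to RGB for display
    hls::Mat2AXIvideo(img_out, output);           // hls::Mat -> AXI4-Stream
}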
Figure 3.6 Flowchart over the PL part of IP core Sobel
3.2.2.4 Flowchart of the IP core Sobel in Test Bench
Utilizing the Test Bench tool is essential, as it allows the hardware implementation to be simulated before execution. The Sobel test program plays a vital role in validating the functions of the IP. Once the IP is confirmed to work effectively, it can be embedded into the actual hardware. A flowchart illustrating the IP core Sobel in the HLS Test Bench is presented in Figure 3.7. Additionally, certain functionalities require installing a library such as OpenCV.
The test bench exercises the Sobel module referenced in the previous section, but since the input image read via the cv::Mat instruction is already in matrix format, it must be converted to the AXI stream format for compatibility. This conversion is achieved using the cvMat2AXIvideo function, followed by a final transformation back to matrix format for display with AXIvideo2cvMat. Additionally, because the simulation mirrors the hardware implementation, the MSE and PSNR can be calculated using functions from the "math.h" library.
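A skeleton of such a test bench might look as follows (the file names and the golden reference are placeholders; the exact thesis test program differs):

#include "hls_video.h"
#include "hls_opencv.h"  // cvMat2AXIvideo / AXIvideo2cvMat glue
#include <opencv2/opencv.hpp>
#include <math.h>
#include <stdio.h>

typedef ap_axiu<24, 1, 1, 1> interface_t;
typedef hls::stream<interface_t> AXI_STREAM;

void edge_detect(AXI_STREAM& input, AXI_STREAM& output, int rows, int cols);

int main() {
    cv::Mat src = cv::imread("test_image.bmp");       // placeholder input file
    cv::Mat golden = cv::imread("golden_image.bmp");  // placeholder reference
    cv::Mat dst(src.rows, src.cols, CV_8UC3);

    AXI_STREAM in_stream, out_stream;
    cvMat2AXIvideo(src, in_stream);                   // cv::Mat -> AXI4-Stream
    edge_detect(in_stream, out_stream, src.rows, src.cols);
    AXIvideo2cvMat(out_stream, dst);                  // AXI4-Stream -> cv::Mat

    // MSE/PSNR of the hardware output against the golden image (one channel)
    double mse = 0.0;
    for (int y = 0; y < dst.rows; ++y)
        for (int x = 0; x < dst.cols; ++x) {
            double d = (double)dst.at<cv::Vec3b>(y, x)[0]
                     - (double)golden.at<cv::Vec3b>(y, x)[0];
            mse += d * d;
        }
    mse /= (double)(dst.rows * dst.cols);
    double psnr = 10.0 * log10(255.0 * 255.0 / mse);
    printf("MSE = %f, PSNR = %f dB\n", mse, psnr);

    cv::imwrite("result.bmp", dst);                   // export the output image
    return 0;
}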
Figure 3.7 Flowchart of the IP core Sobel in HLS Test Bench
3.2.3 The PL part configuration
The PL part includes the following blocks, shown in Figure 3.8.
Figure 3.8 Edge detection system's block diagram
In short, the main functions of the two blocks above are the following.
The pre-processing block decodes the input stream from the HDMI port into 24-bit RGB data and converts it into the AXI4-Stream data protocol for efficient data transfer within the system.
The central processing block executes the Sobel filter for edge detection, effectively extracting all boundaries from the input image and subsequently storing the output image in the DDR memory.
Designing the pre-processing block requires the following IPs: IP DVI to RGB, IP Constant, IP Clocking Wizard, IP Video In to AXI4-Stream, and IP Video Timing Controller.
The first is the IP DVI to RGB, which connects directly to the HDMI port in the PL to decode the input stream into 24-bit RGB data. Each data channel transmits an 8-bit value on the red, green, or blue bus.
The block diagram in Figure 3.9 illustrates the functionality of this IP, highlighting that the data channels can operate with significant skew relative to one another. This design eliminates concerns regarding phase deflection in relation to the pixel clock or serial clock.
The clock channel transmits one 10-bit character per data channel period. A 10-bit character is segmented into 8 bits of useful data and 2 bits of control data. Consequently, this IP's output provides a control signal during its idle period and pixel data during its active phase.
Figure 3.9 DVI to RGB converter block diagram
Execution of edge detection system
We need to assign the external ports of the system to package pins on the ZYNQ board for the implementation.
Firstly, the ZYNQ Board (ZYBO) supplies an external reference clock of 125 MHz, connected to pin L16 of the Programmable Logic (PL), enabling operation independent of the Processing System (PS). This setup is particularly beneficial for simple applications that do not require a processor. Figure 3.24 illustrates all the clocks supported by the ZYBO, highlighting the 125 MHz system clock.
Therefore, Figure 3.25 below outlines all the pins that we need to connect for this edge detection system.
Figure 3.25 I/O Ports of edge detection system
After packaging the I/O pins, it is essential to run synthesis and implementation and to generate the bitstream, creating the necessary execution files. Next, connect the hardware board to an input image source, such as a laptop or camera, and to an output monitor. The final bitstream file is embedded into the hardware target; all these steps are facilitated by the Hardware Manager tool in Vivado.
RESULT
Block diagram of edge detection system
Figure 4.1 illustrates an overview of this edge detection system's block design created in Vivado. It includes most of the premade IP blocks that were introduced in chapter 3.
Measurements of edge detection system based on HLS implementation
The edge detection system's power consumption, illustrated in Figure 4.2, is analyzed using the Vivado Design Evaluation tools. Dynamic power, the energy required during application operation, is determined by averaging the switching activity over time. Within this dynamic power consumption, the Processing System 7 is the primary power consumer, while the BRAM and DSP exhibit the lowest power usage. Additionally, static power represents the essential minimum power needed for the system's operation.
Device static power is determined by the transistor leakage on all connected voltage rails and by the circuits required for the FPGA to operate normally after configuration. It is measured by programming a blank bitstream into the device and represents the steady-state intrinsic leakage, which is influenced by process, voltage, and temperature. Design power, conversely, is the dynamic power consumption of the device: it changes with each clock cycle and depends on the voltage levels and the logic and routing resources used, and it also encompasses the static current from I/O terminations, clock managers, and other active circuits. The total on-chip power, also referred to as thermal power, is the sum of the device's static power and the power consumed by the design.
Figure 4.2 (a) Power summary, (b) Power On-chip
4.2.2 Throughput value and Hardware utilization
In Vivado HLS, the edge detection system operates with a clock period of 13.5 ns, while the summary table indicates an estimated clock period of 12.49 ns, leaving a timing margin of 1.01 ns, which falls short of the required minimum margin of 1.69 ns. However, the actual timing, as depicted in Figure 4.5, meets the necessary specifications. Additionally, the second table outlines the latency and interval: the maximum latency is 928503 clock cycles, and the core can start processing new input data after 928498 clock cycles. This results in a throughput of approximately 80 frames/s (1 / (928503 cycles/frame × 13.5 ns/cycle)).
In Vivado HLS, setting the target clock period for the IP core is crucial and is determined by the system frequency: a period of 10 ns corresponds to a frequency of 100 MHz, the frequency required by the design's 1280p resolution. If the clock uncertainty is not specified, it defaults to 27% of the 10 ns clock period, leading to a calculated target of 13.5 ns. Latency, in turn, refers to the number of cycles required for an input to propagate to the output; for instance, if a design accepts a new input every clock cycle but requires 10 cycles for the input to propagate to the output, the latency is 10. The HLS Analysis Tool reports these design parameters.
Latency in a system is defined as the duration from when the main function is initiated until it completes, measured in High-Level Synthesis (HLS) as the time interval between two ap_ready signals. To verify this manually, one can observe the simulation and record the moment when the ap_ready signal goes low (the block has started), record when ap_ready goes high again (the block is ready to start on the next dataset), and subtract one from the other; the result is the latency.
Figure 4.3 Measurements of timing and latency in the synthesis report
Figure 4.4 illustrates that the hardware utilization of the Sobel IP core stays well within the available budget, as the consumption of BRAM, DSP48E, FF, and LUT resources remains significantly low. The accompanying table provides a detailed breakdown of how the hardware resources are allocated across the various functions in HLS, clearly indicating that the Sobel function utilizes the majority of these resources.
Figure 4.4 The estimated hardware utilization of IP core Sobel
The RTL exportation report reveals that the implemented IP core Sobel achieved a timing margin of 3.53 ns, surpassing the synthesis estimation of 1.47 ns. Furthermore, the utilization of BRAM, FF, and LUT resources is notably lower than initially estimated, as illustrated in Figure 4.5.
Figure 4.5 The RTL exportation of IP core Sobel
Figure 4.6 illustrates the order in which the functions are performed, which can be observed by using the Performance view of the HLS analysis tool.
Figure 4.6 The ordering of performing functions by HLS Performance
Output result and comparison
The Sobel Filter's input and output images, as implemented on the ZYNQ co-design platform, are illustrated in Figure 4.7, showcasing the results from the HLS Test Bench. Additionally, Figure 4.8 presents the output generated by the Sobel Filter when executed using OpenCV in Python.
Figure 4.7 Result of edge detection algorithm on ZYNQ-7000 platform
Figure 4.8 Result of edge detection algorithm on OpenCV – Python
The result parameters that need to be compared are shown in Table 4.1 below.
The ZYNQ-7000 platform outperforms OpenCV-Python in image boundary detection, providing smoother results while significantly reducing memory usage. This co-design approach also leads to lower power consumption. Furthermore, the ZYNQ-7000 platform achieves a Mean Squared Error (MSE) that is 5.8 times lower and a Peak Signal-to-Noise Ratio (PSNR) that is 2.2 times higher than OpenCV-Python, indicating superior output image quality, as lower MSE and higher PSNR values indicate a better processed image.
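Given the standard definitions of these metrics recalled earlier, a 5.8-fold reduction in MSE corresponds to an absolute PSNR gain of about $10\log_{10}(5.8) \approx 7.6$ dB.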
An optimization of the throughput indicator
In this thesis, throughput refers to the number of image frames processed by the system per second. To alter the throughput, adjustments to the clock period can be made; two scenarios are analyzed to assess the impact of the clock period on throughput. In the first scenario, the target clock is increased to 26 ns, roughly double the current value, while still meeting the timing requirements. This adjustment reduces the SLICEs by 5, the LUTs by 6, and the required FFs to less than 5.3%. Despite that efficient resource usage, the performance of the algorithm is worse than with the 13.5 ns clock. In the second scenario, the target clock is reduced to 5 ns, as recommended by Xilinx designers, resulting in an estimated clock period of 4.36 ns with an uncertainty of 0.63 ns and a post-implementation clock period of 4.721 ns that meets the timing requirements. This optimization achieves a throughput of approximately 214 frames per second; however, it increases the hardware utilization, requiring over 15.73% of the SLICEs, 3.5% of the LUTs, and 30% of the FFs. Detailed results for both solutions are available in the Appendices, and Figure 4.9 illustrates the relationship between the clock period and the throughput.
Figure 4.9 Changing of throughput by the clock period
CONCLUSION AND FUTURE WORK
Conclusion
This thesis presents compelling evidence that implementing an image processing algorithm on a co-design platform is superior to using a software platform. The analysis of the MSE and PSNR metrics indicates that the image quality from the Python application is inferior to that of the ZYNQ co-design. The co-design approach also optimizes memory usage, which is crucial for data storage efficiency in larger systems, especially after edge detection. Additionally, with a power consumption of 1.936 W and a throughput of approximately 80 frames per second, the energy cost stands at about 0.024 J per frame, and further design enhancements promise significant hardware resource savings. Adjusting the target clock period to address the complexities of algorithm development is discussed in section 5.2.
However, we need to take a step back, observe, and acknowledge the weaknesses of this thesis. Firstly, the entire system has never been successfully implemented on the ZYNQ board: the Sobel design in Vivado and SDK faces VGA configuration issues, preventing the output image and the execution time from being displayed on the monitor. Additionally, the design is not fully optimized, exhibiting a Worst Pulse Width Slack (WPWS) of 0.185 ns, although it remains within acceptable limits. Furthermore, the Sobel algorithm lacks innovation in image processing compared to more advanced algorithms like Canny and LoG, which in turn present significant implementation challenges due to their complexity.
Future work
To enhance the edge detection system, we propose two key solutions. First, improving the Sobel IP core by adjusting the target clock period can optimize resource utilization, though increasing the clock period may not yield the best performance indicators. While a 5 ns period can waste resources, a robust co-design hardware setup can achieve significantly higher throughput. Second, integrating the edge detection system into a more advanced application can further enhance its functionality, demonstrating its versatility and potential for improved performance.
For instance, an automatic insect detection model could enhance agricultural yields by overcoming the limitations of the previous MATLAB-only implementation, facilitating real-time detection of harmful insects. The image edge detection system also plays a crucial role in traffic applications, particularly in detecting traffic signs through a combination of edge detection and template matching algorithms, aiding drivers in unfamiliar situations. Moreover, edge detection systems find diverse applications in fields such as medicine, security, and environmental monitoring.
[1] Nguyễn Thanh Hải, Giáo trình xử lý ảnh [Image Processing Textbook], Đại học Quốc Gia Thành phố Hồ Chí Minh, 2003
[2] Soma Prathap; Jatoth Ravi, "Hardware Implementation Issues on Image Processing Algorithms," National Institute of Technology Warangal, 2018
[3] Sagharichi Ha, Pooya; Shakeri, Mojtaba, "License Plate Automatic Recognition Based on Edge Detection," Faculty of Computer and IT Engineering, 2016
[4] Thenmozhi, K; Reddy U, Srinivasulu, "Image Processing Techniques for Insect Shape Detection in Field Crops," International Conference on Inventive Computing and Informatics, 2017
[5] Abbasi, Tanvir, "A Proposed FPGA Based Architecture for Sobel Edge Detection Operator," 2007
[6] Halder, Santanu; Hasnat, Abul; Khatun, Amina; Bhattacharjee, Debotosh; Nasipuri, Mita, "A Fast FPGA Based Architecture for Skin Region Detection," International Journal of Innovative Technology and Exploring Engineering, 2013
[7] Rafael C Gonzalez; Richard E Woods, Digital Image Processing, New York:
[8] Rang M H Nguyen; Michael S Brown, "Why You Should Forget Luminance Conversion and Do Something Better," Computer Vision Foundation, IEEE Xplore, 2017
[9] Nguyễn Quang Hoan, Giáo trình xử lý ảnh [Image Processing Textbook], Học viện Công nghệ Bưu chính Viễn thông, 2006
[10] Sung Kim; Riley Casper, "Applications of Convolution in Image Processing with MATLAB," University of Washington, 2013
[11] Ramesh Jain; Rangachar Kasturi; Brian G Schunck, "Edge Detection," in Machine Vision, McGraw-Hill, 1995, pp 140-185
[12] Ansari, Mohd; Kurchaniya, Diksha; Dixit, Manish, "A Comprehensive Analysis of Image Edge Detection Techniques," International Journal of Multimedia and Ubiquitous Engineering, 2017
[13] Oskar Mencer; Dennis Allison; Elad Blatt; Mark Cummings; Michael J Flynn; Jerry Harris; Carl Hewitt; Quinn Jacobson; Maysam Lavasani; Mohsen Moazami; Hal Murray; Masoud Nikravesh; Andreas Nowatzyk; Mark Shand; Shahram Shirazi, "The History, Status, and Future of FPGAs: Hitting a nerve with field-programmable gate arrays," 2020
[14] Ian Kuon; Russell Tessier; Jonathan Rose, "FPGA Architecture: Survey and Challenges," Foundations and Trends in Electronic Design Automation, 2008
[15] Xilinx, UltraScale Architecture DSP Slice, www.xilinx.com, 2021
[16] Rajewski, Justin, "How does an FPGA work?," Embedded Micro, 2015 [Online] Available: https://learn.sparkfun.com/tutorials/how-does-an-fpga-work/all
[17] Eastland, Nate, "FPGA - Configurable Logic Block," Digilent Inc Blog, 2015 [Online] Available: https://digilent.com/blog/fpga-configurable-logic-block/
[18] M Rouse, "IP core (Intellectual Property core)," TechTarget Contributor, March 2011 [Online] Available: https://www.techtarget.com/whatis/definition/IP-core-intellectual-property-core [Accessed 6 July 2022]
[19] K Karras; J Hrica, "Designing protocol processing systems with vivado high level synthesis," 2014 [Online] Available: https://docs.xilinx.com/v/u/en-US/xapp1209-designing-protocol-processing-systems-hls [Accessed 8 July 2022]
[20] Xilinx Inc., "Vivado HLS optimization methodology guide," 2017 [Online] Available: https://docs.xilinx.com/v/u/2017.4-English/ug1270-vivado-hls-opt-methodology-guide [Accessed 8 July 2022]
[21] Xilinx Inc., "Accelerating OpenCV applications with ZYNQ-7000 all programmable SoC," 2015 [Online] Available: https://docs.xilinx.com/v/u/en-US/xapp1167 [Accessed 8 July 2022]
[22] Arthur H Veen, "Dataflow Machine Architecture," Center for Mathematics and Computer Science, 1986
[23] Xilinx Inc., "Vivado Design Suite User Guide: High-Level Synthesis," 2017 [Online] Available: https://docs.xilinx.com/v/u/en-US/ug902-vivado-high-level-synthesis [Accessed 8 July 2022]
[24] Digilent, ZYBO FPGA Board Reference Manual, 2017
[25] Xilinx Inc., "AXI Reference Guide," 2012 [Online] Available: https://docs.xilinx.com/v/u/en-US/ug761_axi_reference_guide [Accessed 8 July 2022]
[26] Digilent, "DVI to RGB (Sink) 2.0 IP Core User Guide," 9 October 2019 [Online] Available: www.digilentinc.com [Accessed 12 July 2022]
[27] Xilinx Inc., "Clocking Wizard v6.0 LogiCORE IP Product Guide," 20 April 2022 [Online] Available: www.xilinx.com [Accessed 12 July 2022]
[28] Xilinx Inc., "LogiCORE IP Constant (v1.1)," 9 April 2018 [Online] Available: www.xilinx.com [Accessed 12 July 2022]
[29] Xilinx, "Video In to AXI4-Stream v4.0 LogiCore IP Product Guide," 18 November
2015 [Online] Available: www.xilinx.com [Accessed 23 June 2022]
[30] Xilinx, "Video Timing Controller v6.2 LogiCORE IP Product Guide," 26 February
2021 [Online] Available: www.xilinx.com [Accessed 23 June 2022]
[31] Xilinx Inc., "Processing System 7 v5.5 Product Guide," 10 May 2017 [Online] Available: www.xilinx.com [Accessed 12 July 2022]
Result indicators for IP core Sobel – a target clock period of 26 ns
Synthesis Report for 'edge_detect'
Version: 2016.4 (Build 1756540 on Mon Jan 23 19:31:01 MST 2017)
Clock: ap_clk; Target: 26.00 ns; Estimated: 21.17 ns; Uncertainty: 3.25 ns
Type min max min max
Name BRAM_18K DSP48E FF LUT
Instance Module BRAM_18K DSP48E FF LUT
Export Report for 'edge_detect'
Result indicators for IP core Sobel – a target clock period of 5 ns
Synthesis Report for 'edge_detect'
Version: 2016.4 (Build 1756540 on Mon Jan 23 19:31:01 MST 2017)
Clock: ap_clk; Target: 5.00 ns; Estimated: 4.36 ns; Uncertainty: 0.63 ns
Latency Interval Type max min max min max
Name BRAM_18K DSP48E FF LUT
Instance Module BRAM_18K DSP48E FF LUT
Export Report for 'edge_detect'