Binary to Hardware Translation
Implementing a hardware/software co-design from software binaries presents significant challenges, as the compiler must identify computational bottlenecks, partition these sections for hardware translation, create an appropriate hardware/software interface, and adjust the original software binary for seamless integration. For the co-design to be valuable, it must demonstrate performance improvements over the original implementation. Unlike traditional high-level synthesis, binary translation offers the advantage of compatibility with any high-level language and compiler flow, as binary code serves as the final format for a specific processor. Additionally, software profiling at the binary level provides greater accuracy than at the source level, facilitating more effective hardware/software partitioning.
Figure 1.1 Using binary translation to implement a hardware/software co-design: part of the software binary is partitioned onto a new processor, while the remainder is implemented in hardware on an FPGA/ASIC.
Translating software binaries from a fixed processor architecture to hardware systems like FPGAs or ASICs presents a complex challenge that goes beyond traditional high-level synthesis. General-purpose processors have a limited number of functional units and physical registers, requiring advanced register-reuse algorithms to manage large data sets effectively. This often leads to inefficiencies, as memory spilling optimizations can waste clock cycles due to excessive memory loads and stores. The key challenge lies in reversing these optimizations by understanding the architectural limitations and leveraging fine-grain parallelism through a greater number of functional units, embedded multipliers, registers, and on-chip memories.
Figure 1.2 The FREEDOM compiler bridging the gap between the DSP and FPGA design environments, serving both DSP designers not versed in FPGA design and FPGA designers unfamiliar with DSP concepts.
The manual implementation of DSP applications on FPGAs poses significant challenges, as DSP engineers typically lack hardware implementation expertise, while hardware engineers are often unfamiliar with the intricacies of DSP and general-purpose processors. To address this gap, an automated solution is needed to partition or translate software binaries. In response to this need, we have developed the FREEDOM compiler, which translates DSP software binaries into hardware descriptions suitable for FPGAs. The term "FREEDOM" stands for "Fabrication of REconfigurable hardware Environment from DSP Optimized Machine code."
The FREEDOM compiler concept is illustrated in Figure 1.2. This research uses the Texas Instruments TMS320C6211 DSP as the general-purpose processor platform and the Xilinx Virtex II FPGA as the target hardware platform. The following sections provide a brief overview of these architectures and their design flows.
Texas Instruments TMS320C6211 DSP Design Flow
The Texas Instruments C6211 DSP features eight functional units, including dual load and store memory data paths, data address paths, and register file data cross paths, enabling the execution of up to eight simultaneous instructions. It supports 8, 16, and 32-bit data, along with 40 and 64-bit arithmetic operations, and includes two sets of 32 general-purpose registers, each 32 bits wide. Additionally, the processor is equipped with two multipliers capable of performing two 16x16 or four 8x8 multiplies per cycle and offers special support for non-aligned 32/64-bit memory access. The C6211 also accommodates bit-level algorithms and includes hardware for rotate and bit count operations.
Figure 1.3 Texas Instruments C6211 DSP architecture.
The application design flow for processors like the Texas Instruments C6211 begins with DSP engineers creating specifications and high-level design models using languages such as C/C++, MATLAB, or SIMULINK. These models undergo simulation and verification with both known and randomized data. Once verified, the high-level design is compiled into a binary for the processor, which is subsequently simulated again to ensure correctness and identify computational bottlenecks. If the results do not meet specifications or timing constraints, DSP designers refine the high-level design or implement optimizations, and may resort to writing assembly code for further efficiency.
Figure 1.4 Texas Instruments C6211 DSP development flow.
Xilinx Virtex II FPGA Design Flow
The Xilinx Virtex II FPGA features up to 168 18x18 multipliers, enabling 18-bit signed and 17-bit unsigned calculations, with cascading capabilities for larger numbers. It includes 3 Mbits of embedded Block RAM, 1.5 Mbits of distributed memory, and 100K logic cells. Additionally, the Virtex II FPGAs support up to 12 Digital Clock Managers (DCMs) and deliver logic performance exceeding 300 MHz.
The implementation of an application on an FPGA typically starts with DSP engineers creating specifications and design models using high-level languages like C/C++, MATLAB, or SIMULINK. These models undergo simulation and verification with known or randomized data before being handed over to hardware engineers, who either manually write hardware descriptions in VHDL or Verilog or utilize high-level synthesis tools for automatic translation. The design is then simulated again to ensure bit-true accuracy and identify computational bottlenecks. If the design fails to meet specifications or timing constraints, hardware engineers refine the descriptions or optimize the design for efficiency. Once the simulation results align with the required specifications, the hardware descriptions are synthesized into a netlist of gates for the FPGA, followed by another round of simulation for verification. Any errors or unmet specifications lead to a re-evaluation of the hardware descriptions. After successful post-synthesis simulation, the netlist is placed and routed on the FPGA using back-end tools, with further timing analysis and verification. Should any issues arise at this stage, the hardware engineers must once again reassess the hardware descriptions.
Figure 1.5 Xilinx Virtex II FPGA development flow
Motivational Example
To grasp the intricacies of translating software binaries and assembly code to hardware, we can examine the Texas Instruments C6211 DSP assembly code. This processor features eight functional units, enabling it to execute up to eight instructions simultaneously. In this code, MPY instructions take two cycles to complete, while all other instructions require just one cycle, leading to the inclusion of a NOP instruction after the MPY. The || symbol indicates that certain instructions are executed in parallel with the preceding ones, resulting in a total execution time of seven cycles for this code section.
Figure 1.6 Example TI C6000 DSP assembly code.
Translating this code into an FPGA using a straightforward RTL finite state machine approach, with one operation per state, yields no performance improvement and necessitates eight cycles for execution due to the eight instructions, not counting NOPs. To enhance performance, it is essential to investigate scheduling techniques and other optimizations that leverage the inherent fine-grain parallelism of FPGA architecture, ultimately minimizing design complexity and reducing execution clock cycles.
Dissertation Overview
This research significantly contributes by detailing the process and considerations for the automatic translation of software binaries from general-purpose processors to RTL descriptions for FPGA implementation. It includes various optimizations and scheduling routines, alongside crucial concepts for translating software to hardware. The effectiveness of this approach is validated through experimental evaluations of synthesized software binaries, assessing their area and performance on FPGAs compared to traditional processor implementations. This dissertation represents a collaborative effort with Gaurav Mittal.
This dissertation is structured as follows: Chapter 2 surveys related work, while Chapter 3 offers a comprehensive overview of the FREEDOM compiler infrastructure, which converts software binaries into hardware descriptions for FPGAs. Chapter 4 delves into the complexities of control and data flow analysis for scheduled and pipelined software binaries. In Chapter 5, the optimizations within the FREEDOM compiler are outlined. Chapter 6 introduces innovative scheduling and operation chaining techniques for FPGA designs, followed by Chapter 7, which presents a resource sharing optimization. Chapter 8 explores hardware/software partitioning through structural extraction methods. Chapter 9 features a case study of the MPEG-4 CODEC, illustrating the translation from software to hardware using the discussed methods. Finally, Chapter 10 concludes the dissertation and suggests directions for future research.
This dissertation explores three key research areas: high-level synthesis, binary translation, and hardware-software co-design. Central to this research is the FREEDOM compiler, which integrates these domains by enabling high-level synthesis of software binaries for FPGA architecture implementation within a hardware-software co-design framework. Subsequent sections will highlight recent advancements in these interconnected fields.
High-Level Synthesis
The challenge of converting high-level behavioral language descriptions into register transfer level (RTL) representations has been widely studied in both research and industry. One of the pioneering commercial tools for behavioral synthesis is Synopsys' Behavioral Compiler, which translates behavioral VHDL or Verilog into RTL VHDL or Verilog. Princeton University's Synthesis System [67] is another system that translated behavioral VHDL models and processes into RTL implementations.
In recent years, significant advancements have been made in the development of compilers that convert high-level programming languages into RTL VHDL and Verilog. Numerous commercial products are now available from electronic design automation (EDA) companies, including synthesis tools from Adelante, Celoxica, and Cynapps that facilitate the translation of C code into RTL descriptions. Notably, the MATCH compiler, which translates MATLAB functions to RTL VHDL for FPGAs, has been commercialized by AccelChip. Additionally, there are system-level tools that convert graphical system representations into RTL descriptions.
Cadence's SPW, Xilinx's System Generator, and Altera's DSP Builder are prominent tools in high-level synthesis design flows. Additionally, SystemC, a newer language, enables users to create hardware system descriptions through a C++ class library. Another option is HardwareC, a hardware description language with syntax akin to C that models system behavior; it was used in the Olympus Synthesis System [15] for synthesizing digital circuit designs.
Recent advancements in high-level synthesis emphasize the integration of low-level design optimizations within the synthesis process. Dougherty and Thomas introduced a unified approach that aligns behavioral synthesis with physical design, enabling simultaneous scheduling, allocation, binding, and placement. Gu et al. developed an incremental high-level synthesis system that merges high-level and physical design algorithms to enhance scheduling, resource binding, and floorplanning, effectively minimizing synthesis time, area, and power consumption. Bringmann et al. focused on partitioning and an enhanced interconnection cost model for multi-FPGA systems, aiming to achieve optimal performance while adhering to area and interconnection constraints. Additionally, Peng et al. presented techniques for the automatic synthesis of sequential programs and high-level asynchronous circuit descriptions into fine-grain asynchronous process netlists tailored for high-performance FPGA architectures.
In recent years, power and thermal optimizations have become crucial in high-level synthesis research. Musoll and Cortadella evaluated various power optimizations during synthesis, while Jones et al. focused on the PACT compiler for translating C code to RTL VHDL and Verilog, emphasizing power efficiency for FPGAs and ASICs. Chen et al. introduced the LOPASS high-level synthesis system, which reduces power consumption in FPGA designs by incorporating RTL power estimates that account for wire length. Stammermann et al. discussed simultaneous floorplanning, functional unit binding, and allocation to minimize interconnect power during high-level synthesis. Additionally, Lakshminarayana and Jha proposed an iterative technique for optimizing hierarchical data flow graphs for power and area efficiency. Mukherjee et al. explored temperature-aware resource allocation and binding methods to minimize peak temperatures in designs, enhancing reliability and preventing hot spots in integrated circuits.
These research studies concentrate primarily on conventional high-level synthesis, where hardware implementations are derived from accessible high-level applications and source code. In contrast, the FREEDOM compiler translates software binaries and assembly language into RTL VHDL and Verilog for FPGAs. This approach is particularly interesting because it addresses the challenges of binary translation to hardware, especially when source code and high-level information for optimizations are not readily available.
Binary Translation
Extensive research has been conducted on binary translation and decompilation, with Cifuentes et al. outlining methods to convert assembly or binary code between different instruction set architectures (ISAs) and to decompile software binaries into high-level languages. Additionally, Kruegel et al. introduced a technique for decompiling obfuscated binaries, while Dehnert et al. developed a technique known as Code Morphing, successfully implementing a complete system-level version of the Intel x86 ISA on the Transmeta Crusoe VLIW processor.
Dynamic binary optimization has seen significant advancements, with notable contributions from various researchers. Bala et al. introduced the Dynamo system, which optimizes binaries for the HP architecture at runtime. Similarly, Gschwind et al. created the Binary-translation Optimized Architecture (BOA) for the PowerPC architecture. Levine and Schmidt proposed HASTE, a hybrid architecture that dynamically compiles instructions from an embedded processor onto a reconfigurable computational fabric to enhance performance. Additionally, Ye et al. developed a compiler system for the Chimaera architecture, featuring a small, reconfigurable functional unit integrated into the pipeline of a dynamically scheduled superscalar processor.
Hardware-Software Co-Designs
Choosing the right hardware-software architecture involves carefully balancing various factors, including the distribution of tasks across processing elements, effective inter-processor communication, and overall device costs.
De Micheli et al. [15][16][17] and Ernst [22] have discussed many of the fundamental aspects of hardware/software co-designs.
Hardware/software co-designs are typically partitioned at the task or process level, with significant contributions from researchers in the field. Vallerio and Jha developed a task graph extraction tool for hardware/software partitioning of C programs, while Xie and Wolf introduced an allocation and scheduling algorithm for data-dependent tasks in distributed embedded systems. The co-synthesis process aims to create a distributed multiprocessor architecture that optimally allocates processes, ensuring task graph deadlines are met while minimizing system costs. Gupta and De Micheli proposed an iterative improvement algorithm for partitioning real-time embedded systems based on a cost model that considers hardware, software, and interface constraints. Additionally, Wolf presented a heuristic algorithm that synthesizes both hardware and software architectures of distributed systems, ensuring performance constraints are satisfied. Xie and Wolf also introduced a co-synthesis algorithm that optimizes distributed embedded systems by selecting the best ASIC implementation for tasks, utilizing a heuristic iterative improvement approach to analyze and compare various implementations.
Li et al. [40] introduced the Nimble compilation tool, which automates the compilation of system-level applications written in C for a hardware/software embedded reconfigurable architecture. This architecture includes a general-purpose processor, an FPGA, and a memory hierarchy. Their approach focuses on fine-grain hardware/software partitioning at the loop and basic-block levels, utilizing heuristics to identify and transfer frequently executed or time-intensive loop bodies to hardware for improved performance.
Research on hardware-software partitioning of software binaries has been advanced by Stitt and Vahid, who initially focused on manually translating software binary kernels from frequently executed loops on a MIPS processor for implementation on a Xilinx Virtex FPGA. Their more recent work has introduced fast dynamic hardware/software partitioning, in which they automatically map simple loop kernels onto reconfigurable hardware. However, their approach utilized significantly simpler hardware than typical commercial FPGA architectures, limiting it to combinational logic structures, sequential memory addresses, pre-determined loop sizes, and single-cycle loop bodies.
Unlike these approaches, the FREEDOM compiler can convert entire software binaries or specific sections into hardware implementations on FPGAs. This capability allows for both standalone designs and hardware/software co-designs, enhancing flexibility and efficiency in development.
This chapter presents an overview of the FREEDOM compiler infrastructure, which is designed to provide a unified entry point for various assembly languages. The front-end of the compiler utilizes a description of the source processor's Instruction Set Architecture (ISA) to configure the assembly language parser, with ISA specifications written in SLED from the New Jersey Machine-Code toolkit. The parser converts the input source code into an intermediate assembly representation known as the Machine Language Syntax Tree (MST), where simple optimizations, linearization, and procedure extraction are performed. Subsequently, a control and data flow graph (CDFG) is generated from the MST instructions, enabling the execution of more complex optimizations, scheduling, and resource binding before translating the CDFG into an intermediate high-level Hardware Description Language (HDL) that models processes, concurrency, and finite state machines.
Additional optimizations and customizations for the target architecture are performed on the HDL; information about the target architecture is acquired via the Architecture Description Language (ADL).
The HDL is directly converted into RTL VHDL and Verilog for FPGA mapping, accompanied by a testbench to ensure bit-true accuracy in the design. Additionally, a graphical user interface (GUI) has been developed to facilitate project management and compiler optimizations for the designs.
The remainder of this chapter offers a comprehensive overview of the various components of the compiler, including the Machine Language Syntax Tree (MST), the Control and Data Flow Graph (CDFG), and the Hardware Description Language (HDL). Additionally, it includes a concise description of the graphical user interface (GUI) and outlines the verification techniques used for the generated designs.
Figure 3.7 Overview of the FREEDOM compiler infrastructure.
The Machine Language Syntax Tree
The Machine Language Syntax Tree (MST) serves as an intermediate language with a syntax akin to the MIPS Instruction Set Architecture (ISA). Its generic design allows it to encompass various ISAs, including those that utilize predicated and parallel instruction sets. Each MST design consists of procedures, each containing a sequence of instructions. For a comprehensive understanding of the MST grammar, please refer to Appendix A.
Advanced general-purpose processors incorporate predicated operations within their Instruction Set Architecture (ISA), leading to all MST instructions being structured as three-operand, predicated commands in the format [pred] op src1, src2, dst, which can be interpreted as: if (pred = true) then op (src1, src2) → dst. MST operands consist of four types: Register, Immediate, Memory, and Label. Additionally, MST operators are categorized into six groups: Logical, Arithmetic, Compare, Branch, Assignment, and General. For a detailed overview of the supported operations in the MST language, refer to Table 3.1.
Table 3.1 Supported operations in the MST grammar.
Operator categories: Logical, Arithmetic, Compare, Branch, Assignment, General
Arithmetic: ADD, DIV, MULT, NEG, SUB
Compare: CMPEQ, CMPGE, CMPGT, CMPLE, CMPLT, CMPNE
Branch: BEQ, BGEQ, BGT, BLEQ, BLT, BNEQ, CALL, GOTO, JMP
An MST procedure consists of a self-contained set of instructions with an independent control flow. Intra-procedural control flow can be modified using branch instructions like BEQ, GOTO, and JMP, with destination operands that may include Labels, Registers, or Immediate values. For inter-procedural control, the CALL operation is utilized to transfer control from one procedure to another, requiring the destination operand to be a label that specifies the name of the procedure, function, or library.
The limited number of physical registers in a processor requires compilers to implement sophisticated register-reuse algorithms. These optimizations can create false dependencies linked to register names, complicating the identification of accurate data dependencies, particularly in scheduled or pipelined binaries and parallel instruction sets. To address these issues, each MST instruction is assigned a timestep, which defines a linear ordering of instructions, and an operation delay, which corresponds to the number of execution cycles. Each cycle begins with an integer-based timestep T, and each instruction n in a parallel instruction set is assigned the timestep

T_n = T + (0.01 * n).

An assembly instruction may be translated into more than one MST instruction; each instruction m in such an expanded instruction set is assigned the timestep

T_m = T_n + (0.0001 * m).

The write-back time of an instruction, wb = timestep + delay, is the cycle in which the resulting data becomes valid. If the operation delay is zero, the data is valid immediately. If the delay is greater than zero, the write-back time is rounded down to the nearest whole number, meaning valid data is available at the start of the write-back cycle.
Figure 3.8 demonstrates the role of instruction timestep and delay in identifying data dependencies within the MST. The MULT operation in the first instruction introduces a delay slot, rendering the value in register A4 invalid until the start of cycle 3. Meanwhile, the LD instruction's result remains invalid until cycle 7, and the ADD instruction's result is also not valid until cycle 3. As a result, the ADD instruction in cycle 3 relies on the outcomes of both the MULT operation in cycle 1 and the ADD operation in cycle 2. Additionally, the first three instructions share a dependency on the same source register, A4.
Figure 3.8 MST instructions containing timesteps and delays for determining data dependencies (columns: TIMESTEP, PC, OP, DELAY, SRC1, SRC2, DST).
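To make this bookkeeping concrete, the following Python sketch (illustrative data structures and names, not the FREEDOM compiler's actual implementation) assigns fractional timesteps to instructions in parallel sets and uses write-back times to decide whether one instruction depends on another, in the spirit of the example in Figure 3.8.

from dataclasses import dataclass
from math import floor

@dataclass
class MSTInstruction:
    op: str
    srcs: list        # source register names
    dst: str          # destination register name
    delay: int = 0    # operation delay in cycles
    timestep: float = 0.0

    @property
    def wb(self) -> float:
        # Write-back time: the cycle in which the resulting data is valid.
        # With a non-zero delay the result is valid at the start of that
        # cycle, so the value is rounded down to a whole cycle number.
        t = self.timestep + self.delay
        return float(floor(t)) if self.delay > 0 else t

def assign_timesteps(cycles):
    # cycles: list of parallel instruction sets, one per execution cycle T.
    for T, parallel_set in enumerate(cycles, start=1):
        for n, insn in enumerate(parallel_set):
            insn.timestep = T + 0.01 * n      # T_n = T + (0.01 * n)

def depends_on(use, definition):
    # 'use' depends on 'definition' if it reads the definition's destination
    # register and the definition's write-back is no later than the use.
    return definition.dst in use.srcs and definition.wb <= use.timestep

mult = MSTInstruction("MULT", ["A4", "A4"], "A4", delay=2)   # valid at cycle 3
add1 = MSTInstruction("ADD",  ["A4", "A2"], "A2")            # cycle 2
add2 = MSTInstruction("ADD",  ["A4", "A2"], "A7")            # cycle 3
assign_timesteps([[mult], [add1], [add2]])
print(depends_on(add2, mult), depends_on(add2, add1))        # True True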
The Control and Data Flow Graph
Each MST procedure is transformed into a distinct control and data flow graph (CDFG), where the control flow graph illustrates the flow of control through block-level representations marked by branch operations. Meanwhile, the data flow graph consists of interconnected nodes that signify data dependencies within the procedure. By utilizing the write-back times (wb = timestep + delay) for every operation, one can determine the data dependencies for each MST instruction, as detailed in Section 3.1.3.
        MVK  S1  2000,A3
LOOP:   LDW  D1  *A4++,A2
        LDW  D1  *A3++,A5
        NOP  4
        MPY  M1  A2,A5,A6
        SUB  S1  A1,1,A1
        ADD  L1  A6,A7,A7
[A1]    B    S2  LOOP
        NOP  5
Figure 3.9 TI assembly code for a dot product function.
The nodes in the CDFG are distinguished into five types: Constants, Variables, Values, Control, and Memory. Constant and variable nodes serve as inputs to operations; value nodes perform operations such as addition and multiplication; control nodes manage branching in the control flow; and memory nodes handle memory read and write operations. Figure 3.9 shows the TI assembly code for a dot product procedure, and Figure 3.10 shows the corresponding CDFG representation produced by the FREEDOM compiler.
Figure 3.10 CDFG representation for a dot product function.
In the CDFG, memory read and write operations are depicted as single node operations that receive two input edges from source nodes. For read operations, the source nodes include an address and a memory variable, while write operations consist of an address and the value intended for writing. Each memory variable corresponds to distinct memory elements, allowing for clear identification of the memory being accessed. Consequently, memory partitioning can be achieved simply by altering the name of the memory variable in these operations.
When rescheduling FPGA design operations, the sequence of memory operations may change, leading to potential memory hazards. Some FPGAs restrict memory operations to one per cycle, necessitating careful scheduling to maintain the correct order of these operations. To achieve this, virtual edges are introduced between each read and write operation and the subsequent memory operations, as illustrated in block 2 of Figure 3.10.
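As a minimal sketch of that serialization (hypothetical node structures, not the compiler's CDFG classes), the following routine chains memory operations on the same memory variable with virtual edges so that a later scheduler cannot reorder them; it conservatively serializes reads as well as writes.

def add_memory_ordering_edges(nodes):
    # nodes: CDFG nodes in original program order; non-memory nodes are ignored.
    # Returns (pred_id, succ_id) virtual edges chaining consecutive memory
    # operations on the same memory variable.
    last_mem_op = {}            # memory variable -> id of last read/write seen
    virtual_edges = []
    for node in nodes:
        if node["kind"] in ("mem_read", "mem_write"):
            mem = node["mem"]
            if mem in last_mem_op:
                virtual_edges.append((last_mem_op[mem], node["id"]))
            last_mem_op[mem] = node["id"]
    return virtual_edges

ops = [
    {"id": 0, "kind": "mem_read",  "mem": "A"},
    {"id": 1, "kind": "add"},
    {"id": 2, "kind": "mem_read",  "mem": "A"},
    {"id": 3, "kind": "mem_write", "mem": "A"},
]
print(add_memory_ordering_edges(ops))    # [(0, 2), (2, 3)]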
Predicated instructions are operations executed based on a specific condition, known as a predicate. In the Control Data Flow Graph (CDFG), these operations incorporate the predicate operand as a source node linked to the operation node. To maintain data dependencies, an additional edge from the previous definition of the destination node is included as a source node. This process effectively creates an if-then-else structure, where the previous definition is assigned to the destination operand if the predicate condition is unmet, as demonstrated in Figure 3.11.
Figure 3.11 Predicated operation in the CDFG.
Optimizing a Control Data Flow Graph (CDFG) presents challenges due to predicated operations, which create non-deterministic outcomes at compile time. This non-determinism hinders various optimizations and leads to increased design area due to additional multiplexer logic. Despite these limitations, predicated operations can enhance parallelism opportunities within the design.
To address the issue of predicates in operations, two alternative methods were evaluated. The first method involves replacing predicates with branch operations that skip instructions when conditions are unmet, leading to disruptions in data flow and reduced efficiency and parallelism. The second method transforms predicates into a set of instructions resembling a multiplexer, where the predicate condition chooses between new and old values, resulting in an increase in instructions, area, and clock cycles. Ultimately, it was concluded that the original representation is the most efficient approach for managing predicates in the CDFG.
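A rough sketch of the retained representation (hypothetical helper structures): the operation node computes the new value, and a select node keyed by the predicate chooses between that value and the previous definition of the destination.

def lower_predicated_op(op, srcs, pred, prev_def):
    # Build a CDFG fragment for '[pred] op srcs -> dst':
    #     dst = op(srcs) if pred is true, otherwise the previous value of dst.
    compute = {"kind": op, "name": op.lower() + "_result"}
    select  = {"kind": "select", "name": "dst"}     # mux keyed by the predicate
    edges = [(s, compute) for s in srcs]
    edges += [(pred, select), (compute, select), (prev_def, select)]
    return [compute, select], edges

# '[A1] ADD A6,A7 -> A7': keep the old value of A7 when A1 is false.
nodes, edges = lower_predicated_op(
    "ADD",
    srcs=[{"kind": "var", "name": "A6"}, {"kind": "var", "name": "A7"}],
    pred={"kind": "var", "name": "A1"},
    prev_def={"kind": "var", "name": "A7_prev"},
)
for src, dst in edges:
    print(src["name"], "->", dst["name"])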
3.2.3 Procedure, Function and Library Calls
In the CDFG, procedure, function, and library calls are represented by a CALL node, with input and output edges indicating the I/O ports for the procedure. Variables or memory read within the procedure become input ports, while those written to become output ports. Memory operands are designated as both input and output nodes to maintain the correct sequence of operations. The example CDFG in Figure 3.12 illustrates the dot product procedure being called twice, highlighting a memory dependency that prevents parallel execution of the two functions and mitigates read/write hazards.
Figure 3.12 CDFG representation of a CALL operation.
The Hardware Description Language
The Hardware Description Language (HDL) serves as a low-level intermediate language that effectively models processes, concurrency, and finite state machines. Its syntax closely resembles that of VHDL and Verilog, with detailed grammar specifications available in Appendix B.
In a design, each Control Data Flow Graph (CDFG) corresponds to a specific Hardware Description Language (HDL) entity, where each operation node is typically linked to a state in an HDL finite state machine. By implementing scheduling techniques, designers can leverage parallelism, allowing multiple instructions to be assigned to each state, which minimizes the overall execution cycles. Furthermore, optimizations and customizations are applied to the HDL to improve output efficiency and ensure compatibility with the target device's architecture.
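To illustrate how scheduling packs several operations into one state, here is a simple ASAP-style grouping (a simplified sketch, not the FREEDOM compiler's scheduler): any operation whose predecessors have already completed can share the current state.

def asap_states(deps):
    # deps: dict mapping each operation to the set of operations it depends on.
    # Returns a list of FSM states; each state is the set of operations whose
    # inputs are already available in that cycle.
    remaining = dict(deps)
    done, states = set(), []
    while remaining:
        ready = {op for op, d in remaining.items() if d <= done}
        if not ready:
            raise ValueError("cyclic dependency")
        states.append(ready)
        done |= ready
        for op in ready:
            del remaining[op]
    return states

# Dot-product-like body: two loads, a decrement, a multiply, an accumulate.
deps = {"ld_a": set(), "ld_b": set(), "sub_i": set(),
        "mpy": {"ld_a", "ld_b"}, "acc": {"mpy"}}
print(asap_states(deps))   # three states instead of one state per operation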
Memory models in HDL are created to facilitate backend synthesis tools in automatically inferring synchronous and asynchronous RAMs. These memory components serve as FIFO buffers, enabling communication between a co-processor and hardware. To enhance throughput performance, pipelining techniques are implemented for memory operations in the design.
Each CDFG translated to HDL is represented as a synchronous finite state machine (FSM) within an Entity model, with procedures called from other procedures instantiated as HDL entities. The HDL for the dot product procedure is illustrated in Figure 3.13(a), where the process is controlled by I/O signals, beginning with a high reset signal and cycling through states until the function completes, indicated by the done signal. To avoid memory contention, an asynchronous multiplexer process manages memory control among all processes, activated by assigning a select value to the mem_mux_ctrl signal before a process starts, as depicted in Figure 3.13(b). The HDL model for the memory MUX control process is further detailed in Figure 3.14.
BuildNodeSet( node, nodeset, level, max_level, P )
1  if level > max_level then return
2  else if PathReconverges(node, nodeset, false)
3     then return
4  pset = P[node]
5  if pset != NULL then
6     if node != pset->root then return
7     else add pset->nodes to nodeset
8  else add node to nodeset
9  for each node n in nodeset do
10    for each input edge p of n outside nodeset do
11       BuildNodeSet(p, nodeset, level+1, max_level, P)
Figure 7.50 Pseudo-code for growing nodesets.
SADD32( S32, S32 )
SADD32( S32, SMULT32(S32,S32) )
SADD32( SMULT32(SADD32(S32,S32), SSUB32(S32,S32)), SSUB32(S32,S32) )
Figure 7.51 Generated nodesets and expressions.
The sequence in which templates are added is crucial to avoid cycles in a nodeset. If a secondary path exists between a nodeset and a joining node, the intermediate node must be included first to prevent complications. For instance, in Figure 7.51, adding the subtract node before the multiply node would create a cycle, as the multiply node would inadvertently serve as both an input and output. Therefore, the multiply node is added in the second stage, followed by the subtract node in the third stage.
1 for each outgoing edge s of node do
5 else if diverged == true then
7 else if s has not been visited then
8 if PathReconverges(s, nodeset, true) then
Figure 7.52 Pseudo-code for reconverging paths.
The recursive algorithm illustrated in Figure 7.52 is designed to determine if a given node has a reconverging path to a specified nodeset. It utilizes a depth-first traversal approach, recursively calling itself on each successor node. If a node encountered during this traversal does not belong to the nodeset, it is classified as diverging. Therefore, if any subsequent node along the same path leads back to the nodeset, a reconverging path exists.
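A compact Python sketch of that check (simplified relative to the pseudo-code in Figure 7.52; names are illustrative): a depth-first search that only reports success when a path first leaves the nodeset and later re-enters it.

def path_reconverges(node, nodeset, succs, diverged=False, visited=None):
    # succs: dict mapping each node to its list of successor nodes.
    # Returns True if some path from 'node' leaves 'nodeset' and later
    # reaches a member of 'nodeset' again.
    visited = set() if visited is None else visited
    for s in succs.get(node, []):
        if s in nodeset:
            if diverged:               # re-entered the set after diverging
                return True
        elif s not in visited:
            visited.add(s)             # outside the set: the path has diverged
            if path_reconverges(s, nodeset, succs, True, visited):
                return True
    return False

succs = {"a": ["b", "x"], "x": ["b"]}  # a->b stays inside; a->x->b reconverges
print(path_reconverges("a", {"a", "b"}, succs))   # True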
We introduce a technique for generating template matches that utilizes our previously described methods for growing nodesets and generating expressions for nodeset equivalence checking. The procedure GenerateExprTable(), illustrated in Figure 7.53, processes an input graph G along with a hash table E, which links equivalent nodesets to hash expressions, and a table P that associates each node with its parent nodeset. Additionally, it incorporates a look-ahead factor to enhance the generation of nodesets.
The procedure systematically processes each node in graph G to create nodesets based on the specified look-ahead value, considering all levels of look-ahead concurrently. For example, a look-ahead value of 2 generates nodesets for look-aheads of 0, 1, and 2, while an infinite look-ahead examines all possible nodeset combinations for each node. After constructing a nodeset, a hash expression is generated for the directed acyclic graph (DAG), serving as the key for adding the nodeset to table E. To manage the table's size, the process halts upon encountering an empty hash value or a hash value that matches one from a previous stage. Figure 7.51 illustrates the nodesets produced for a look-ahead of 2 along with the corresponding hash expressions at each level.
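The sketch below conveys the flavor of that construction under simplifying assumptions: a canonical expression string stands in for the hash value, nodesets are grown one input level at a time rather than through the full BuildNodeSet procedure, and only expressions with at least two matches are kept.

from collections import defaultdict

def expression(node, graph, depth):
    # Canonical string for the operations rooted at 'node', cut off at 'depth'.
    if depth == 0 or not graph[node]["inputs"]:
        return graph[node]["type"]
    args = ",".join(expression(i, graph, depth - 1) for i in graph[node]["inputs"])
    return graph[node]["type"] + "(" + args + ")"

def generate_expr_table(graph, look_ahead):
    # Map each expression (the hash key) to the root nodes that produce it.
    table = defaultdict(list)
    for node in graph:
        seen = set()
        for level in range(look_ahead + 1):
            key = expression(node, graph, level)
            if key in seen:            # no growth at this level: stop early
                break
            seen.add(key)
            table[key].append(node)
    return {k: v for k, v in table.items() if len(v) >= 2}

graph = {
    "a1": {"type": "SADD32",  "inputs": ["m1", "x"]},
    "a2": {"type": "SADD32",  "inputs": ["m2", "y"]},
    "m1": {"type": "SMULT32", "inputs": []},
    "m2": {"type": "SMULT32", "inputs": []},
    "x":  {"type": "S32",     "inputs": []},
    "y":  {"type": "S32",     "inputs": []},
}
print(generate_expr_table(graph, look_ahead=1))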
3 for i = 0 to look_ahead do
4 nodeset is a collection of nodes
5 N is a hash table of string to node
10 if nodeset is empty then break
11 else if key was already seen then break
12 else add nodeset to E[key]
13 for each element e in E do
Figure 7.53 Pseudo-code for template matching.
The procedure results in a comprehensive mapping of expressions to equivalent nodesets, but it is crucial to resolve local conflicts within each set of template matches. Villa et al. [66] present a method for approximating the maximal independent set using an adjacency matrix, which maximizes the number of nodesets by prioritizing those with minimal conflicts. Templates with fewer than two matches are eliminated, as single-match templates provide no advantage. Figure 7.54 demonstrates conflicting template matches alongside the adjacency matrix, highlighting that selecting nodeset T1 or T3 first yields the highest number of matches, while starting with T2 would lead to suboptimal outcomes by removing T1 and T3 from consideration.
Figure 7.54 Adjacency matrix for overlapping sets
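A simplified sketch of that conflict resolution (a greedy approximation in the spirit of, but not identical to, Villa et al.'s method): nodesets with the fewest conflicts are kept first, and everything they overlap is dropped.

def approx_max_independent_set(nodesets):
    # nodesets: dict name -> set of node ids; two nodesets conflict when they
    # share a node. Greedily keep low-conflict nodesets first.
    conflicts = {a: {b for b in nodesets if a != b and nodesets[a] & nodesets[b]}
                 for a in nodesets}
    chosen, removed = [], set()
    for name in sorted(conflicts, key=lambda n: len(conflicts[n])):
        if name not in removed:
            chosen.append(name)
            removed |= conflicts[name]      # discard everything it overlaps
    return chosen

templates = {"T1": {1, 2}, "T2": {2, 3}, "T3": {3, 4}}
print(approx_max_independent_set(templates))
# ['T1', 'T3'] -- choosing T2 first would have eliminated both T1 and T3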
After constructing the table of potential template matches, it is essential to identify the optimal templates for implementation, as each selection influences subsequent choices and may lead to conflicts between nodesets of different templates. The SelectTemplateMatches() procedure, illustrated in Figure 7.55, iteratively identifies the best template matches from E using a cost function that considers the number of operations covered minus the implementation cost of replicated hardware (refer to Section 7.2.6). The resulting list of matches is added to T. All conflicting nodesets in E are eliminated, and unmatched sets are pruned. This iterative process continues until E is empty or no appropriate template matches exist. If a single template match remains in T after combining other matches into larger templates, that template is pruned, and its nodes are reintegrated into the pool for future matching.
When selecting a cost function, various techniques can be employed, particularly when accurate area measurements are accessible, which can lead to near-optimal area solutions. In this context, we opted for a wiring cost due to several factors: estimating area costs at an abstract level is often challenging, as it requires predicting the hardware translation of the resulting CDFG. Additionally, prioritizing large, complex networks can minimize interconnects and streamline routing. Ultimately, reducing the number of nets also contributes to a decrease in overall area.
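Under those assumptions, the selection criterion can be sketched as follows (hypothetical numbers and helper names): a template's gain is the work removed by sharing it across its matches minus the cost of the one retained copy, approximated here by its net count.

def template_gain(matches, ops_per_instance, nets_per_instance):
    # matches: number of non-conflicting occurrences of the template.
    # Gain = operations removed by sharing - cost of the single shared copy,
    # with cost approximated by the number of nets (wiring) it introduces.
    if matches < 2:                    # a single match buys nothing
        return float("-inf")
    return ops_per_instance * (matches - 1) - nets_per_instance

def select_best(candidates):
    # candidates: dict name -> (matches, ops, nets); pick the highest gain.
    return max(candidates, key=lambda n: template_gain(*candidates[n]))

candidates = {"adder_pair": (6, 2, 3), "mac_chain": (3, 5, 6)}
print(select_best(candidates))         # the candidate with the larger net gain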
5 if best_expr != NULL then
Figure 7.55 Pseudo-code for template selection.
To ensure consistent functionality across all isomorphic Directed Acyclic Graph (DAG) structures, a tree structure template must be utilized, as depicted in Figure 7.48. Although this method may lead to an increase in area, such growth is manageable through template pruning when implementation costs exceed available matches. The construction of the template tree structure involves decomposing the DAG into a tree via post-order traversal, starting from the root node. During this process, nodes with edges to external operations are added to the fanout list. It may be necessary to replicate DAG input edges to accommodate split nodes within the expanded tree structure. Figure 7.56 showcases the template tree structure derived from the DAG in Figure 7.51, featuring replicated input edges and multiple fanouts, where the subtract operation has been replicated. Ultimately, the nodeset is extracted into a separate Control Data Flow Graph (CDFG) and replaced by a single call node to the function Template1(B,A,B,A,A,10).
Figure 7.56 Generated template for the DAG in Figure 7.51.
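The decomposition itself can be sketched as below (simplified: node replication is expressed by revisiting shared nodes during a traversal from the root, and nodes that also feed operations outside the set are recorded as fanouts).

def dag_to_tree(root, inputs_of, external_users):
    # Expand the DAG rooted at 'root' into a tree by revisiting (replicating)
    # any node reachable along more than one path. Returns (tree, fanouts),
    # where fanouts lists nodes that also drive operations outside the set.
    fanouts = [root] if external_users.get(root) else []
    children = []
    for child in inputs_of.get(root, []):
        subtree, sub_fanouts = dag_to_tree(child, inputs_of, external_users)
        children.append(subtree)
        fanouts += sub_fanouts
    return (root, tuple(children)), fanouts

# add( mult(add1, sub), sub ): 'sub' is shared, so it appears twice in the tree.
inputs_of = {"add": ["mult", "sub"], "mult": ["add1", "sub"],
             "add1": ["A", "B"], "sub": ["A", "B"]}
tree, fanouts = dag_to_tree("add", inputs_of, external_users={"sub": ["other"]})
print(tree)
print(fanouts)    # ['sub', 'sub'] -- the replicated node also drives other logic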
This section introduces a Resource Sharing algorithm, illustrated in Figure 7.57, which integrates the techniques discussed in this chapter. The algorithm processes an input graph G, along with a look-ahead value for creating a table of template expressions and a backtracking value. As noted, each selected template match influences subsequent template choices, prompting the question of how alternative initial selections (such as the second, third, or fourth best template expressions) might alter future outcomes.
7 E maps exprs to nodeset lists
8 T maps templates to nodeset lists
9 P maps nodes to their parent nodeset
11 for i = 0 to min(backtrack+1, E.size) do
13 for each template t in T do
15 if template_cost > best_cost then
Figure 7.57 Pseudo-code for resource sharing.
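A sketch of that backtracking idea (the helper callables are assumed, not the FREEDOM compiler's API): rerun the greedy selection several times, each time forcing a different one of the top-ranked expressions to be chosen first, and keep the best covering found.

def resource_share(expr_table, select_greedy, score, backtrack):
    # expr_table: dict expr -> list of matching nodesets.
    # select_greedy(expr_table, first_choice): greedy selection seeded with a
    # forced first expression; score(selection): quality of the covering.
    ranked = sorted(expr_table, key=lambda e: len(expr_table[e]), reverse=True)
    best, best_score = None, float("-inf")
    for first_choice in ranked[:backtrack + 1]:
        selection = select_greedy(expr_table, first_choice)
        if score(selection) > best_score:
            best, best_score = selection, score(selection)
    return best

table = {"SADD32(S32,S32)": ["n1", "n2", "n3"], "SMULT32(S32,S32)": ["n4", "n5"]}
pick = resource_share(table,
                      select_greedy=lambda t, first: [first],
                      score=lambda sel: len(table[sel[0]]),
                      backtrack=1)
print(pick)    # ['SADD32(S32,S32)']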
Experimental Results
This section details the outcomes of ten DSP benchmarks utilizing the proposed resource sharing algorithm. The CDFGs were created with the FREEDOM compiler and underwent multiple unrollings to enhance design complexity. Table 7.12 illustrates the number of templates generated alongside the maximum template size for different look-ahead and backtracking depths. Table 7.13 highlights the quantity and percentage of resource reductions achieved, calculated as the cost of removed operations minus the requirements for template implementation. Additionally, Table 7.14 summarizes the total runtime for resource sharing across various look-ahead and backtracking depths.
The examination of template sizes and resource reduction across various look-ahead depths reveals ambiguities in the results, primarily influenced by the sequence of template growth impacting future developments. Notably, the absence of memory operations in the templates restricted the maximum coverage achievable. Interestingly, as the look-ahead depth increased, the number of generated templates decreased, which aligns with the expectation that larger templates are identified earlier. However, this increase in look-ahead depth also led to an exponential growth in hash value sizes, necessitating O(2^d) time for string matching, thereby explaining the significant rise in execution times associated with greater look-ahead depths.
The findings indicate that quality improvements plateau at a look-ahead of around 3, with resource usage reductions between 40% and 80%. The final set of data in each table reflects a backtracking depth of 10, highlighting the algorithm's optimality. Notably, the variation in resource usage remained within a 5% margin, demonstrating the efficiency of our resource-sharing algorithm, even with minimal incremental template growth. Therefore, employing small, incremental template growth is advantageous for resource sharing, leading to nearly optimal results while minimizing CPU time.
Applying resource sharing techniques in hardware implementations using RTL models can lead to diminished quality gains due to the inefficiencies introduced by multiplexers and glue logic from logic synthesis tools. The additional glue logic often outweighs the benefits of reducing extraneous hardware, undermining the advantages of resource sharing. To effectively evaluate this resource sharing algorithm, it is essential to adapt the FREEDOM compiler to directly convert CDFGs into structural hardware models, creating a complete netlist of gates tailored for the target architecture, rather than relying on third-party tools for RTL synthesis.
Table 7.12 Number of templates generated and maximum template sizes for varying look-ahead and backtracking depths.
Benchmark     LA = 1      LA = 3      LA = 5      LA = 7      LA = INF    BT = 10 (LA = INF)
dot_prod      5 / 24      4 / 18      4 / 18      4 / 18      4 / 18      4 / 18
iir           9 / 18      6 / 48      6 / 48      6 / 48      6 / 48      6 / 48
matmul_32     10 / 24     9 / 18      6 / 18      6 / 18      6 / 18      9 / 20
gcd           7 / 5       6 / 5       5 / 5       5 / 5       5 / 5       5 / 5
diffeq        18 / 9      8 / 93      9 / 10      9 / 10      9 / 10      9 / 10
ellip         4 / 3       2 / 3       2 / 3       2 / 3       2 / 3       2 / 3
laplace       9 / 5       7 / 5       7 / 5       7 / 5       7 / 5       7 / 5
fir16tap      9 / 24      7 / 18      5 / 18      5 / 18      5 / 18      5 / 18
fir_cmplx     15 / 21     12 / 37     8 / 37      8 / 37      8 / 37      12 / 37
sobel         13 / 8      11 / 8      11 / 13     11 / 13     10 / 13     11 / 13
(Each entry: templates generated / maximum template size; the final column uses a backtracking depth of 10.)
Table 7.13 Number and percentage resources reduced with varying look-ahead and backtracking depth.
Benchmark     LA = 1        LA = 3        LA = 5        LA = 7        LA = INF      BT = 10 (LA = INF)
dot_prod      202  69.2%    220  75.3%    220  75.3%    220  75.3%    220  75.3%    220  75.3%
iir           531  78.4%    520  76.8%    520  76.8%    520  76.8%    520  76.8%    520  76.8%
matmul_32     219  61.3%    231  64.7%    214  59.9%    214  59.9%    214  59.9%    220  61.6%
gcd           51   43.6%    51   43.6%    48   41.0%    48   41.0%    48   41.0%    48   41.0%
diffeq        161  55.5%    133  45.9%    146  50.3%    146  50.3%    146  50.3%    146  50.3%
ellip         104  64.6%    101  62.7%    101  62.7%    101  62.7%    101  62.7%    101  62.7%
laplace       229  67.4%    215  63.2%    215  63.2%    215  63.2%    215  63.2%    215  63.2%
fir16tap      197  61.8%    215  67.4%    208  65.2%    208  65.2%    208  65.2%    208  65.2%
fir_cmplx     514  71.3%    474  65.7%    500  69.3%    500  69.3%    500  69.3%    518  71.8%
sobel         894  79.2%    870  77.1%    837  74.1%    868  76.9%    846  74.9%    844  74.8%
(Each entry: resources reduced / percentage reduction; the final column uses a backtracking depth of 10.)
Table 7.14 Timing results in seconds for resource sharing with varying look-ahead and backtracking depth.
Benchmark     LA = 1    LA = 3    LA = 5    LA = 7    LA = INF    BT = 10 (LA = INF)
dot_prod      3.6       4.8       5.9       9.0       23.7        31.5
iir           19.5      29.8      40.4      47.6      70.8        120.6
matmul_32     4.2       7.1       8.7       11.3      20.7        36.9
gcd           0.8       0.8       0.8       0.8       0.8         2.1
diffeq        4.4       7.3       9.0       17.3      23.6        31.4
ellip         1.6       1.6       1.8       1.9       1.9         3.2
laplace       6.5       8.4       11.9      15.2      20.8        31.3
fir16tap      4.9       6.3       7.3       9.7       17.5        29.3
fir_cmplx     25.1      36.8      45.3      65.7      156.6       231.2
sobel         38.3      49.4      54.0      54.9      48.0        164.9
(The final column uses a backtracking depth of 10.)
Summary
This chapter introduces a novel algorithm for extracting regularities in resource sharing within Control Data Flow Graphs (CDFGs). By employing a heuristic approach, the algorithm dynamically expands templates based on a cost function and the frequency of recurring patterns. It utilizes backtracking and adjusts the look-ahead depth to balance solution quality with runtime efficiency. Experimental results across ten benchmarks demonstrate that incremental template growth is advantageous for resource sharing, leading to near-optimal results while minimizing CPU time.
Partitioning is essential in the hardware/software co-design of embedded systems, as it significantly impacts performance trade-offs. Designers face critical decisions regarding the optimal partitioning of a design to enhance efficiency, especially when synthesis and simulation tools struggle with complex systems. By focusing on key components, designers can accelerate the design cycle, making partitioning a necessary strategy in current design technology.
Designers typically analyze application functions to identify bottlenecks and pinpoint the most computationally intensive code segments. These segments, often comprising entire procedures or loop structures, are usually targeted for hardware implementation to achieve significant performance improvements.
Partitioning is typically conducted at high abstraction levels to simplify design structure analysis. However, partitioning software binaries presents challenges due to the often unidentifiable structures and complex pipelining, which complicate the partitioning process. Additionally, when loop structures are too small, larger structures may be necessary, or designers might prefer to transfer substantial blocks of sequential code to hardware for efficiency.
This chapter outlines a comprehensive approach to hardware/software partitioning of software binaries, focusing on the extraction of structures from procedures across various design hierarchy levels. These self-contained structures can be independently optimized, transferred to hardware as standalone procedures, or integrated into a hardware/software co-design framework.
Related Work
Design partitioning is a crucial aspect that can be applied at various levels of granularity, including functional and structural levels, throughout the high-level synthesis process. Early-stage partitioning is often complex due to critical performance tradeoff decisions made with incomplete information. Research has demonstrated the effectiveness of partitioning at the Finite State Machine (FSM) level, with Feske et al. proposing an FSM partitioning and state encoding method tailored for FPGA architectures. Additionally, Vahid et al. found that functional-level partitioning yields superior I/O and size constraints compared to structural netlist partitioning. Notably, partitioning a design prior to synthesis can significantly enhance synthesis runtimes, achieving improvements by an order of magnitude.
Recent studies indicate that partitioning designs into smaller sections can significantly reduce power consumption. Chung et al. introduced a method for partitioning behavioral-level descriptions before scheduling, enabling each section to be managed by an independent gated clock that can be turned off when not in use. This approach optimizes area and power while adhering to global timing constraints through high-level estimations. Hwang et al. further advanced power reduction techniques by partitioning finite state machines (FSMs) and deactivating inactive processes and datapaths to minimize unnecessary switching activity. Additionally, Venkataraman et al. developed the GALLOP tool for FSM partitioning, utilizing a genetic algorithm to simultaneously execute state assignment and partitioning, thereby enhancing power efficiency.
Partitioning software binaries is more complex than partitioning high-level applications due to the absence of functional boundary information in the design. Initially, it requires coarse-grain partitioning at the procedural level, or procedure extraction. Cifuentes and Simon have developed algorithms to identify function calls from assembly programs by utilizing predefined procedure call interfaces and introduced a procedure abstraction language for specifying calling conventions across different architectures. They also highlighted the use of data flow analysis to pinpoint function arguments. Additionally, Mittal et al. presented a method for automatically extracting function bodies from linked software binaries by leveraging procedure-calling conventions along with limited control and data flow information, which was integrated into the FREEDOM compiler.
Recent studies on the structural partitioning of software binaries have primarily concentrated on basic loop structures or entire functions. In contrast, our approach enables both fine and coarse grain partitioning at procedural and structural levels. This allows for the individual optimization of partitioned kernels, which can then be selectively transferred to hardware, either as independent procedures or through a hardware/software co-design strategy.
Structural Extraction
This section discusses structural extraction as a method for partitioning software binaries in hardware/software co-design. Implemented in the FREEDOM compiler, structural extraction enables the identification of fine-grain and coarse-grain structures, such as loops and if-then-else blocks, from MST procedures at various hierarchical levels. Designers can utilize the GUI interface to extract these structures, which are treated as independent procedures, facilitating their optimization and transfer to hardware. Additionally, input and output ports are automatically generated during CDFG optimizations, as outlined in Section 5.2.1.
Effective structural extraction begins with identifying feasible structures for partitioning. While a manual profiling approach can highlight computational bottlenecks, heuristics may also guide the selection of code sections suitable for hardware migration, particularly large loops. Our method empowers designers to make structural selections, focusing on partitioning structures while adhering to a maximum inter-procedural cut-size for data dependencies. As outlined in Section 5.2.1, we restrict the cut-size of the I/O interface to match the number of physical registers in the source general-purpose processor architecture, preventing virtual registers from crossing inter-procedural boundaries.
The Discover_Structures() procedure, as outlined in Figure 8.58, is designed to identify extractable structures from an MST procedure, P, using a Boolean mapping, M, to indicate extractability. It starts by creating a linearized Control Flow Graph (CFG) from the MST procedure, following the guidelines in Chapter 4. The process involves generating reaching definitions for data flow analysis and conducting structural analysis to produce a minimized structural graph, as described in Sections 5.1.1 and 5.1.2. The recursive function Discover_Extractable_Structures(), detailed in Figure 8.59, assesses each structure in the hierarchical tree for extractability. A structure can only be extracted if there are no data dependencies from virtual registers that cross its structural boundaries. Figure 8.60 illustrates the Can_Extract_Structure() procedure, which performs data dependency analysis to detect such dependencies. It examines each basic block's instructions to ensure that no virtual operands have dependencies extending beyond the structure, as inter-procedural virtual data dependencies prevent independent extraction from the parent structure.
Discover_Structures( P, M )
5  Discover_Extractable_Structures( CFG, M, struct_list )
Figure 8.58 Procedure for discovering structures for extraction.
Discover_Extractable_Structures( CFG, M, struct_list )
1 for each structure s in struct_list do
2 if ( Can_Extract_Structure(CFG, s) ) then
6 Discover_Extractable_Structures(CFG, M, s->GetStructures())
Figure 8.59 Recursive procedure for identifying extractable structures.
Can_Extract_Structure( CFG, s )
1  for each block b in s do
2     for each instruction i in block b do
3        for each virtual operand register r in i do
4           if r has a definition outside of s then
5              return false
6           else if r has a use outside of s then
7              return false
8  return true
Figure 8.60 Procedure for determining if a structure can be extracted.
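A Python rendering of that check (illustrative structures; the actual compiler works over MST instructions and DU/UD-chains): a structure is extractable only if none of its virtual registers is defined or used across the structure boundary.

def can_extract_structure(structure_blocks, other_blocks):
    # Blocks are lists of instructions; an instruction is a dict with 'defs'
    # and 'uses' sets of virtual register names (physical registers are
    # exposed through the procedure's I/O interface instead).
    inside = set()
    for block in structure_blocks:
        for insn in block:
            inside |= insn["defs"] | insn["uses"]
    outside = set()
    for block in other_blocks:
        for insn in block:
            outside |= insn["defs"] | insn["uses"]
    # Any virtual register live across the boundary prevents extracting the
    # structure independently of its parent.
    return not (inside & outside)

loop_body = [[{"defs": {"v1"}, "uses": {"v0"}}, {"defs": {"v0"}, "uses": {"v1"}}]]
rest      = [[{"defs": {"v0"}, "uses": set()}]]
print(can_extract_structure(loop_body, rest))   # False: v0 crosses the boundary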
The algorithm depicted in Figure 8.58 requires O(n) time for the creation of the Control Flow Graph (CFG), as outlined in Chapter 4. Additionally, the process of generating reaching definitions, including DU-chains and UD-chains, requires O(n²) time in the number of instructions. Structural analysis utilizing the graph minimization technique with depth-first search operates with a time complexity of O(b), where b represents the number of blocks. The function Discover_Extractable_Structures() systematically traverses the structure hierarchy, examining all instructions to identify definitions that cross structural boundaries, as detailed in Can_Extract_Structure(). The process of searching DU-chains for these definitions incurs a worst-case time complexity of O(n) for each structure. In scenarios where each structure consists of two substructures, one with a single instruction, the overall traversal may escalate to O(n²). Therefore, the complete algorithm exhibits a worst-case time complexity of O(n²).
After identifying the extractable structures in the procedure, users can manually select them for extraction via the GUI interface. A dialog box displays a tree model of the structural hierarchy, allowing exploration of the structures. Users can select a structure and its children for extraction by checking the corresponding box, although some checkboxes are disabled for structures that cannot be extracted independently of their parent. Additionally, the right-side window provides valuable information about the highlighted structure, including its type (such as loop or if-then-else) and the number and percentage of operations it contains.
Figure 8.61 GUI interface for selecting structures for extraction.
The recursive procedure outlined in Figure 8.62 is designed for extracting structures from a given procedure P. It utilizes a Boolean mapping M to determine which structures to extract, along with the current list of structures at each hierarchical level. When the mapping M identifies a structure as true, that structure and all its child structures are extracted. If not, the function recursively processes the child structures. The extraction process involves replacing the basic blocks and instructions within the identified structures with a call to a new procedure. This algorithm has a worst-case run time complexity of O(s), where s represents the number of structures involved.
1 for each structure s in struct_list do
Figure 8.62 Recursive procedure for extracting structures.
In the analysis of the fir16tap procedure, the extraction of the outer loop is illustrated in Figure 8.63, where instructions from PC values 0x0704 to 0x0770 have been replaced with a call to the fir16tap_struct_0 procedure. Figure 8.64 presents the Verilog code that facilitates this call from its parent within a finite state machine (FSM). The first state initializes the inputs and reset, followed by a second state that waits until the process has completed. Finally, the third state retrieves the output values generated by the procedure.
Figure 8.63 Extracted MST instructions replaced with a procedure call.
In the FIR16TAP finite state machine, the initial state, FIR16TAP_FSM_PROCESS_STATE_1, is characterized by resetting the fir16tap_struct_0_instance_0 and assigning various input signals from the shift registers. Specifically, the reset signal is activated, and inputs A10 to A9 are assigned values from SR3, SR5, SR9, SR11, SR13, SR14, SR17, SR18, SR20, and SR21, respectively. Following these assignments, the control transitions to FIR16TAP_FSM_PROCESS_STATE_2, indicating a progression in the processing sequence.
In the FIR16TAP FSM process state 2, the reset signal for the fir16tap_struct_0_instance_0 is set to low. If the fir16tap_struct_0_instance_0 indicates completion, the control transitions to FIR16TAP_FSM_PROCESS_STATE_3; otherwise, it remains in FIR16TAP_FSM_PROCESS_STATE_2.
FIR16TAP_FSM_PROCESS_STATE_3 : begin
SR0