56 S. Aditya and V. Kathail

the high degree of chip integration made possible by Moore's law. For example, a cell-phone chip now contains multiple modems, an imaging pipeline for a camera, video codecs, music players, etc. A video codec used to be a whole chip a few years ago; now it is a small part of one chip. Second, there is relentless pressure to reduce time-to-market and lower prices. It is clear that automation is the key to success. Automatic application engine synthesis (AES) from a high level algorithmic description significantly reduces both design time and design cost. There is a growing consensus in the design community that hardware/software co-design, high level synthesis, and high level IP reuse are together necessary to close the design productivity gap.

4.1.2 Application Engine Design Space

Application engines like multi-standard video codecs are large, complex systems containing a significant number of processing blocks with complex dataflow and control flow among them. Externally, these engines interact with the system CPU, system bus and other application engines. The ability to synthesize complex application engines from C algorithms automatically requires a careful examination of the type of architectures that lend themselves well to such automation techniques. Broadly speaking, there are three main approaches for designing application engines [4] (see Fig. 4.2).

1. Dedicated hardware accelerators: They provide the highest performance and the lowest power. Typically, they are 2–3 orders of magnitude better in power and performance than a general purpose processor. They are non-programmable but can provide a limited amount of multi-modal execution based on configuration parameters. There are two approaches for automatic synthesis of dedicated hardware blocks:

[Figure omitted; it shows the design space spanning behavioral synthesis of accelerators/FPGAs, architectural synthesis of accelerators/FPGAs, customizable or configurable processors, and hybrid application engines.]

Fig.
4.2 The application engine design space

4 Algorithmic Synthesis Using PICO 57

(a) Behavioral synthesis: This is a bottom-up approach in which individual blocks are designed separately. C statements and basic blocks are mapped to a datapath, potentially leading to an irregular datapath and interconnect. The datapath is controlled by a monolithic state machine which reflects the control flow between the basic blocks and can be fairly complex.

(b) Architectural synthesis: This is a top-down approach with two distinguishing characteristics. First, it takes a global view of the whole application and can optimize across blocks in order to provide high performance. Second, it uses an efficient, high performance architecture template to design the datapath and control, leading to more predictable results. PICO's approach for designing dedicated hardware accelerators falls in this category.

2. Customizable or configurable processors: Custom or application-specific processors can give an order of magnitude better performance and power than a general-purpose processor while still maintaining a level of programmability. This approach is well-suited for the following two cases:

(a) The performance requirements are not very high and the power requirements are not very stringent.

(b) Standards or algorithms are still in flux, and flexibility to make algorithmic changes after fabrication is needed.

3. Hybrid approach: In our view, this is the right approach for synthesizing complex application engines. An efficient architecture for these engines is a combination of

(a) Programmable processor(s), typically a custom embedded processor, for parts of the application that don't require high performance

(b) Dedicated hardware blocks to get high performance at low power and low area

(c) Local buffers and memories for high bandwidth

This approach allows a full spectrum of designs to be explored that trade off among multiple dimensions of cost, performance, power and programmability.
4.1.3 Requirements of a Production AES System

In addition to generating competitive hardware, a high level synthesis system needs to fit into an SoC design flow for it to be practically useful and of significant benefit to designers. We can identify a number of steps in the SoC design process. These steps, along with the capabilities that the synthesis system must provide for each step, are described below.

1. Architecture exploration for application engines: Architecture and micro-architecture choices have a great impact on the power, performance and area of a design, but there is no way to reliably predict this impact without actually doing the design. A high level synthesis system makes it possible to do design space exploration to find an optimal design. However, the system must be structured to make it easy to explore multiple designs from the same C source code. For example, a system that requires users to control the scheduling of individual operations in order to get good results is not very useful for architectural exploration because of the amount of time it takes to do one design. Therefore, design automation is the key to effective exploration.

2. High level, multi-block IP design and implementation: This is, of course, the main purpose of a high level synthesis system. It must be able to generate designs that are competitive with manual designs for it to be widely acceptable in production environments.

3. RTL verification: It is unrealistic to expect that designers would write testbenches for the RTL generated by a synthesis system. They should verify their design at the C level using a C test bench. The synthesis system should then automatically generate either an RTL test bench including test vectors or a C-RTL co-simulation test bench. In addition, the synthesis system should provide a mechanism to test corner cases in the RTL that cannot be exercised using the C test bench.

4.
System modeling and validation (virtual platform) support: Currently, designers have to manually write transaction level models (TLM) for the IP they are designing in order to incorporate them in system level platforms. This is in addition to implementing the designs in RTL. Generating transaction level models directly from a C algorithm will significantly reduce the development time for building these models.

5. SoC integration: To simplify the hardware integration of the generated IP into an SoC, the system should support a set of standard interfaces that remain invariant over designs. In addition, the synthesis system should provide software device drivers for easy integration into a CPU based system.

6. RTL to GDSII design flow integration: The generated RTL should seamlessly go through the existing RTL flows and methodologies. In addition, the RTL should close timing in the first pass and shouldn't present any layout problems, because it is unrealistic to expect that designers will be able to debug these problems for RTL they didn't write.

7. Soft IP reuse and design derivatives: One of the promised benefits of a high level synthesis system is the ability to reuse the same C source for different designs. Examples include designs at different performance points (low-end vs. high-end) across a product family, or design migration from one process node to another. As an example of the requirement placed on the tool, support for process migration requires that there is a methodology to characterize the process and then feed the relevant information to the tool so that it is retargeted to that process.

4.2 Overview of AES Methodology

Figure 4.3 shows the high level flow for synthesis of application engines following the hybrid approach outlined in Sect. 4.1.2. Typically, the first step in the application engine design process is high level partitioning of the desired functionality into hardware and software components.
Depending on the application, an engine may consist of a control processor (custom or off-the-shelf) and one or more custom accelerator blocks that help to meet one or more design objectives such as cost, performance, and power. Traditionally, the accelerator IP is designed block by block, either by reusing blocks designed previously or by designing new hardware blocks by hand, keeping in view the budgets for area, cycle-time and power. Then the engine is assembled, verified, and integrated with the rest of the SoC platform, which usually takes up a significant fraction of the overall product cycle. The bottlenecks and the risks in this process clearly lie in doing the design, verification and integration of the various accelerator blocks in order to meet the overall functionality specification and the design objectives. In the rest of the paper, we will focus our attention on these issues.

In traditional hardware design flows, a substantial initial investment is made to define a detailed architectural specification of the various accelerator blocks and their interactions within the application engine. These specifications help to drive the manual design and implementation of new RTL blocks and their verification test benches. In addition, a functional executable model of the entire design may be used to test algorithmic coverage and serve as an independent reference for RTL verification.

Fig. 4.3 Application engine design flow

In design flows based on high level synthesis, on the other hand, an automatic path to RTL implementation and verification is possible starting from a high level, synthesizable specification of functionality together with architectural information that helps in meeting the desired area, performance and power metrics. The additional architectural information may be provided to an HLS tool in various ways.
One possible approach is to combine the hardware and implementation specific information together with the input specification. Some tools based on SystemC [5] require the user to model the desired hardware partitioning and interfaces directly in the input specification. Other tools require the user to specify detailed architectural information about various components of the hardware being designed using a GUI or a separate design file. This has the advantage of giving the user full control of their hardware design, but it increases the burden of input specification and makes the specification less general and portable across various implementation targets. It also leaves the tool with very little freedom to make changes and optimizations in the design in order to meet the overall design goals. Often, multi-block hardware integration and verification becomes solely the responsibility of the user because the tool has little or no control over the interfaces being designed and their connectivity.

4.2.1 The PICO Approach

PICO [6] provides a fully automated, performance-driven, application engine synthesis methodology that enables true algorithmic level input specification and yet is sensitive to physical design constraints. PICO not only produces a cost-effective C-to-RTL mapping but also guarantees its performance in terms of throughput and cycle-time. In addition, multiple implementations at different cost and performance tradeoffs may be generated from the same functional specification, effectively reusing the input description as flexible algorithmic IP. This methodology also reduces design verification time by creating customized verification test benches automatically and by providing a correct-by-construction guarantee for both RTL functionality and timing closure. Lastly, this methodology generates a standard set of interfaces, which reduces the complexity of assembling blocks into an application engine and of final integration into the SoC platform.
The key to PICO's approach is to use an advanced parallelizing compiler in conjunction with an optimized, compile-time configurable architecture template to generate hardware, as shown in Fig. 4.4. The behavioral specification is provided using a subset of ANSI C, along with additional design constraints such as throughput and clock frequency. The RTL design creation can then be viewed as a two step process. In the first step, a retargetable, optimizing compiler analyzes the high level algorithmic input, exposing and exploiting enough parallelism to meet the required throughput. In the second step, an architectural synthesizer configures the architectural template according to the needs of the application and the desired physical design objectives such as cycle-time, routability and cost.

[Figure omitted; it shows an ANSI C algorithm (e.g. FDE) and design constraints (throughput, clock frequency) feeding an application engine synthesis step, built from an advanced parallelizing compiler and a configurable architectural template, which produces Verilog RTL hardware plus SystemC models.]

Fig. 4.4 PICO's approach to high level synthesis

Fig. 4.5 System level design flow using PICO

4.2.2 PICO's Integrated AES Flow

Figure 4.5 shows the overall design flow for creating RTL blocks using PICO. The user provides a C description of their algorithm along with performance requirements and functional test inputs. The PICO system automatically generates the synthesizable RTL, customized test benches, synthesis and simulation scripts, as well as software integration drivers to run on the host processor. The RTL implementation is cost-efficient and is guaranteed to be functionally equivalent to the algorithmic C input description by construction.
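To make the flow concrete, the kind of input described above might look like the following sketch: a single, purely sequential top-level C procedure whose loop nests would each map to a hardware block. The function name, the comments standing in for tool directives, and the 3-tap filter itself are illustrative assumptions, not PICO's actual syntax.

```c
#include <assert.h>

/* Hypothetical top-level procedure in the style of an algorithmic HLS
 * input. One call = one task; the two loop nests below would map to
 * two pipelined hardware blocks. Design constraints such as throughput
 * and clock frequency would be supplied to the tool separately. */
#define N 8

void engine_task(const int in[N], int out[N])
{
    int tmp[N];

    /* Loop nest 1: 3-tap horizontal smoothing filter (edges clamped) */
    for (int i = 0; i < N; i++) {
        int left  = (i == 0)     ? in[0]     : in[i - 1];
        int right = (i == N - 1) ? in[N - 1] : in[i + 1];
        tmp[i] = (left + 2 * in[i] + right) / 4;
    }

    /* Loop nest 2: gain and offset correction */
    for (int i = 0; i < N; i++)
        out[i] = 2 * tmp[i] + 1;
}
```

Note that the specification says nothing about cycles or resources; how many filter taps execute per clock is derived by the tool from the throughput constraint, not from the source code.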
The generated RTL can then be taken through standard simulation, synthesis, place and route tools and integrated into the SoC through automatically configured scripts.

Along with the hardware RTL and its related software, PICO also produces SystemC-based TLM models of the hardware at various levels of abstraction: an untimed programmer's view (PV) and a timed programmer's view (PV+T). The PV model can be easily integrated into the user's virtual SoC platform, enabling fast validation of the hardware functionality and its interfaces in the system context, whereas the PV+T model enables early verification of the performance, the parallelism and the resources used by the hardware in the system context.

The knowledge of the target technology and its design trade-offs is embedded as part of a macrocell library which the PICO system uses as a database of hardware building blocks. This library consists of pre-verified, parameterized, synthesizable RTL components such as registers, adders, multipliers, and interconnect elements that are carefully hand-crafted to provide the best cost-performance tradeoff. These macrocells are then independently characterized for various target technology libraries to obtain a family of cost-performance tradeoff curves for various parametric settings. PICO uses this characterization data for its internal delay and area estimation.

4.2.3 PICO Results and Experience

The PICO Express™ tool incorporating our approach has been used extensively in production environments. Table 4.1 shows a representative set of designs done using PICO Express.

Table 4.1 Some example designs created using PICO Express™

Product | Design | Area | Performance | Time vs. hand design
DVD | Horizontal–vertical filter | 60–49 K gates, 40% smaller than target | Met cycle budget and frequency target | v1: 1 month, v2: 3 days vs. 2–3 months
Digital camera | Pixel scaler | Met the target | Multiple versions designed at different targets | 2–3 weeks; multiple revisions within hours
Set-top box | HD video codec | 200 K gates, 5% smaller than hand design | Same as hand design | <2 months to design and verify
Camcorder | High-perf. video compression | 1 M gates, met the target | Same as hand design | Same design time with significantly less resources
Video processing | Multi-standard deblocking, deringing and chroma conversion | Same as hand design | 30% higher than hand design | 3–4× productivity improvement
Multimedia cell phone | High bandwidth 3G wireless baseband | 400 K gates, same as hand design | Same as hand design | 2 months vs. >9 months
Wireless LAN | LDPC encoder for 802.11n | 60 K gates, 6% over hand design | Same as hand design, low power | <1 month to design and verify

These designs range from a relatively small horizontal–vertical filter for a DVD player with ∼49 K gates to large designs with more than 1 M gates for high performance video compression. In all cases, designs generated using PICO Express met the desired performance targets with an area within 5–10% of the hand design, except in one case where the PICO design had significantly less area. In all cases, PICO Express provided significant productivity improvements, ranging from 3–5× for the initial design to more than 20× for derivative designs. As far as we know, no other HLS tool can handle many of these designs because of their complexity and the amount of parallelism needed to meet performance requirements. Users' experience with PICO Express is described in these papers [7, 8].

4.3 The PICO Technology

In this section, we will describe the key ingredients of the PICO technology that help to meet the application engine design challenges and the requirements of a high level synthesis tool as outlined in Sect. 4.1.
4.3.1 The Programming Model

The foremost goal of PICO has been to make the programming model for designing hardware as simple as possible for a large class of designs. PICO has chosen the C/C++ languages as the preferred mode of input specification at the algorithmic level. The goal is not necessarily to replace Verilog or VHDL as hardware specification languages, but to raise the level of specification to a point where the essential algorithmic content can be easily manipulated and explored without worrying about the details of hardware allocation, mapping, and scheduling decisions.

Another important goal for PICO's programming model is to allow the user to specify the hardware functionality as a sequential program. PICO automatically extracts parallelism from the input specification to meet the desired performance, based on its analysis of program dependences and external resource constraints. However, the functional semantics of the hardware generated still corresponds to the input sequential program. On one hand, this has obvious advantages for understandability and ease of design and debugging; on the other hand, it allows the tool to explore and throttle the parallelism as desired, since the input specification becomes largely independent of performance requirements. This approach also helps in verifying the final design against the input functional specification in an automated way.
[Figure omitted; it contrasts sequential execution of loop nests L1, L2, L3 over time with instruction-level, iteration-level, loop-level, task-level, and combined task- and loop-level parallelism across successive tasks.]

Fig. 4.6 Multiple levels of parallelism exploited by PICO

4.3.1.1 Sources of Parallelism

A sequential programming model may appear to place a severe restriction on the class of hardware one can generate or the kind of parallelism one can exploit in those hardware blocks. However, this is not actually so. A very large class of consumer data-processing applications, such as those in the fields of audio, video, imaging, security, wireless, and networking, can be expressed as sequential C programs that process and transform arrays or streams of data. There is a tremendous amount of parallelism in these applications at various levels of granularity, and PICO is able to exploit them all using various techniques.

As shown in Fig. 4.6, a lot of these applications consist of a sequence of transformations expressed as multiple loop-nests encapsulated in a C procedure that is designated to become hardware. One invocation of this top level C procedure is called one task; it processes one block of data by executing each loop-nest once. This would be followed by the next invocation of the code processing another block of data. PICO, however, converts the C procedure code to a hardware pipeline where each loop-nest executes on a different hardware block.
This enables procedure level task parallelism to be exploited by pipelining a sequence of tasks through this system, increasing overall throughput considerably.

At the loop-nest level, PICO provides a mechanism to express streaming data that is synchronized with a two-way handshake and flow control in the hardware. In the C program, this manifests itself simply as an intrinsic function call that writes data to a stream and another intrinsic function call that reads data from that stream. Streams may be used to communicate data between any pair of loop-nests as long as temporal causality between the production and the consumption of data is maintained during sequential execution. The advantage of the fully synchronized communication in hardware is that the loop-nests can be executed in parallel with local transaction level flow control, which exploits producer-consumer parallelism at the loop level.

Within a single hardware block implementing a loop-nest, PICO exploits iteration level parallelism by doing detailed dependence analysis of the iteration space and transforming the loop-nest to run multiple iterations in parallel, even in the presence of tight recurrences. Subsequently, the transformed loop iteration code is scheduled using software-pipelining techniques that exploit instruction level parallelism to provide the shortest possible schedule while meeting the desired throughput.

4.3.1.2 The Execution Model

Given the parallelism available in consumer applications at various levels, the PICO compiler attempts to exploit this parallelism without violating the sequential semantics of the application. This is accomplished by following the well-defined, parallel execution model of Kahn process networks [9], where a set of sequential processes communicate via streams with block-on-read semantics and unbounded buffering.
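The stream-based communication described above can be sketched in plain C. The intrinsic names stream_put()/stream_get() below are invented for illustration; they stand in for the tool's actual stream intrinsics. In hardware the two loop nests would run concurrently with handshake-based flow control; running them back to back here is legal precisely because production precedes consumption in sequential order.

```c
#include <assert.h>

/* A minimal, single-threaded sketch of stream communication between
 * two loop nests (hypothetical intrinsics, not PICO's real API). */
#define N 8

static int fifo[N];
static int wr, rd;

static void stream_put(int v) { assert(wr < N); fifo[wr++] = v; }
static int  stream_get(void)  { assert(rd < wr); return fifo[rd++]; }

/* Loop nest 1 (producer block): scale the input onto the stream */
static void producer(const int in[N])
{
    for (int i = 0; i < N; i++)
        stream_put(3 * in[i]);
}

/* Loop nest 2 (consumer block): running sum of the stream */
static void consumer(int out[N])
{
    int acc = 0;
    for (int i = 0; i < N; i++) {
        acc += stream_get();
        out[i] = acc;
    }
}
```

Because the consumer blocks on reads and each value is read exactly once, the result is the same whether the loops run sequentially, interleaved, or fully in parallel — the deterministic behavior the execution model relies on.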
Kahn process networks have the advantage that they provide deterministic parallelism, i.e., the computation done by the process network is unchanged under different schedulings of the processes. This property enables PICO to parallelize a sequential program with multiple loop-nests into a Kahn process network implemented in hardware, where each loop-nest computation is performed by a corresponding hardware block that communicates with other such blocks via streams. Since the process network is derived from a sequential program, it still retains the original sequential semantics even under different parallel executions of its hardware blocks. Each hardware block, in turn, runs a statically parallelized implementation of the corresponding loop-nest that is consistent with its sequential semantics, using software-pipelining techniques. In this manner, iteration level and instruction level parallelism are exploited at compile-time within each hardware block, and producer–consumer and task level parallelism are exploited dynamically across blocks without violating the original sequential semantics.

The original formulation of Kahn process networks captured infinite computation using unbounded FIFOs on each of the stream links. However, PICO is able to restrict the size of computation and buffering provided on each link by imposing additional constraints on the execution model. These constraints are described below:

• Single-task execution: Each process in a PICO generated process network is able to execute one complete invocation to completion without restarting. This corresponds to the single task invocation of the top level C procedure in the input specification, where each loop-nest in that procedure executes once and the procedure terminates. In actual hardware execution, multiple tasks may be overlapped in a pipelined manner depending on resource availability.