158 P. Coussy et al.

Fig. 9.9 Operator area vs. sizing approaches: (a) Max(8,4,3,9), a 9×9 multiplier (40 slices); (b) Max(in1, in2), a 9×8 multiplier (34 slices); (c) Best(in1, in2), a 4×9 multiplier (24 slices)

9.3.2.5 Storage Element Optimization

Because there is currently no feedback loop in the design flow, register optimization has to be done during the design of the processing unit. The choice of the location of an unconstrained variable (the user can define the location of variables), in a register or in a memory, has to be made according to the minimization of two contradictory cost criteria:

• The cost of a register is higher than the cost of a memory point.
• The cost of accessing data in a register is lower than the cost of accessing data in memory (because of the need to compute the address).

Two criteria are used to choose the memorization location of a datum:

• A variable whose lifetime is shorter than a locality threshold is stored in a register.
• The memorization location depends on the class of the variable.

Data are classified into three categories:

• Temporary processing data (declared or undeclared).
• Constant data (read-only).
• Ageing data (which serve to express the recursivity of the algorithm to be synthesized, via their assignment after having been used).

The optimal storage of a given data element depends on its declaration and its lifetime. It can be stored either in a memory bank of the MEMU or in a storage element of the processing unit PU. The remaining difficulty lies in selecting an optimal locality threshold that minimizes the cost of the storage unit. The synthesis tool leaves the choice of the locality threshold value up to the user. To help the designer, GAUT proposes a histogram of the lifetimes of the variables, normalized by the utilization frequency, which is computed from the scheduled DFG.
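The threshold-based placement decision can be sketched as follows. This is our own illustration, not GAUT's implementation; the data structures and names are assumptions.

```python
# Sketch (not GAUT's code): placing each unconstrained variable in a
# register or in memory by comparing its lifetime, taken from the
# scheduled DFG, against a user-chosen locality threshold.

def place_variables(lifetimes, threshold):
    """lifetimes: {name: (birth_date, death_date)} in cycles."""
    placement = {}
    for name, (birth, death) in lifetimes.items():
        # Short-lived data stays close to the operators, in a register;
        # long-lived data goes to a memory bank of the MEMU.
        placement[name] = "register" if death - birth < threshold else "memory"
    return placement

lifetimes = {"t0": (0, 2), "t1": (1, 9), "coef": (0, 20)}
print(place_variables(lifetimes, threshold=5))
# {'t0': 'register', 't1': 'memory', 'coef': 'memory'}
```

The lifetime histogram mentioned above is precisely what lets the user pick a sensible value of `threshold`.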
The architecture of the processing unit is composed of a processing part, a memory part (i.e. the memory plan) and the associated control state machine (FSM) (Fig. 9.1). The memory part of the datapath is based on a set of strong semantic memories (FIFOs, LIFOs) and/or registers. Spatial adaptation is performed by interconnection logic dealing with data dispatching from operators to storage elements, and from storage elements to operators. Timing adaptation (data rates, different input/output data scheduling) is realized by the storage elements. Once the location of data has been decided, the synthesis of the storage elements located in the PU is done.

9 GAUT: A High-Level Synthesis Tool for DSP Applications 159

Fig. 9.10 Four-step flow: RCG construction, binding, optimization, generation
Fig. 9.11 Resource compatibility graph

This design step inputs the data lifetimes resulting from the scheduling step and the spatial information resulting from the binding step of the DFG. The spatial information is the source and destination of each datum. First, we formalize both the timing relationships between data (derived from the data lifetimes) and the spatial information through a Resource Compatibility Graph (RCG). This formal model is then used to explore the design space. We refer to the timing relationships and spatial information together as communication constraints.

This synthesis task is based on a four-step flow: (1) Resource Compatibility Graph (RCG) construction, (2) storage resource binding, (3) architecture optimization and (4) VHDL RTL generation (see Fig. 9.10). During the first step of the component generation, a Resource Compatibility Graph is generated from the communication constraints. The analysis of this formal model allows both the binding of data to storage elements (queue, stack or register) and the sizing of each storage element. This first architecture is then optimized by merging storage elements that have non-overlapping usage time frames.
Formal model: In order to explore the design space of such a component, the first step consists in generating a Resource Compatibility Graph from the communication constraints. The RCG specifies, through formal modeling, the timing relationships between the data that have to be handled by the datapath architecture. The vertex set V = {v_0, ..., v_n} represents the data; the edge set E = {(v_i, v_j)} represents the compatibility between data. A tag t_ij ∈ T is associated with each edge (v_i, v_j). This tag represents the compatibility type between the two data i and j, with T = {Register R, FIFO F, LIFO L} (e.g. Fig. 9.11).

In order to assign compatibility tags to edges, we need to identify the timing relationship that exists between two data. For this purpose we defined a set of rules based on the functional properties of each storage element (FIFO, LIFO, register). The lifetime of a datum a is defined by Γ(a) = [τ_min(a), τ_max(a)], where τ_min(a) and τ_max(a) are, respectively, the date of the write access of a into the storage element and the date of the last read access to a. τ_first(a) is the date of the first read access to a, and τ_Ri(a) is the date of the i-th read access to a, with first ≤ i ≤ max.

Rule 1 (Register compatibility): If τ_min(b) ≥ τ_max(a), then we create a "Register" tagged edge.

Rule 2 (FIFO compatibility): If τ_min(b) > τ_min(a) and τ_first(b) > τ_max(a) and τ_min(b) < τ_max(a), then we create a "FIFO" tagged edge.

Rule 3 (LIFO compatibility): If [τ_min(b) > τ_min(a) and τ_first(a) > τ_max(b)] or [τ_Ri(a) < τ_min(b) < τ_max(b) < τ_Ri+1(a)], then we create a "LIFO" tagged edge.

Rule 4: Otherwise, no edge is created (no compatibility).

An analysis of the communication constraints enables the RCG generation. The graph construction creates edges between data in chronological order of τ_min. If n is the number of data to be handled, the graph may contain n(n − 1)/2 edges, i.e. O(n²).
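The four tagging rules can be expressed directly on the access dates. The following sketch is ours (GAUT is not written this way); each datum carries its write date τ_min and its sorted read dates, from which τ_first and τ_max follow.

```python
# Sketch of the edge-tagging rules: returns the compatibility tag for
# the (a, b) pair, or None (Rule 4).

def tag(a, b):
    """a, b: dicts with 'tmin' (write date) and 'reads' (sorted read dates)."""
    a_first, a_max = a["reads"][0], a["reads"][-1]
    b_first, b_max = b["reads"][0], b["reads"][-1]
    # Rule 1: b is written only after the last read of a (Register).
    if b["tmin"] >= a_max:
        return "R"
    # Rule 2: writes and reads happen in the same order, lifetimes overlap.
    if b["tmin"] > a["tmin"] and b_first > a_max and b["tmin"] < a_max:
        return "F"
    # Rule 3: the lifetime of b is nested inside the lifetime of a...
    if b["tmin"] > a["tmin"] and a_first > b_max:
        return "L"
    # ...or nested between two consecutive reads of a.
    if any(r1 < b["tmin"] < b_max < r2
           for r1, r2 in zip(a["reads"], a["reads"][1:])):
        return "L"
    return None  # Rule 4: no compatibility
```

For example, a written at 0 and read at {2, 4} is FIFO compatible with b written at 1 and read at {5, 6}, but LIFO compatible with b written at 1 and read at {2, 3} when a is only read at 6.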
Storage element binding: The second step consists in binding storage elements to data using the timing relations modeled by the RCG.

Resource identification: The aim is to identify and bind as many FIFO or LIFO structures as possible on the RCG.

Theorem 1. If a is FIFO compatible with b and b is FIFO compatible with c, then a is transitively FIFO (or Register) compatible with c.

As a consequence of Theorem 1, a FIFO compatible data path P_F is by construction equivalent to a FIFO compatibility clique (i.e. the data of the P_F path can be stored in the same FIFO).

Theorem 2. If a is LIFO compatible with b and b is LIFO compatible with c, then a is transitively LIFO compatible with c.

As a consequence of Theorem 2, a LIFO compatible data path P_L is by construction equivalent to a LIFO compatibility clique (i.e. the data of the P_L path can be stored in the same LIFO).

Resource sizing: The size of a LIFO structure equals the maximum number of data stored by a LIFO compatible data path. We therefore identify the longest LIFO compatibility path P_L in a LIFO compatibility tree; the number of vertices in P_L then equals the maximum number of data that can be stored in the LIFO.

Fig. 9.12 A possible binding for the graph: (a) resulting hierarchical graph, (b) resulting constraints

The size of a FIFO is the maximum number of data (of the considered path) stored at the same time in the structure. In fact, the aim is to count the maximum number of overlapping data (respecting the I/O constraints) in the selected path P. These sizes can be easily extracted from our formal model.
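The FIFO sizing amounts to a classic maximum-overlap count over the lifetimes of the path's data. A minimal sketch (ours, not GAUT's code), assuming lifetimes are given as [τ_min, τ_max] intervals:

```python
# Sketch: size a FIFO as the maximum number of simultaneously live data
# on the bound path.

def fifo_size(lifetimes):
    """lifetimes: list of (tmin, tmax) pairs for the data of the path."""
    events = []
    for tmin, tmax in lifetimes:
        events.append((tmin, 1))   # datum enters the FIFO at its write date
        events.append((tmax, -1))  # datum leaves after its last read
    depth = best = 0
    # At equal dates, departures (-1) are processed before arrivals (+1).
    for _, delta in sorted(events):
        depth += delta
        best = max(best, depth)
    return best

print(fifo_size([(0, 4), (1, 5), (2, 3)]))  # 3
```

The LIFO size, by contrast, is simply the vertex count of the longest LIFO compatibility path, as stated above.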
Resource binding: Our greedy algorithm is based on user-provided metrics (minimal amount of data to use a FIFO or a LIFO, average use factor, FIFO/LIFO usage priority factor) to bind as many FIFO or LIFO structures as possible on the RCG. A two-step flow is used: (1) identification of the best structure, (2) merging of all the concerned data into a hierarchical node.

Each node represents a storage element, as shown in Fig. 9.12a (e.g. data a, b and f are merged into a three-stage FIFO). We say hierarchical node because merging a set of data into a given node supposes adding information that will be useful during the optimization step: the lifetime of the structure, i.e. the time interval during which it will be used (e.g. Fig. 9.12b). Let P = {v_0, ..., v_n} be a compatible data path:

• If P is a FIFO compatible path, the structure lifetime will be [τ_min(v_0), τ_max(v_n)].
• If P is a LIFO compatible path, the structure lifetime will be [τ_min(v_0), τ_max(v_0)].

Storage element optimization: The goal of this final task is to maximize storage resource usage, in order to optimize the resulting architecture by minimizing the number of storage elements and the number of structures to be controlled. To tackle this problem, we build a new hierarchical RCG using the merged nodes and their lifetimes. In order to avoid any conflict, the exploration algorithm of the optimization step only searches for Register compatibility paths between vertices of the same type. When two structures of the same type are Register compatible, they can be merged. Let P = {v_0, ..., v_n} be a Register compatible data path:

• The lifetime of the resulting hierarchical merged structure will be [τ_min(v_0), τ_max(v_0)] ∪ ... ∪ [τ_min(v_n), τ_max(v_n)].

The algorithm is very similar to the one used during the binding step. When no more merging is possible, the resulting graph is used to generate the RTL VHDL
architecture.

Fig. 9.13 Optimization of the Fig. 9.11 graph

Figure 9.13 is a possible architectural solution for the Resource Compatibility Graph presented in Fig. 9.11. Here, the resulting architecture consists of a three-stage FIFO that handles three data and a two-stage FIFO that handles three data: one memory place has been saved.

9.3.3 Memory Unit Synthesis

In this section, we present two major features of GAUT regarding the memory system. First, the data distribution and placement are formalized as a set of constraints for the synthesis. We introduce a formal model for the memory accesses, and an accessibility criterion to enhance the scheduling step. Next, we propose a new strategy to implement signals described as ageing vectors in the algorithm. We formalize the maturing process and explain how it may generate memory conflicts over several iterations of the algorithm. The final compatibility graph indicates the set of valid mappings for every signal. Our scheduling algorithm exhibits a relatively low complexity, which allows us to tackle complex problems in a reasonable time.

9.3.3.1 Memory Constrained Scheduling

In our approach, the data flow graph (DFG) first generated from the algorithmic specification is parsed and a memory table is created. This memory table is completed by the designer, who can select the variable implementation (memory or register) and place the variable in the memory hierarchy (i.e. choose the bank). The resulting table is the memory mapping that will be used in the synthesis. It presents all the data vertices of the DFG.

The data distribution can be static or dynamic. In the case of a static placement, the data remain at the same place during the whole execution. If the placement is dynamic, data can be transferred between different levels of the memory hierarchy. Thus, several data can share the same location in the circuit memory.
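A memory table of this kind might look as follows. This is a hypothetical format for illustration only; the chapter does not specify the file syntax, and the variable names and fields here are our assumptions.

```python
# Sketch (hypothetical format): a memory mapping table as the designer
# might complete it, one entry per data vertex of the DFG.

memory_mapping = {
    # variable: (implementation, bank or None, placement policy)
    "x":   ("memory",   "bank1", "static"),
    "h":   ("memory",   "bank2", "static"),
    "acc": ("register", None,    "static"),
    "tmp": ("memory",   "bank1", "dynamic"),  # may be transferred / share a location
}

# With a dynamic placement, two data may share one memory location as
# long as their residence intervals in that bank do not overlap.
registers = [v for v, (impl, _, _) in memory_mapping.items() if impl == "register"]
print(registers)  # ['acc']
```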
The memory mapping file explicitly describes the data transfers that occur during the algorithm execution. Direct Memory Access (DMA) directives will be added to the code to achieve these transfers.

The definition of the memory architecture is performed in the first step of the overall design flow. To achieve this task, advanced compilers such as the Rice HPF compiler, Illinois Polaris or Stanford SUIF could be used [14]. Indeed, these compilers automatically perform data distribution across banks, determine which access goes to which bank, and then schedule accesses to avoid bank conflicts. The Data Transfer and Storage Exploration (DTSE) method from IMEC and the associated tools (ATOMIUM, ADOPT) are also a good means to determine a convenient data mapping [15].

Fig. 9.14 Memory constraint graphs for samples x[0] to x[3] and coefficients h[0] to h[3]

We modified the original priority list (see Sect. 9.3.2.2) to take the memory constraint into account: an accessibility criterion is used to determine whether the data needed by an operation are available, that is to say, whether the memory where they are stored is free. Operations are still listed according to the mobility and bit-width criteria, but all operations that do not match the accessibility criterion are removed. An operation that needs to access a busy memory will not be scheduled, whatever its priority level. Fictive memory access operators are added (one access operator per access port of a memory). The memory is accessible only if one of its access operators is idle. Memory access operators are represented by tokens on the Memory Constraint Graph (MCG): there are as many tokens as access ports to the memory or bank. Figure 9.14 shows two MCGs, for signal samples x[0] to x[3] stored in bank 1 and coefficients h[0] to h[3] stored in bank 2 (in the case of a four-point convolution filter, for instance).
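The accessibility filter applied to the priority list can be sketched as follows; this is a simplified illustration of the idea (the operation and bank names are invented), not GAUT's scheduler.

```python
# Sketch: filter the priority-ordered candidate list with the
# accessibility criterion. An operation is schedulable in the current
# cycle only if every bank it reads still has a free access token.

def schedulable(candidates, bank_ports, free_alus):
    """candidates: ops already sorted by (mobility, bit-width) priority;
    each op is (name, banks_read). bank_ports: free tokens per bank."""
    scheduled = []
    for name, banks in candidates:
        if len(scheduled) == free_alus:
            break
        # Accessibility criterion: skip ops whose banks are busy,
        # regardless of their priority level.
        if all(bank_ports.get(b, 0) > 0 for b in banks):
            for b in banks:
                bank_ports[b] -= 1   # consume one access token
            scheduled.append(name)
    return scheduled

ports = {"bank1": 1, "bank2": 1}
print(schedulable([("mul1", ["bank1", "bank2"]),
                   ("mul2", ["bank1"]),
                   ("add1", [])], ports, free_alus=2))
# ['mul1', 'add1']  -- mul2 is skipped: bank1's single port is busy
```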
If one bank is being accessed, one token is placed on the corresponding datum. Only one token is allowed for a one-port bank. Dotted edges indicate which following access will be the fastest: in the case of a DRAM, for instance, slower random accesses are indicated with plain edges and faster sequential accesses with dotted edges. Our scheduling algorithm always favors the fastest sequence of accesses whenever it has the choice.

9.3.3.2 Implementing Ageing Vectors

Signals are the input and output flows of the applications. A mono-dimensional signal x is a vector of size n if n values of x are needed to compute the result. Every cycle, a new value of x (x[n+1]) is sampled on the input, and the oldest value of x (x[0]) is discarded. We call x an ageing, or maturing, vector. Ageing vectors are stored in RAM.

A straightforward way to implement the maturing of a vector in hardware is to always write its new value at the same address in memory, at the end of the vector in the case of a 1D signal for instance. Obviously, this requires shifting every other value of the signal in the memory to free the place for the new value. This shifting necessitates n reads and n writes, which is very time and power consuming. In GAUT, the new value is stored at the address of the oldest one in the
vector. Only one write is needed. Obviously, the address generation is more difficult in this case, because the addresses of the samples used in the algorithm change from one cycle to the next. Figure 9.15 represents the evolution of the addresses for an L = 4 point signal x from one iteration to the next.

Fig. 9.15 Logical address evolution for signal x
Fig. 9.16 LAG, AG and USG

The methodology that we propose to support the synthesis of these complex logical address generators is based on three graphs (see Fig. 9.16). The Logical Address Graph (LAG) traces the evolution of the logical addresses of a vector during the execution of one iteration of the algorithm. Each vertex corresponds to the logical address where a sample of signal x is to be accessed. Edges are weighted with two numbers.
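The evolution of Fig. 9.15 can be reproduced with a one-line address computation. This is our own sketch, assuming L = 4 and a shift of one position per iteration as in the figure:

```python
# Sketch checking the address table of Fig. 9.15 (L = 4): at each
# iteration the new sample overwrites the oldest one, so the logical
# address of algorithm sample x[j] shifts down by one position modulo L.

L = 4

def address(j, iteration):
    # Logical address of x[j] at the given iteration.
    return (j - iteration) % L

for it in range(4):
    print(it, [address(j, it) for j in range(L)])
# 0 [0, 1, 2, 3]
# 1 [3, 0, 1, 2]
# 2 [2, 3, 0, 1]
# 3 [1, 2, 3, 0]
```

The per-iteration shift (one position here) is what the text formalizes as the ageing factor k.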
The first number, f_ij, indicates how the logical address evolves between two successive accesses to vector x: f_ij = (j − i) mod L. The second number, g_ij, indicates the number of iterations between those two successive accesses. To actually calculate the evolution of the logical addresses of x from one iteration to the next, we must take into account the ageing of vector x. We introduce the ageing factor k as the difference between the logical address of element x[j] at iteration i and the logical address of element x[j] at iteration i+1, so that:

@x[j]_{i+1} = (@x[j]_i − k) mod L.

In our example, k = 1. The Ageing Graph (AG, Fig. 9.16) is another representation of this equation. We finally combine the LAG and the ageing factor to get the Unified Sequences Graph (USG) (Fig. 9.16). A detailed definition of these three graphs may be found in [16]. By moving a token in the USG, and by adding to the first logical address of x the value of the weight f_ij minus the ageing factor k, we get the address sequence for x during the complete execution of the algorithm. Then, the corresponding address generator is generated.

If a pipelined architecture is synthesized, the ageing factor k is multiplied by the number of pipeline slices, and as many tokens as pipeline slices are placed and moved in the USG. Of course, as many memory locations as supplemental tokens in the USG must be added to guarantee data consistency. Concurrent accesses to elements of vector x may appear in a pipelined architecture. While moving tokens in the USG, a Concurrent Accesses Graph is constructed. This graph is finally colored to obtain the number of memory banks needed to support the access concurrency.

9.3.4 Communication and Interface Unit Synthesis

9.3.4.1 Latency Insensitive Systems

Systems on a chip (SoCs) are compositions of several sub-systems exchanging data.
The increase in SoC size is such that an efficient and reliable interconnection strategy is now necessary to combine sub-systems and preserve, at an acceptable design cost, the speed performance that current very deep sub-micron technologies allow [20]. This communication requirement can be satisfied by a LIS communication network between hardware components. The LIS methodology enables functionally correct SoCs to be built by (1) promoting the intensive reuse of pre-developed components (IPs), (2) segmenting inter-component interconnects with relay stations to break critical paths and (3) making components robust to data stream latencies by encapsulating them into synchronization wrappers. These encapsulated blocks are called "patient processes". Patient processes [21] are a key element of the LIS theory. They are suspendable synchronous components (named pearls) encapsulated into a wrapper (named shell) whose function is to make them insensitive to I/O latency and to drive their clock. The decision whether to drive the component's clock is implemented with combinational logic. The LIS approach relies on a simplifying, but restrictive, assumption: a component is activated only if all its inputs are valid and all its outputs are able to store a result produced at the next clock cycle.

However, it is frequent that only a subset of the inputs and outputs is necessary to execute one step of computation in a synchronous block. To limit the sensitivity of the patient process to a subset of the inputs and outputs, the authors of [22] suggest replacing the combinational logic that drives the clock by a Mealy-type FSM. This FSM tests the state of only the relevant inputs and outputs at each cycle, and drives the component clock only when they are all ready. The major drawbacks of FSMs are their difficult synthesis and large silicon area when communication scenarios are long and complex, as for computation-intensive digital signal processing applications.
To reduce the hardware cost, in [23] the static schedule of component activations is implemented with shift registers whose contents drive the component's clock. This approach relies on the hypothesis that there are no irregularities in the data streams: it is never necessary to randomly freeze the components.

9.3.4.2 Proposed Approach

Since (1) the LIS methodology lacks the ability to dynamically sense I/O subsets, (2) FSMs can become too large as the communication bandwidth grows, and (3) shift-register-based synchronization targets only extremely fast environments, we propose to encapsulate hardware components into a new synchronization wrapper model whose area is much smaller than that of FSM-based wrappers, whose speed is enhanced (mostly thanks to the area reduction), and whose synthesizability is guaranteed whatever the communication schedule. The solution we propose is functionally equivalent to the FSMs. It is a specific processor that cyclically reads and executes operations stored in a memory. We name it a "synchronization processor" (SP).

Figure 9.1 specifies the new synchronization wrapper structure with our SP. The SP communicates with the LIS ports through FIFO-like signals. These signals are formally equivalent to the voidin/out and stopin/out of [19] and the valid, ready and stall of [22]. The number of input and output ports is arbitrary. The SP drives the component's clock with the enable signal. The SP model is specified by a three-state FSM: a reset state at power-up, an operation-read state, and a free-run state. This FSM is concurrent with the component and contains a datapath: it is a "concurrent FSM with datapath" (CFSMD). An operation's format is the concatenation of an input mask, an output mask and a free-run cycle count. The masks specify, respectively, the input and output ports the FSM is sensitive to. The free-run cycle count represents the number of clock cycles the component can execute until the next synchronization point.
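The operation word and its use can be sketched as follows. The chapter gives the format (input mask | output mask | free-run cycle count) but not the field widths, so the 4-bit masks and 8-bit count below are assumptions of ours.

```python
# Sketch of a synchronization-processor operation word (assumed widths:
# 4-bit input mask, 4-bit output mask, 8-bit free-run cycle count).

IN_BITS, OUT_BITS, RUN_BITS = 4, 4, 8

def decode(word):
    run  = word & ((1 << RUN_BITS) - 1)
    outm = (word >> RUN_BITS) & ((1 << OUT_BITS) - 1)
    inm  = (word >> (RUN_BITS + OUT_BITS)) & ((1 << IN_BITS) - 1)
    return inm, outm, run

def can_fire(inm, outm, inputs_valid, outputs_ready):
    # The wrapper enables the component clock only when every input
    # selected by the input mask is valid and every selected output
    # can accept the produced result.
    ok_in  = all(inputs_valid[i]  for i in range(IN_BITS)  if (inm  >> i) & 1)
    ok_out = all(outputs_ready[o] for o in range(OUT_BITS) if (outm >> o) & 1)
    return ok_in and ok_out

# Wait on inputs 0 and 1 and output 0, then free-run for 16 cycles:
op = (0b0011 << (RUN_BITS + OUT_BITS)) | (0b0001 << RUN_BITS) | 16
inm, outm, run = decode(op)
print(inm, outm, run)  # 3 1 16
```

Unselected ports are simply ignored, which is exactly the sensitivity-to-a-subset behavior that plain LIS wrappers lack.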
To avoid unnecessary signals and save area, the memory is an asynchronous ROM (or an SRAM on FPGAs), and its interface with the SP is reduced to two buses: the operation address and the operation word. The execution of the program is driven by an operation read-counter incremented modulo the memory size.

9.4 Experiments

Synthesis results for Viterbi decoders are presented in this section. Results are based on the Virtex-E FPGA technology of the hardware prototyping platform that we used, which we present first.

9.4.1 The Hardware Platform

The Sundance platform [24] that we used as an experimental support is composed of the latest generation of C6x DSPs and Virtex FPGAs. Communications between the different functional blocks are implemented with high-throughput SDB links [24]. We have automated the generation of the communication interfaces for software and hardware components, which frees the user from designing them. At the hardware level, the communication between computing nodes is handled by four-phase handshaking protocols and decoupling FIFOs. The handshaking protocols synchronize computation with communication, and the FIFOs store data in order to overcome potential data flow irregularities. Handshaking protocols are used to communicate seamlessly either between hardware nodes or between hardware and software nodes. They are automatically refined by the GAUT tool to fit the selected (SDB) inter-node platform communication interfaces (bus width, signal names, etc.).

To complete the software code generation, platform-specific code has to be written to ensure the communication between processing elements. The communication drivers of the targeted platform are called inside the interface functions introduced in the macro-architecture model through an API mechanism. We provide a specific class for each type of link available on the platform.
9.4.2 Synthesis Results

The Viterbi algorithm is applicable to a variety of decoding and detection problems that can be modeled by a finite-state discrete-time Markov process, such as convolutional and trellis decoding in digital communications [25]. Based on the received symbols, the Viterbi algorithm estimates the most likely state sequence, according to an optimization criterion such as the a posteriori maximum likelihood criterion, through a trellis which generally represents the behavior of the encoder. The generic C description of the Viterbi algorithm allowed us to synthesize architectures using different values for the following functional parameters: state number and throughput. A part of the synthesis results that have been obtained is given in Fig. 9.17. For each generated architecture, the table presents the throughput constraint and the complexity of both the algorithm (number of operations) and the generated architecture (number of logic elements). In the particular case of the DVB-DSNG Viterbi decoder (64 states), different throughput constraints (from 1 to 50 Mbps) have been tested. Figure 9.18 presents the synthesis results.

State number               8     16    32    64    128
Throughput (Mbps)          44    39    35    26    22
Number of operations       50    94    182   358   582
Number of logic elements   223   434   1130  2712  7051

Fig. 9.17 Synthesis results for different Viterbi decoders