18.1 Retiming: Concepts, Algorithm, and Restrictions

Part of the document Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation (pages 415-419)

Part III: Mapping Designs to Reconfigurable Platforms


The goal of retiming is to move the pipeline registers in a design into the optimal position. Figure 18.1 shows a trivial example. In this design (Figure 18.1(a)), the nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5, and the input and output registers cannot be moved. Figure 18.1(b) shows the same graph after retiming. The critical path is reduced from 5 to 4, but the I/O semantics have not changed, as three cycles are still required for a datum to proceed from input to output.

As can be seen, the initial design has a critical path of 5 between the internal register and the output. If the internal register could be moved forward, the critical path would be shortened to 4. However, the feedback loop would then be incorrect. Thus, in addition to moving the register forward, another register would need to be added to the feedback loop, resulting in the final design.

Additionally, even if the last node were removed, the design could never achieve a critical path lower than 4 because of the feedback loop. There is no mechanism that can reduce the critical path of a single-cycle feedback loop by moving registers: only additional registers can speed up such a design.
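This effect can be checked mechanically. The sketch below is illustrative, not from the text: it models a register-weighted graph (node delays, edge register counts are all assumed names) and computes the critical path as the longest register-free path, summing node delays. Moving a register shortens the longest combinational segment, but a segment inside a single-cycle loop cannot be split this way.

```python
from functools import lru_cache

def critical_path(delays, edges):
    """Longest purely combinational path: walk only edges carrying zero
    registers and sum the node delays along the way."""
    zero = {}
    for (u, v), regs in edges.items():
        if regs == 0:
            zero.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        # Delay accumulated from u up to the next register boundary.
        return delays[u] + max((longest_from(v) for v in zero.get(u, [])),
                               default=0)

    return max(longest_from(u) for u in delays)

# A toy chain: in -> a(3) -> b(2) -> out, with mandatory I/O registers.
delays = {"in": 0, "a": 3, "b": 2, "out": 0}
before = {("in", "a"): 1, ("a", "b"): 0, ("b", "out"): 1}
after  = {("in", "a"): 0, ("a", "b"): 1, ("b", "out"): 1}  # register moved

print(critical_path(delays, before))  # 5: the a->b segment is unbroken
print(critical_path(delays, after))   # 3: the moved register splits it
```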

Retiming's objective is to automate this process: for a graph representing a circuit, with combinational delays as nodes and integer register weights on the edges, find a new assignment of edge weights that meets a targeted critical path, or fail if that critical path cannot be met. Leiserson's retiming algorithm is guaranteed to find such an assignment, if one exists, that both minimizes the critical path and ensures that the number of registers around every loop in the design remains the same. It is this second constraint, ensuring that all feedback loops are unchanged, that guarantees retiming does not change the semantics of the circuit.

FIGURE 18.1  A small graph before retiming (a) and the same graph after retiming (b). Node labels are combinational delays; the critical path is 5 in (a) and 4 in (b).

TABLE 18.1  The constraint system used by the retiming process

    Condition                                  Constraint
    Normal edge from u to v                    r(u) − r(v) ≤ w(e)
    Edge from u to v must be registered        r(u) − r(v) ≤ w(e) − 1
    Edge from u to v can never be registered   r(u) − r(v) ≤ 0 and r(v) − r(u) ≤ 0
    Critical paths must be registered          r(u) − r(v) ≤ W(u,v) − 1 for all u, v
                                               such that D(u,v) > P

In Table 18.1, r(u) is the lag computed for each node (used to determine the final number of registers on each edge), w(e) is the initial number of registers on an edge, W(u,v) is the minimum number of registers between u and v, and D(u,v) is the critical path between u and v.
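The rows of Table 18.1 translate directly into pairwise difference constraints of the form r(u) − r(v) ≤ c. A minimal sketch (the function name and data structures are assumptions, not from the text): edges map (u, v) to w(e), and W and D are the precomputed matrices.

```python
def table_constraints(edges, W, D, P):
    """Emit (u, v, c) tuples meaning r(u) - r(v) <= c, per Table 18.1."""
    cons = [(u, v, w) for (u, v), w in edges.items()]    # normal edges
    cons += [(u, v, W[u, v] - 1)                         # register every path
             for (u, v), d in D.items() if d > P]        # whose delay exceeds P
    return cons

# One edge with one register whose endpoints lie on a path of delay 5:
# with target P = 4, the path rule demands r(u) - r(v) <= W(u,v) - 1 = 0.
cons = table_constraints({("a", "b"): 1},
                         W={("a", "b"): 1}, D={("a", "b"): 5}, P=4)
print(cons)  # [('a', 'b', 1), ('a', 'b', 0)]
```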

Leiserson’s algorithm takes the graph as input and then adds an additional node representing the external world, with appropriate edges added to account for all I/Os. This additional node is necessary to ensure that the circuit’s global I/O semantics are unchanged by retiming.

Two matrices are then calculated, W and D, which represent the number of registers and the critical path between every pair of nodes in the graph. These matrices are necessary because retiming operates by ensuring that at least one register exists on every path whose delay exceeds the target critical path.

Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers that will be placed on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while minor design constraints are imposed on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson's original paper [4].

The algorithm works as follows:

1. Start with the circuit as a directed graph. Every node represents a computational element, with each element having a computational delay. Each edge can have zero or more registers as a weight w. Add an additional dummy node with 0 delay, with an edge from every output and to every input. This additional node ensures that the number of registers from every input to every output is unchanged, and therefore that the data input-to-output timing is unaffected.

2. Calculate W and D. D is the critical path from every node to every other node, and W is the initial number of registers along this path. This requires solving the all-pairs shortest-path problem; using Dijkstra's algorithm from every node requires O(n² lg n) time, which dominates the asymptotic running time of the algorithm.

3. Choose a target critical path and create the constraints, as summarized in Table 18.1. Each node has a lag value r, which will eventually specify the change in the number of registers between each node. Initialize all nodes to have a lag of 0.

4. Since all constraints are pairwise integer inequalities, the Bellman–Ford constraint solver is guaranteed to find a solution if one exists or to terminate if not. The Bellman–Ford algorithm performs N iterations (N = the number of constraints to solve). In each iteration, every constraint is examined. If a constraint is already satisfied, nothing happens. Otherwise, r(u) or r(v) is decremented to meet the particular constraint. Once an iteration occurs where no values change, the algorithm has found a solution. If there is no solution, after N iterations the algorithm terminates with a failure.

5. If the constraint solver fails to find a solution, or a tighter critical path is desired, choose a new critical path and return to step 3.

6. With the final set of constraints, a new register count is computed for each edge e from u to v: w_r(e) = w(e) − r(u) + r(v).
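The steps above can be sketched end to end. This is an illustrative reconstruction, not the text's code: wd_matrices computes W and D with a Floyd–Warshall pass (a simpler stand-in for the Dijkstra-based step 2), the solver relaxes the difference constraints exactly as step 4 describes, and the dummy node of step 1 is assumed to be folded into the edge list already (here as a zero-weight edge from output back to input). All names are assumptions.

```python
def wd_matrices(nodes, delays, edges):
    """W(u,v): fewest registers on any u->v path; D(u,v): largest total
    delay among the paths achieving W(u,v). Computed as a lexicographic
    shortest path over (registers, -delay); this is valid because every
    legal cycle carries at least one register."""
    INF = float("inf")
    best = {(u, v): (INF, 0) for u in nodes for v in nodes}
    for u in nodes:
        best[(u, u)] = (0, 0)
    for (u, v), w in edges.items():
        best[(u, v)] = min(best[(u, v)], (w, -delays[u]))
    for k in nodes:                      # Floyd-Warshall relaxation
        for u in nodes:
            for v in nodes:
                w1, d1 = best[(u, k)]
                w2, d2 = best[(k, v)]
                best[(u, v)] = min(best[(u, v)], (w1 + w2, d1 + d2))
    W, D = {}, {}
    for (u, v), (w, negd) in best.items():
        if w < INF:
            W[(u, v)], D[(u, v)] = w, -negd + delays[v]
    return W, D

def bellman_ford(nodes, cons):
    """Solve r[u] - r[v] <= c for every (u, v, c), or return None."""
    r = {n: 0 for n in nodes}
    for _ in range(len(nodes) + 1):
        changed = False
        for u, v, c in cons:
            if r[u] - r[v] > c:
                r[u] = r[v] + c          # decrement r(u) to satisfy it
                changed = True
        if not changed:
            return r
    return None                          # unsatisfiable

def retime(nodes, delays, edges, P):
    """Return retimed weights w_r(e) = w(e) - r(u) + r(v), or None."""
    W, D = wd_matrices(nodes, delays, edges)
    cons = [(u, v, w) for (u, v), w in edges.items()]          # w_r >= 0
    cons += [(u, v, W[u, v] - 1) for (u, v), d in D.items() if d > P]
    r = bellman_ford(nodes, cons)
    if r is None:
        return None
    return {(u, v): w - r[u] + r[v] for (u, v), w in edges.items()}

# Ring: mandatory I/O registers plus a zero-weight host edge out->in.
nodes = ["in", "a", "b", "out"]
delays = {"in": 0, "a": 3, "b": 2, "out": 0}
edges = {("in", "a"): 1, ("a", "b"): 0, ("b", "out"): 1, ("out", "in"): 0}
print(retime(nodes, delays, edges, P=3))
# {('in', 'a'): 0, ('a', 'b'): 1, ('b', 'out'): 1, ('out', 'in'): 0}
```

Note that the register count around the loop is 2 both before and after, and the input-to-output register count is unchanged, as the correctness constraints require.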

A graphical example of the algorithm’s results is shown in Figure 18.1. The initial graph has a critical path of 5, which is clearly nonoptimal. After retiming, the graph has a critical path of 4, but the I/O semantics have not changed, as any input will still require three cycles to affect the output. To determine whether a critical path P can be achieved, the retiming algorithm creates a series of constraints to calculate the lag on each node (Table 18.1).

The primary constraints ensure correctness: no edge will have a negative number of registers, while every cycle will always contain the original number of registers. All I/O passes through the intermediate node, ensuring that input and output timings do not change. These constraints can be modified so that a particular line will contain no registers, or a mandatory minimum number of registers, to meet architectural constraints without changing the complexity of the equations. But it is the final constraint, that all critical paths above a predetermined delay P are registered, that gives this optimization its effectiveness.

If the constraint system has a solution, the new lag assignments for all nodes will allocate registers properly to meet the critical path P. But if there is no solution, there cannot be an assignment of registers that meets P. Thus, the common usage is to find the minimum P where the constraints are all met.

In general, multiple constraint-solving attempts are made to search for the minimum critical path P. The register assignment produced by the constraints at that minimum P is the final retimed design.
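Because the achievable periods are limited to the path-delay values D(u,v) (a property established in Leiserson's original formulation), the search can binary-search the sorted distinct entries of D rather than sweep arbitrary targets. A sketch with assumed names; the feasibility oracle is passed in, and in practice it would run steps 3 and 4 for the candidate P:

```python
def min_feasible_period(candidates, feasible):
    """Binary search over sorted candidate periods. Feasibility is
    monotone in P (a looser target only relaxes constraints), so the
    smallest satisfiable candidate is well defined."""
    lo, hi = 0, len(candidates) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(candidates[mid]):
            best = candidates[mid]       # feasible: try something tighter
            hi = mid - 1
        else:
            lo = mid + 1                 # infeasible: relax the target
    return best

# Stand-in oracle: pretend any period of at least 4 is achievable.
print(min_feasible_period([2, 3, 4, 5, 6], lambda p: p >= 4))  # 4
```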

There are two ways to speed up this process. First, if the Bellman–Ford algorithm can find a solution, it usually converges very quickly. Thus, if there is no solution that satisfies P, it is usually effective to abandon the Bellman–Ford algorithm early, after 0.1N iterations rather than N iterations. This seems to have no impact on the quality of results, yet it can greatly speed up searching for the minimum P that can be satisfied in the design.
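The early-abandonment heuristic amounts to capping the solver's iteration count. A self-contained sketch (the names are assumptions; the cutoff defaults to the 0.1N figure suggested above, with N taken as the number of constraints):

```python
def solve_with_cutoff(nodes, cons, cutoff=0.1):
    """Bellman-Ford over difference constraints r[u] - r[v] <= c, giving
    up after cutoff * N rounds (N = number of constraints) on the
    assumption that a satisfiable system converges quickly."""
    r = {n: 0 for n in nodes}
    rounds = max(1, int(cutoff * len(cons)))
    for _ in range(rounds):
        changed = False
        for u, v, c in cons:
            if r[u] - r[v] > c:
                r[u] = r[v] + c
                changed = True
        if not changed:
            return r                  # converged: constraints satisfiable
    return None                       # presumed unsatisfiable

# A feasible chain converges in a few rounds even with a generous cutoff.
cons = [("a", "b", -1), ("b", "c", -1)]
print(solve_with_cutoff(["a", "b", "c"], cons, cutoff=3.0))
# {'a': -2, 'b': -1, 'c': 0}
```

An infeasible system (a cycle whose bounds sum to a negative value) never converges, so the cap simply stops it early.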

A second optimization is to use the last computed set of constraints as a starting point. In conventional retiming, the Bellman–Ford process is invoked multiple times to find the lowest satisfiable critical path. In contrast, fixed-frequency repipelining or C-slow retiming uses Bellman–Ford to discover the minimum number of additional registers needed to satisfy the constraints. In both cases,

keeping the last failed or successful solution in the data structure provides a starting point that can significantly speed up the process if a solution exists.

Retiming in this way imposes only minimal design limitations: because it applies only to synchronous circuits, there can be no asynchronous resets or similar elements. A synchronous global reset imposes too many constraints to allow effective retiming. Local synchronous resets and enables produce only small self-loops that have no effect on the correct operation of the algorithm.

Most other design features can be accommodated simply by adding appropriate constraints. For example, an FPGA with a tristate bus cannot have registers placed on this bus. A constraint that says that all edges crossing the bus can never be registered (r(u) − r(v) ≤ 0 and r(v) − r(u) ≤ 0) ensures this. Likewise, an embedded memory with a mandatory output flip-flop can have a constraint (r(u) − r(v) ≤ w(e) − 1) that ensures that at least one register is placed on this output.
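Both architectural cases reduce to extra rows in the constraint system. A sketch with assumed names, emitting the same (u, v, c) difference-constraint tuples used elsewhere:

```python
def never_registered(u, v):
    """Edge crossing a tristate bus: pin r(u) = r(v) so the register
    count cannot change (it stays at zero on a bus)."""
    return [(u, v, 0), (v, u, 0)]     # r(u)-r(v) <= 0 and r(v)-r(u) <= 0

def must_be_registered(u, v, w_e):
    """Edge feeding a mandatory output flip-flop: force w_r(e) >= 1."""
    return [(u, v, w_e - 1)]          # r(u)-r(v) <= w(e)-1

# A memory output with one register that retiming must not remove:
print(must_be_registered("ram", "q", 1))  # [('ram', 'q', 0)]
```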

Memories themselves can be retimed like any other element in the design, with dual-ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronicity) can be either left unchanged or switched to operate on the positive edge, with constraints to mandate the placement of registers.

Some FPGA designs have registers with predefined initial values. If retiming is allowed to move these registers, the proper initial values must be calculated such that the circuit still produces the same behavior.

In an ASIC model, all flip-flops start in an undefined state, and the designer must create a small state machine to reset the design. FPGAs, however, start all flip-flops in a known, user-defined state, and when a dedicated global reset is applied, the flip-flops return to that state. This has serious implications for retiming.

If the decision is made to utilize the ASIC model, retiming is free to ignore initial conditions safely, because explicit reset logic in state machines will still operate correctly; this is reflected in the I/O semantics. However, without the freedom an ASIC-style model gives to violate initial conditions, retiming quality often suffers, as additional logic is required or limits are placed on where flip-flops may be moved in a design.

In practice, performing retiming with initial conditions is NP-hard. Cong and Wu [3] have developed an algorithm that computes initial states by restricting retiming to forward register movement only, propagating initial-state information forward along with the registers. This works because solving initial states for registers moved forward is straightforward, while backward movement is NP-hard, as it reduces to satisfiability.

Additionally, global set/reset imposes a huge constraint on retiming. An asynchronous set/reset can never be retimed (retiming cannot modify an asynchronous circuit), while a synchronous set/reset simply imposes too high a fanout.

An important question is how to deal with multiple clocks. If the interfaces between the clock domains are registered by clocks from both domains, it is a simple process to retime the domains separately, with mandatory registers on the domain crossings; the constraints placed on the I/Os ensure correct and consistent timing through the interface. Yet without this design constraint, retiming across multiple clock domains is very hard, and there does not appear to be any clean automatic solution.

TABLE 18.2  The results of retiming four benchmarks

    Benchmark            Unretimed   Automatically retimed
    AES core             48 MHz      47 MHz
    Smith/Waterman       43 MHz      40 MHz
    Synthetic datapath   51 MHz      54 MHz
    LEON processor       23 MHz      25 MHz

Table 18.2 shows the results for a particular retiming tool [13], targeting the Xilinx Virtex family of FPGAs, on four benchmark circuits: an AES core, a Smith/Waterman systolic cell, a synthetic microprocessor datapath, and the LEON-I synthesized SPARC core. This tool does not use a perfectly accurate delay model and has to place registers after retiming, so it sometimes creates slightly suboptimal results.

The biggest problem with retiming is that it is of limited benefit to a well-balanced design. As mentioned earlier, if the clock cycle is defined by a single-cycle feedback loop, retiming can never improve the design, as moving the register around the feedback loop produces no effect.

Thus, for example, the Smith–Waterman example in Table 18.2 does not benefit from retiming. The Smith–Waterman benchmark design consists of a series of repeated identical systolic cells that implement the Smith–Waterman sequence alignment algorithm. The cells each contain a single-cycle feedback loop, which cannot be optimized. The AES encryption algorithm also consists of a single-cycle feedback loop. In this case, the initial design used a negative-edge BlockRAM to implement the S-boxes, which the retiming tool converted to a positive-edge memory with a "must register" constraint.

Nevertheless, retiming can still be a benefit if the design consists of multiple feedback loops (such as the synthetic microprocessor datapath or the LEON SPARC-compatible microprocessor core) or an initially unbalanced pipeline.

Still, for well-designed circuits, even complex ones, retiming is often only a slight benefit, as engineers have considerable experience designing reasonably optimized feedback loops.

The key benefit of retiming occurs when more registers can be added to the design along the critical path. We will discuss two techniques, repipelining and C-slow retiming, that first add a large number of registers, which general retiming can then move into the optimal location.
