Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C033 Finals Page 662 29-9-2008 #19 662 Handbook of Algorithms for Physical Design Automation

worst case when the victim net is fully coupled from both sides by two aggressor nets. Of course, more optimistic modeling, for example, based on some distribution assumption, is also applicable. Nevertheless, the technical conclusions under, say, a distribution assumption remain similar, and thus we focus on the worst-case scenario for ease of presentation. With the worst-case scenario, we have $c_c = 2 l_v c_f / d_{\min}$. Furthermore, $2c_c$ is adopted in the model to account for the worst-case coupling effect, which occurs when all the aggressor nets have signal transitions different from that of the victim net; in that case, the Miller effect doubles the coupling effect. Again, the technique to be presented readily applies to other models with less pessimistic estimation.

Consider a wire $e = (u, v)$, where $u$ and $v$ are two nodes in a buffered tree. Let the length of the wire segment $e$ be $l_e$, and let $T(v)$ be the subtree rooted at $v$. $I_{T(v)}$ is the total downstream current seen at $v$, that is, the current induced by aggressor nets on the wires downstream of $v$. The current on a unit-length wire induced by aggressor nets is $i_0 = \lambda p c$ [24], where $c$ is the unit-length wire capacitance, $\lambda$ is the fixed ratio of coupling to total wire capacitance, $p$ is the slope (i.e., power supply voltage over input rise time) of all aggressor nets' signals, and $c_c$ is modeled as some fraction of the unit-length wire capacitance of the victim net. Let $\chi(u, v)$ be the noise on the wire segment between two neighboring buffers $u$ and $v$. The resulting noise $\chi(u, v)$ induced from the coupling current is the voltage pulse coupled from aggressor nets in the victim net for a wire segment $e = (u, v)$.
Using an Elmore-delay-like noise metric [24] to model $\chi(u, v)$ (see Chapter 3), we can express the noise constraint as

$$\chi(u, v) = R_b I_{T(u)} + r l_e \left( \frac{i_0 l_e}{2} + I_{T(v)} \right) \le M_v, \tag{33.1}$$

where $I_{T(u)} = i_0 l_e + I_{T(v)}$ is the total current seen at $u$, $R_b$ is the output resistance of a minimum-size buffer, and $M_v$ is the noise margin for a buffer or a sink $v$, which is the maximum allowable noise without incurring any logic error. The width $W_{\mathrm{IFR}_i^{(N)}}$ of the independent feasible region $\mathrm{IFR}_i^{(N)}$ for the $i$th buffer that satisfies the noise constraint is given by

$$W_{\mathrm{IFR}_i^{(N)}} \le \sqrt{\left( \frac{R_b}{r} \right)^2 + \left( \frac{I_{T(v)}}{i_0} \right)^2 + \frac{2 M_v}{i_0 r}} - \frac{R_b}{r} - \frac{I_{T(v)}}{i_0}. \tag{33.2}$$

For this noise model, the four factors that determine the size of a feasible region are the noise margin $M_v$, the buffer resistance $R_b$, the unit-length wire resistance $r$, and the crosstalk-induced unit current $i_0$.

The feasible region under the noise constraint, denoted by $\mathrm{IFR}_i^{(N)}$, is the maximum allowable length in each net that satisfies the noise margins after buffer insertion. To estimate the feasible region under the noise constraint, $\mathrm{IFR}_i^{(N)}$, the noise formulas [11] below can be applied. The induced noise current on wire segment $e = (u, v)$ is computed by $I_e = i_0 l_e$. To satisfy the noise constraint, a buffer can be inserted at $u$ as in Equation 33.1, where $W_{\mathrm{IFR}_i^{(N)}}$, the width of the feasible region $\mathrm{IFR}_i^{(N)}$ for buffers satisfying the noise constraint, is computed from Equation 33.2:

$$\chi(u, v) = R_b I_{T(u)} + r W_{\mathrm{IFR}_i^{(N)}} \left( \frac{i_0 W_{\mathrm{IFR}_i^{(N)}}}{2} + I_{T(v)} \right) \le M_v. \tag{33.3}$$

Given two-pin nets as inputs, the method is to scan from the sink $s_i$ with the given $M_{s_i}$ to the source $s_0$.
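As a rough numerical check of Equation 33.2, the following Python sketch evaluates the width bound and substitutes it back into the noise expression of Equation 33.3. The function names are hypothetical, and the only modeling assumption is the relation $I_{T(u)} = i_0 W + I_{T(v)}$ used in the derivation.

```python
import math


def noise_width(R_b, r, i0, M_v, I_Tv=0.0):
    """Largest wire length allowed by the noise margin (Equation 33.2).

    R_b: buffer output resistance, r: unit-length wire resistance,
    i0: unit-length induced current, M_v: noise margin,
    I_Tv: accumulated downstream current at v (zero for a two-pin net).
    """
    a = R_b / r
    b = I_Tv / i0
    return math.sqrt(a * a + b * b + 2.0 * M_v / (i0 * r)) - a - b


def noise(R_b, r, i0, length, I_Tv=0.0):
    """Noise of a buffered wire of the given length (Equation 33.3)."""
    I_e = i0 * length                      # current induced on the wire itself
    return R_b * (I_e + I_Tv) + r * length * (I_e / 2.0 + I_Tv)
```

With any consistent set of units, `noise(R_b, r, i0, noise_width(R_b, r, i0, M_v, I_Tv), I_Tv)` should return `M_v` up to rounding, since the bound is the positive root of the quadratic obtained from Equation 33.3 at equality.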
Because the accumulated crosstalk-induced current $I_{T(v)}$ is zero for pins of two-pin nets, the noise formula is given by

$$\chi(u, v) = R_b I_{T(u)} + r W_{\mathrm{IFR}_i^{(N)}} \left( \frac{I_e}{2} + I_{T(v)} \right) = R_b \left( I_e + I_{T(v)} \right) + r W_{\mathrm{IFR}_i^{(N)}} \left( \frac{I_e}{2} + I_{T(v)} \right) = R_b I_e + r W_{\mathrm{IFR}_i^{(N)}} \frac{I_e}{2}. \tag{33.4}$$

On the basis of Equations 33.3 and 33.4, $W_{\mathrm{IFR}_i^{(N)}}$ can be computed by

$$W_{\mathrm{IFR}_i^{(N)}} \le \sqrt{\left( \frac{R_b}{r} \right)^2 + \frac{2 M_v}{i_0 r}} - \frac{R_b}{r}.$$

In the preceding equation, $W_{\mathrm{IFR}_i^{(N)}}$ is the maximum length from the next buffer $B_{i+1}$ back to $B_i$ without causing any logic error.

To handle the transition time, delay, and noise constraints simultaneously, we first compute the respective feasible regions $\mathrm{IFR}_i^{(R)}$, $\mathrm{IFR}_i^{(D)}$, and $\mathrm{IFR}_i^{(N)}$ for inserting buffer $i$ to satisfy the transition time, delay, and noise constraints, and then find the intersection of $\mathrm{IFR}_i^{(R)}$, $\mathrm{IFR}_i^{(D)}$, and $\mathrm{IFR}_i^{(N)}$ to derive the feasible region for buffer $i$ that meets all these constraints (see Figure 33.9 for an illustration). Furthermore, the buffer block planning algorithm presented in Section 33.3.3.3 still works by additionally considering the noise constraint.

FIGURE 33.9 Respective feasible regions $\mathrm{IFR}_i^{(D)}$, $\mathrm{IFR}_i^{(N)}$, and $\mathrm{IFR}_i^{(D)} \cap \mathrm{IFR}_i^{(N)}$ for inserting a buffer that satisfy the delay, noise, and both delay and noise constraints. (Figure labels: Source, $\mathrm{IFR}_i^{(D)}$, $\mathrm{IFR}_i^{(N)}$, Intersection, Obstacle, Sink.)

33.4 FLIP-FLOP AND BUFFER PLANNING (WIRE RETIMING)

Although buffer insertion is very effective in improving the delay performance (and noise tolerance) of interconnects, the timing constraints may be so tight that they are beyond the maximum performance deliverable by buffer insertion, making the insertion of flip-flops or latches for pipelined signal transmission necessary. In the case of modern high-performance microprocessors [13], it is not unusual for global signals to take several clock cycles to travel across the chip to reach their destinations.
In fact, the wire delay can be as long as about ten clock cycles in the near future [25]. It has been shown in Ref. [26] that under an aggressive scaling scenario where the frequency of microprocessors approximately doubles and die size increases by about 25 percent in every process generation, the number of flip-flops (referred to as clocked repeaters) increases by 7 times every process generation.

As the number of flip-flops and buffers increases in an exponential fashion, the planning and design of pipelined interconnects are very important emerging problems. Several design challenges can be posed:

1. What is the minimum latency required between two communicating functional blocks of a design?
2. Given the latency constraints between two communicating functional blocks of a design, where should flip-flops and buffers be inserted to minimize, for example, the total flip-flop and buffer area?
3. How does interconnect latency affect the system behavior? Arbitrary interconnect latency may destroy the functionality of a sequential circuit. How can functional blocks and interconnects be simultaneously retimed to achieve the desired circuit performance while maintaining its functionality?
4. How can buffer planning take into consideration the retiming of logic blocks and interconnects, as well as the placement of those flip-flops relocated by retiming?

33.4.1 MINIMIZING LATENCY

In the initial stages of the design of high-performance microarchitectures, the minimum latency that can be achieved on long interconnects gives microarchitects and circuit designers an accurate prediction of the timing and routing demands required of the design. There are two approaches to the problem of latency minimization: (1) using analytical formulas [27]; and (2) using a van Ginneken-style dynamic programming approach [26].
33.4.1.1 Two-Pin Net Optimization Using Analytical Formulas

Consider a wire with length $L$, driver resistance $R_d$, and sink capacitance $C_s$. On the basis of the optimal delay formula obtained when we insert $n$ buffers into the wire [6], the optimal delay for an interconnect properly inserted with an ideal optimal number of buffers is

$$D_{opt}(L) = \left( R_b c + r C_b + \sqrt{2rc(R_b C_b + T_b)} \right) \cdot L + (l_r + l_c) \cdot \sqrt{2rc(R_b C_b + T_b)} + l_r r C_b + l_c R_b c - \frac{rc}{2} \left( l_r^2 + l_c^2 \right) - T_b,$$

where

$$l_r = \frac{R_d - R_b}{r}, \qquad l_c = \frac{C_s - C_b}{c}.$$

Here, the ideal optimal number of buffers is defined as

$$n_{opt}(L) = \sqrt{\frac{rc}{2(R_b C_b + T_b)}} \cdot (L + l_r + l_c) - 1,$$

which may not be an integer. Therefore, the maximum length of a wire inserted with the ideal optimal number of buffers that can meet a given delay constraint $D_{tgt}$ is

$$L_{max}(R_d, C_s, D_{tgt}) = \frac{D_{tgt} + T_b + \frac{rc}{2} \left( l_r^2 + l_c^2 \right) - l_r r C_b - l_c R_b c - (l_r + l_c) \sqrt{2rc(R_b C_b + T_b)}}{R_b c + r C_b + \sqrt{2rc(R_b C_b + T_b)}}.$$

Although the ideal optimal number of buffers $n_{opt}$ may not be an integer, which is not realizable, the actual optimal number of buffers of the interconnect is either $\lfloor n_{opt} \rfloor$ or $\lceil n_{opt} \rceil$. Let $L_N(n)$ denote the maximal length for an interconnect $N$ with $n$ buffers under a given timing requirement $D_{tgt}$. ($L_N(n)$ can be obtained by solving for $L$ in the optimal delay formula for a given $n$ and $D_{tgt}$.) The maximum wire length of the interconnect inserted with buffers that can meet a given target delay $D_{tgt}$ is

$$L_{max}(R_d, C_s, D_{tgt}) = \max \{ L_N(\lfloor n_{opt} \rfloor), L_N(\lceil n_{opt} \rceil) \}.$$

With flip-flops inserted, we have to define target delays for the first segment, the middle segments, and the last segment of the pipelined interconnects separately. The timing constraint for any middle segment, denoted $D_{tgt,M}$, is the clock period less the setup time and the flip-flop propagation delay.
The timing constraint for the first segment, denoted $D_{tgt,F}$, should ensure that the maximum delay from those source flip-flops before the driver to the first flip-flop along the pipelined interconnect is smaller than one clock period less the setup time and the flip-flop propagation delay. Similarly, the timing constraint for the last segment, denoted $D_{tgt,L}$, should ensure that the maximum delay from the last flip-flop along the pipelined interconnect to the flip-flops after the sink is smaller than one clock period less the setup time and the flip-flop propagation delay. Therefore, the minimum latency, or the least number of flip-flops required to meet the delay and clock period constraints, is

$$N_{FF} = \begin{cases} 0 & \text{if } L \le L_{max}(R_d, C_s, D_{tgt}), \\ 1 & \text{if } L_{max}(R_d, C_s, D_{tgt}) < L \le L_L + L_F, \\ \left\lceil \dfrac{L - L_F - L_L}{L_M} \right\rceil + 1 & \text{otherwise,} \end{cases}$$

where

$$L_F = L_{max}(R_d, C_F, D_{tgt,F}), \qquad L_L = L_{max}(R_F, C_s, D_{tgt,L}), \qquad L_M = L_{max}(R_F, C_F, D_{tgt,M}),$$

with $R_F$ and $C_F$ being, respectively, the output resistance and input capacitance of a flip-flop.

In the context of flip-flop and buffer planning, of greater interest are the feasible regions (or independent feasible regions) of flip-flops and buffers. Let $n$ be the number of flip-flops inserted in an interconnect and $f_i$ be the location of the $i$th ($1 \le i \le n$) flip-flop. With $f_i^*$ denoting the central location of the $i$th flip-flop in its feasible region, and $W_{FR}$ the uniform width of the feasible regions, we define the FR for the $i$th flip-flop as

$$FR_i = \left[ f_i^* - \frac{W_{FR}}{2},\; f_i^* + \frac{W_{FR}}{2} \right] \cap (0, L),$$

such that, for all $(f_1, f_2, \ldots, f_i, \ldots, f_n) \in FR_1 \times FR_2 \times \cdots \times FR_n$, we have $f_1 \le L_F$, $f_i - f_{i-1} \le L_M$ for $2 \le i \le n$, and $L - f_n \le L_L$. The following inequalities must hold for a flip-flop solution to be feasible: $f_1^* + W_{FR}/2 \le L_F$, $f_i^* - f_{i-1}^* + W_{FR} \le L_M$ for $2 \le i \le n$, and $L - f_n^* + W_{FR}/2 \le L_L$. The largest $W_{FR}$ that satisfies these inequalities is

$$W_{FR} = \frac{L_F + L_L + (n - 1) L_M - L}{n}.$$
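The case analysis for $N_{FF}$ and the expression for $W_{FR}$ translate directly into a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the lengths $L_{max}$, $L_F$, $L_L$, and $L_M$ are assumed to have been computed beforehand from the $L_{max}$ formula.

```python
import math


def min_flip_flops(L, L_max, L_F, L_L, L_M):
    """Least number of flip-flops N_FF for a wire of length L."""
    if L <= L_max:
        return 0                      # buffers alone meet the target delay
    if L <= L_L + L_F:
        return 1                      # one flip-flop splits the wire in two
    # First and last segments cover L_F + L_L; the rest needs middle segments.
    return math.ceil((L - L_F - L_L) / L_M) + 1


def feasible_region_width(L, n, L_F, L_L, L_M):
    """Largest uniform feasible-region width W_FR for n flip-flops."""
    return (L_F + L_L + (n - 1) * L_M - L) / n
```

Calling `feasible_region_width` with `n = min_flip_flops(...)` yields a nonnegative width whenever at least one flip-flop is required, because the ceiling guarantees $(n - 1) L_M \ge L - L_F - L_L$.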
Correspondingly, the central locations $f_i^*$ are

$$f_i^* = L_F + (i - 1) L_M - \left( i - \frac{1}{2} \right) W_{FR} \quad \text{for } 1 \le i \le n.$$

The independent feasible regions of flip-flops and buffers can also be determined in a fairly straightforward fashion [27]. With the definition of feasible regions of flip-flops in place, the buffer planning algorithms outlined in preceding sections can be easily extended to handle the latency minimization problem.

33.4.1.2 Multiple-Terminal Net Optimization

In the case of two-pin net optimization (Section 33.4.1.1), the planning can be carried out without first performing routing. In the case of multiple-terminal net optimization, the assumption is that the routing solution of global nets is known. This is typically true in the context of design migration, where the microarchitects and circuit designers would like to make minimal changes to the design. A natural algorithm to adopt would be that of van Ginneken [14].

In Ref. [26], each flip-flop and buffer insertion solution can be represented by a four-tuple $\gamma = (c, r, \lambda, a)$, where $c$ is the capacitance seen by the upstream resistance, $r$ is the required arrival time, $\lambda$ is the maximum number of flip-flops crossed when going from this node (or edge) to its leaf nodes, and $a$ is the flip-flop or buffer assignment at this node. For simplicity, we assume that long edges are segmented properly and that flip-flop and buffer insertion is allowed only at nodes. At a leaf node $v$, the solution is $(c_v, r_v, 0, \emptyset)$, where $c_v$ is the sink capacitance, and $r_v$ is the required arrival time at node $v$. The propagation of a solution from a node to its parent edge (the edge connecting the node to its parent node) proceeds as in the dynamic programming algorithm of Ref. [14]. Let the node solution at node $v$ be $(c_v, r_v, \lambda_v, a_v)$.
The corresponding solution at the upstream node of the branch $(u, v)$ is $(c_v + C_{u,v},\; r_v - R_{u,v}(C_{u,v} + c_v),\; \lambda_v,\; \emptyset)$, where $C_{u,v}$ is the edge capacitance and $R_{u,v}$ is the edge resistance. When two downstream branches meet at a parent node, we merge two solutions $(c_u, r_u, \lambda_u, a_u)$ and $(c_v, r_v, \lambda_v, a_v)$ from the two branches to form $(c_u + c_v,\; \min(r_u, r_v),\; \max(\lambda_u, \lambda_v),\; a_u \cup a_v)$. When we insert a buffer $g$ to drive a subtree with solution $(c_u, r_u, \lambda_u, a_u)$, the new solution is $(c_g,\; r_u - R_g c_u - t_g,\; \lambda_u,\; \{g\})$, where $c_g$ is the gate capacitance of $g$, $R_g$ is the output resistance of $g$, and $t_g$ is the intrinsic delay of $g$. When we add a flip-flop $f$ to drive the subtree instead, the new solution is $(c_f,\; T_{CP} - t_{su,f},\; \lambda_u + 1,\; \{f\})$, where $c_f$ is the gate capacitance of $f$, $T_{CP}$ is the clock period, and $t_{su,f}$ is the setup time of $f$. Note that when we insert a flip-flop, we have to first verify that the pipeline stage immediately after the newly inserted flip-flop has nonnegative slack or required arrival time.

As in van Ginneken's algorithm, it is important to prune all solutions so as to keep only noninferior solutions that can lead to an optimal solution at the root node. Let $\gamma = (c, r, \lambda, a)$ and $\gamma' = (c', r', \lambda', a')$ be two solutions at any node in the tree. We say that $\gamma$ is inferior and can be pruned if at least one of the following is true:

• $\lambda = \lambda'$, $c \ge c'$, and $r < r'$
• $\lambda = \lambda'$, $c > c'$, and $r = r'$
• $\lambda = \lambda'$, $c = c'$, $r = r'$, and $\mathrm{cost}(\gamma) > \mathrm{cost}(\gamma')$
• $\lambda > \lambda'$, $c \ge c'$, and $r \le r'$

Here, $\mathrm{cost}(\cdot)$ is a user-specified cost function associated with the flip-flop and buffer solution; an example of the cost function is the total area of the solution. Also note that all solutions kept in the algorithm have nonnegative $r$.
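The solution operations and pruning rules of this section can be sketched in Python as follows. This is an illustrative reading of the text, not the implementation of Ref. [26]: the class and function names are hypothetical, the slack test in `add_flip_flop` assumes a flip-flop output resistance `R_f` and propagation delay `t_f` that the four-tuple description does not spell out, and the default cost function (zero for every solution) is a placeholder for the user-specified one.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Sol:
    c: float                          # capacitance seen by the upstream resistance
    r: float                          # required arrival time
    lam: int                          # max flip-flops crossed on the way to a leaf
    a: FrozenSet[str] = frozenset()   # flip-flop or buffer assigned at this node


def add_wire(s: Sol, R_uv: float, C_uv: float) -> Sol:
    """Propagate a solution across edge (u, v) to its upstream end."""
    return Sol(s.c + C_uv, s.r - R_uv * (C_uv + s.c), s.lam, frozenset())


def merge(s: Sol, t: Sol) -> Sol:
    """Combine the solutions of two branches meeting at a parent node."""
    return Sol(s.c + t.c, min(s.r, t.r), max(s.lam, t.lam), s.a | t.a)


def add_buffer(s: Sol, g: str, c_g: float, R_g: float, t_g: float) -> Sol:
    return Sol(c_g, s.r - R_g * s.c - t_g, s.lam, frozenset({g}))


def add_flip_flop(s: Sol, f: str, c_f: float, R_f: float, t_f: float,
                  T_cp: float, t_su: float) -> Optional[Sol]:
    # Assumed slack check: the stage driven by the new flip-flop must still
    # have nonnegative required arrival time at the flip-flop output.
    if s.r - R_f * s.c - t_f < 0:
        return None
    return Sol(c_f, T_cp - t_su, s.lam + 1, frozenset({f}))


def is_inferior(g: Sol, h: Sol, cost: Callable[[Sol], float]) -> bool:
    """True if g can be pruned in favor of h under the four rules."""
    if g.lam == h.lam:
        return ((g.c >= h.c and g.r < h.r)
                or (g.c > h.c and g.r == h.r)
                or (g.c == h.c and g.r == h.r and cost(g) > cost(h)))
    return g.lam > h.lam and g.c >= h.c and g.r <= h.r


def prune(sols: List[Sol],
          cost: Callable[[Sol], float] = lambda s: 0.0) -> List[Sol]:
    """Keep solutions with nonnegative r that no other solution dominates."""
    return [s for s in sols
            if s.r >= 0
            and not any(is_inferior(s, t, cost) for t in sols if t is not s)]
```

The quadratic scan in `prune` is chosen for clarity only; a production implementation would keep the candidate list sorted so that dominated solutions can be removed in a single pass.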
33.4.2 LATENCY CONSTRAINED OPTIMIZATION

Suppose the required latency at leaf node $v$ is $\lambda_v$ (assuming that the latency at the root node is zero). We can generalize the algorithm given in Section 33.4.1.2 by using $\gamma = (c_v, r_v, -\lambda_v, \emptyset)$ at $v$. The algorithm in Section 33.4.1.2 can then be applied to compute an optimal solution to the latency constrained optimization problem with a minor modification: Any solution that has a latency greater than zero can be pruned [26].

As the required latency at the root node is zero, only solutions that have zero latency would be feasible. Consequently, at the root node, if a solution has a negative latency $\lambda$, more flip-flops can always be added to make the solution feasible, that is, to make the latency at the root node equal zero. As we search top-down to retrieve an optimal solution at all nodes, we might have to insert more flip-flops. Consider the solution $(c_u + c_v,\; \min(r_u, r_v),\; \max(\lambda_u, \lambda_v),\; a_u \cup a_v)$ obtained by merging solutions $(c_u, r_u, \lambda_u, a_u)$ and $(c_v, r_v, \lambda_v, a_v)$ of two downstream branches. If $\lambda_u = \max(\lambda_u, \lambda_v)$, an additional $\lambda_u - \lambda_v$ flip-flops should be inserted into the branch that contains the solution $(c_v, r_v, \lambda_v, a_v)$.

33.4.3 WIRE RETIMING

Unfortunately, long wires cannot be pipelined in isolation. It is important to consider the effect of interconnect latency on overall system behavior. Relocating flip-flops to pipeline logic paths while preserving the functionality of the circuit is known as retiming [28]. However, traditional retiming approaches ignore interconnect delay. In modern-day designs, it is imperative to consider the problem of retiming with both interconnect and gate delays [29–31].
In the context of retiming, a sequential circuit can be represented by a directed graph $G_R(V_R, E_R)$, where each node $v \in V_R$ corresponds to a combinational gate, and each directed edge $e_{uv} \in E_R$ connects the output of gate $u$ to the input of gate $v$, through a nonnegative number of registers. Without loss of generality, $G_R$ can be assumed to be strongly connected; fictitious nodes and edges can be added to make it strongly connected otherwise. Let $d_u$ be the gate delay of node $u$, $w_{uv}$ the number of flip-flops of edge $e_{uv}$, and $d_{uv}$ the interconnect delay of edge $e_{uv}$ if all the flip-flops are removed. Although it is hard to accurately model interconnect delay, it is fairly accurate to assume that the delay of a wire is linearly proportional to its length for the following reasons: When a wire is short, the linear component of the wire delay dominates the quadratic component. For a long wire, buffers inserted at appropriate locations can render the delay linear.

The retiming problem can be viewed as one of determining a labeling of the nodes $r : V_R \to \mathbb{Z}$, where $\mathbb{Z}$ is the set of integers [28], such that $w_{uv} + r(v) - r(u) \ge 0$ for all edges $e_{uv} \in E_R$. The retiming label $r(v)$ of node $v$ represents the number of flip-flops moved from its outputs to its fan-ins, and $\hat{w}_{uv} = w_{uv} + r(v) - r(u)$ denotes the number of flip-flops on edge $e_{uv}$ after retiming. Retiming can be formulated as a problem of determining a feasible retiming solution for a given clock period, that is, a solution in which the number of flip-flops on every edge is nonnegative for a given clock period. The minimum achievable clock period $T_{CP}^*$ can then be computed by performing a binary search.
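To make the binary search concrete, the sketch below uses a deliberately simplified feasibility oracle: it assumes all gate delays are zero, in which case (as this section discusses further on) a clock period is infeasible exactly when the graph with edge lengths $d_{uv}/T_{CP} - w_{uv}$ contains a positive cycle. The function names are hypothetical, the longest-path pass starts every node at zero (equivalent to adding a virtual source) rather than picking a source node in $G_R$, and the labels are obtained by rounding the path lengths down, which is an assumption of this sketch.

```python
import math


def retimed_weights(edges, r):
    """Flip-flop count on each edge after retiming: w_uv + r(v) - r(u)."""
    return {(u, v): w + r[v] - r[u] for u, v, _d, w in edges}


def feasible_wire_only(nodes, edges, T_cp, eps=1e-9):
    """Return retiming labels for clock period T_cp, or None if infeasible.

    edges is a list of (u, v, d_uv, w_uv); gate delays are assumed zero.
    """
    R = {v: 0.0 for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, d, w in edges:
            cand = R[u] + d / T_cp - w      # longest-path relaxation
            if cand > R[v] + eps:
                R[v] = cand
                changed = True
        if not changed:                     # converged: no positive cycle
            return {v: math.floor(R[v]) for v in nodes}
    return None                             # still relaxing: positive cycle


def min_clock_period(nodes, edges, lo, hi, tol=1e-6):
    """Binary search, assuming lo is infeasible and hi is feasible."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible_wire_only(nodes, edges, mid) is not None:
            hi = mid
        else:
            lo = mid
    return hi
```

For a two-node loop with wire delays 3 and 5 and one flip-flop on each edge, the search converges to a period of about 4, the total cycle delay divided by the number of flip-flops on the cycle.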
A feasible retiming solution for a given clock period $T_{CP}$ must satisfy the following set of constraints [30]:

$$\begin{aligned} d_v &\le a(v) && \forall\, v \in V_R, \\ a(v) &\le T_{CP} && \forall\, v \in V_R, \\ w_{uv} + r(v) - r(u) &\ge 0 && \forall\, e_{uv} \in E_R, \\ a(v) &\ge a(u) + d_{uv} + d_v - T_{CP} \left[ w_{uv} + r(v) - r(u) \right] && \forall\, e_{uv} \in E_R. \end{aligned}$$

Here, $a(v)$ represents the maximum arrival time at the output of gate $v$ from a flip-flop that directly drives the logic path containing $v$. The first two constraints are fairly straightforward. The third constraint is required for a feasible retiming solution. The fourth constraint ensures that sufficient flip-flops are inserted along each edge $e_{uv}$ for the circuit to be operable at a clock period of $T_{CP}$. Every flip-flop along the edge $e_{uv}$ after retiming reduces the right-hand side of the inequality by $T_{CP}$.

By introducing a variable $R(v)$ defined as $a(v)/T_{CP} + r(v)$ at each node $v$, the preceding set of constraints can be transformed into a set of difference constraints as follows [30]:

$$R(v) - r(v) \ge \frac{d_v}{T_{CP}} \quad \forall\, v \in V_R, \tag{33.5}$$

$$R(v) - r(v) \le 1 \quad \forall\, v \in V_R, \tag{33.6}$$

$$r(u) - r(v) \le w_{uv} \quad \forall\, e_{uv} \in E_R, \tag{33.7}$$

$$R(v) - R(u) \ge \frac{d_{uv}}{T_{CP}} + \frac{d_v}{T_{CP}} - w_{uv} \quad \forall\, e_{uv} \in E_R. \tag{33.8}$$

These difference constraints involve $|V_R|$ real variables $R(v)$, $|V_R|$ integer variables $r(v)$, and $2|V_R| + 2|E_R|$ constraints, and can be solved in polynomial time of $O(|V_R||E_R| \log |V_R| + |V_R|^2 \log^2 |V_R|)$, using a Fibonacci heap as the data structure [32].

Given a feasible retiming solution, the exact positions at which flip-flops should be inserted can be determined as follows: For each edge $e_{uv}$ with nonzero $\hat{w}_{uv}$, the first flip-flop on this edge is inserted at a distance that corresponds to a delay of $T_{CP} - a(u)$ from the output of gate $u$. Other flip-flops are inserted at a distance that corresponds to a delay of $T_{CP}$ from the previous one, until gate $v$ is reached.
All remaining flip-flops on this edge are then inserted right before $v$.

A fast approximation algorithm can be obtained by first replacing each gate by a wire of the same delay, and then solving optimally and efficiently the retiming problem with only interconnect delays [30]. The key to the fast approximation algorithm is the observation that for a directed graph where $d_v = 0$ for all $v \in V_R$, given $R(v)$ for all $v \in V_R$ that satisfy the constraint in Equation 33.8, the set of difference constraints can be satisfied by setting $r(v) = \lfloor R(v) \rfloor$ for all $v \in V_R$. The problem of finding $R(v)$ for all $v \in V_R$ to satisfy the constraint given in Equation 33.8 can be posed as a single-source longest-paths problem on $G_R$ with the cost or length of each edge $e_{uv} \in E_R$ defined as $d_{uv}/T_{CP} - w_{uv}$. Any node in $G_R$ can be the source node as the graph is strongly connected. If $G_R$ has a positive cycle, the clock period $T_{CP}$ is infeasible. The single-source longest-paths problem can be solved by the Bellman–Ford algorithm in $O(|V_R||E_R|)$ time. With a path compaction preprocessing step to reduce the size of $G_R$, the complexity can be further reduced.

Given a retiming solution for a graph with only interconnect delays, if the solution retimes some flip-flops into a wire that represents a gate, a postprocessing step is required to get back a feasible retiming solution that has both gate and interconnect delays. First, we move the flip-flops in a gate to its fan-ins or fan-outs depending on which direction has a shorter distance (delay). A linear program is then used to determine the exact positions of the flip-flops on the interconnect edges. The objective of the linear program is to minimize the clock period $T_{CP}$ subject to constraints on the flip-flop counts and constraints on the delays between flip-flops. Let $x_{uv}^k$ denote the delay from the $k$th flip-flop to the $(k+1)$st flip-flop of the wire from node $u$ to node $v$ in $G_R$, for $k = 0, 1, \ldots, \hat{w}_{uv}$.
The linear program is formulated as follows:

$$\begin{aligned} \text{Minimize} \quad & T_{CP} \\ \text{subject to} \quad & \sum_{k=0}^{\hat{w}_{uv}} x_{uv}^k = d_{uv} && \forall\, e_{uv} \in E_R, \\ & x_{uv}^{\hat{w}_{uv}} + d_v \le a(v) && \forall\, e_{uv} \in E_R \text{ s.t. } \hat{w}_{uv} > 0, \\ & a(u) + x_{uv}^0 \le T_{CP} && \forall\, e_{uv} \in E_R \text{ s.t. } \hat{w}_{uv} > 0, \\ & a(u) + d_{uv} \le a(v) && \forall\, e_{uv} \in E_R \text{ s.t. } \hat{w}_{uv} > 0, \end{aligned}$$

33.4.4 AREA CONSTRAINED WIRE RETIMING

To account for the area overhead incurred by wire retiming during the planning stage, a more closely related problem is that of minimum-area retiming. To render conventional minimum-area retiming applicable to interconnects, each long interconnect can be represented as a series of interconnect units, each of which has delay but performs no logic function. A natural segmentation of an interconnect can be obtained by buffer insertion, with each interconnect unit being a buffer driving an interconnect segment.

Although minimum-area retiming is optimal in terms of overall area consumption, it may not be directly applicable to interconnect retiming and planning. To minimize the total area consumption, it may relocate flip-flops from regions with a lot of empty space to overcongested regions. That may result in area constraint violations in a given floorplan, necessitating iterations of floorplanning and interconnect planning. Therefore, for interconnect retiming and planning, it is necessary to consider local area constraints such that both the timing and the impact on the floorplan of the relocated flip-flops can be taken into account.

In Ref. [29], a new retiming problem, called the local area constrained (LAC) retiming problem, has been formulated with the following three sets of constraints, of which the first two are typical of the retiming problem [28] and the third captures the local area constraints:

1. Edge weights must be nonnegative:

$$r(v) - r(u) \ge -w(e_{u,v}), \quad \forall\, e_{u,v} \in E_R.$$

2. For any path $u \leadsto v$ whose delay (along successive combinational logic paths) is larger than the clock period $T_{CP}$, there should be at least one flip-flop on it after retiming:

$$r(v) - r(u) \ge -W(u, v) + 1, \quad \forall\, u \leadsto v,\; D(u, v) > T_{CP},$$

where $W(u, v)$ defines the minimum latency for a signal to transfer from $u$ to $v$ before retiming, and $D(u, v)$ is the maximum delay (of successive combinational logic paths) of the logic path from $u$ to $v$ with the minimum latency $W(u, v)$.

3. To define the local area constraints, we let $F$ be the set of all functional units, $V_T$ be the set of all tiles, and for any $t_i \in V_T$, $C(t_i)$ be the remaining capacity (after buffer insertion) that is available for flip-flop insertion. The function $P : F \to V_T$ maps each functional unit $v \in F$ to a tile $t_i \in V_T$ such that $P(v) = t_i$ means that functional unit or interconnect unit $v$ is in tile $t_i$ of the floorplan. The local area constraint of a tile requires that

$$\sum_{P(u) = t_i,\; e_{u,v} \in E_R} \left( w(e_{u,v}) + r(v) - r(u) \right) \le C(t_i), \quad \forall\, t_i \in V_T.$$

As each local area constraint involves more than two retiming variables, the LAC-retiming problem is an integer linear programming problem, which is NP-complete. In Ref. [29], a heuristic based on minimum-area retiming was used to solve the LAC-retiming problem. In minimum-area retiming, all flip-flops are assumed to have the same area cost; thus, the minimization of the total number of flip-flops is equivalent to the minimization of the total area. In LAC-retiming, the insertion of flip-flops into different tiles should take into account the differences in the tile capacities. To achieve that, the LAC-retiming problem was solved in Ref. [29] as a series of weighted minimum-area retiming problems, with the weights of flip-flops adjusted according to the congestion levels in the tiles.
As different weights are assigned to flip-flops in different tiles based on the area consumption and tile capacities in the series of minimum-area retiming problems, flip-flops from overutilized tiles can be repositioned to those with low area consumption.

33.5 CONCLUDING REMARKS

While semiconductor process scaling has enabled integrated circuits of increasingly high performance, it has also created several new design concerns. In this chapter, we have summarized several buffer planning methodologies that tackle the design challenges brought forth by the exponential growth of buffers. Most of these methodologies address both timing and layout closure issues simultaneously by allocating sufficient silicon resources and routing resources during floorplanning or right after floorplanning. As multiple-cycle data communications become increasingly necessary, many of these buffer planning methodologies have been extended to also address the exponential growth of flip-flops (clocked repeaters). The challenge here is to account for the changes in latency introduced by additional flip-flops along global interconnects. While we have presented these planning methodologies in the context of synchronous system design, we believe that these methodologies also have an important role to play in the design of SOCs, NOCs, latency-insensitive systems, and globally asynchronous locally synchronous systems.

It is also important to recognize that the planning methodologies presented in this chapter may have fundamental limits. To a certain extent, the planning methodologies shield the downstream stages of physical synthesis from the problem of inserting a huge fraction of repeaters (and clocked repeaters).
However, empirical studies [33] indicate that it is unlikely that incremental improvements to the physical synthesis technologies can adequately handle the exponential growth in repeater and clocked repeater counts if the scaling continues at the existing pace. Instead, a correct-by-construction design methodology that trades off optimality for predictability has been proposed in Ref. [33]. Perhaps even more alarming is a theoretical study, which is based on Rent's rule [34,35], that demonstrates the necessity of excessively long wires as the number of computing elements within a system continues to grow [36]. As large monolithic designs are unattractive, increased quality, instead of improved capacity, of CAD algorithms and tools should perhaps be the proper objective of future research [36].

REFERENCES

1. J. Cong, T. Kong, and Z. Pan. Buffer block planning for interconnect planning and prediction. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 9(6):929–937, 2001 (ICCAD 1999).
2. C. J. Alpert, J. Hu, S. S. Sapatnekar, and P. G. Villarrubia. A practical methodology for early buffer and wire resource allocation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(5):573–583, 2003 (DAC 2001).
3. P. Sarkar and C.-K. Koh. Routability-driven repeater block planning for interconnect-centric floorplanning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(5):660–671, 2001 (ISPD 2000).
4. J. Cong, L. He, K.-Y. Khoo, C.-K. Koh, and Z. Pan. Interconnect design for deep submicron ICs. In Proceedings of IEEE/ACM International Conference on Computer Aided Design, San Jose, CA, pp. 478–485, 1997.
5. W. C. Elmore. The transient response of damped linear networks with particular regard to wide-band amplifiers. Journal of Applied Physics, 19(1):55–63, January 1948.
6. C. J. Alpert and A. Devgan. Wire segmenting for improved buffer insertion. In Proceedings of ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 588–593, June 1997.
7. F. F. Dragan, A. B. Kahng, I. I. Mandoiu, S. Muddu, and A. Zelikovsky. Provably good global buffering by generalized multiterminal multicommodity flow approximation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 21(3):263–274, 2002 (ASPDAC 2001).
8. F. F. Dragan, A. B. Kahng, S. Muddu, and A. Zelikovsky. Provably good global buffering using an available buffer block plan. In Proceedings of IEEE/ACM International Conference on Computer Aided Design, San Jose, CA, pp. 104–109, 2000.
9. X. Tang and D. F. Wong. Network flow based buffer planning. Integration, 30(2):143–155, 2001 (ISPD 2000).
10. Y.-H. Cheng and Y.-W. Chang. Integrating buffer planning with floorplanning for simultaneous multi-objective optimization. In Proceedings of IEEE/ACM Asia South Pacific Design Automation Conference, pp. 624–627, Piscataway, NJ, 2004. IEEE Press.
11. H.-R. Jiang, Y.-W. Chang, J.-Y. Jou, and K.-Y. Chao. Simultaneous floorplan and buffer block optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(5):694–703, 2004 (ASPDAC 2003).
12. Y. Ma, X. Hong, S. Dong, S. Chen, Y. Cai, C. K. Cheng, and J. Gu. Dynamic global buffer planning optimization based on detail block locating and congestion analysis. In Proceedings of ACM/IEEE Design Automation Conference, pp. 806–811, New York, 2003. ACM Press.
13. R. McInerney, M. Page, K. Leeper, T. Hillie, H. Chan, and B. Basaran. Methodology for repeater insertion management in the RTL, layout, floorplan, and fullchip timing databases of the Itanium microprocessor. In Proceedings of ACM International Symposium on Physical Design, San Diego, CA, pp. 99–104, 2000.
14. L. P. P. P. van Ginneken. Buffer placement in distributed RC-tree networks for minimal Elmore delay. In Proceedings of IEEE International Symposium on Circuits and Systems, New Orleans, LA, pp. 865–868, 1990.
15. S. Chen, X. Hong, S. Dong, Y. Ma, Y. Cai, C.-K. Cheng, and J. Gu. A buffer planning algorithm based on dead space redistribution. In ASP-DAC '03: Proceedings of the 2003 Conference on Asia South Pacific Design Automation, pp. 435–438, Piscataway, NJ, 2003. IEEE Press.
16. S. Chen, X. Hong, S. Dong, Y. Ma, Y. Cai, C.-K. Cheng, and J. Gu. A buffer planning algorithm with congestion optimization. In Proceedings of IEEE/ACM Asia South Pacific Design Automation Conference, pp. 615–620, Piscataway, NJ, 2004. IEEE Press.
17. C. W. Sham and E. F. Young. Routability driven floorplanner with buffer block planning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(4):470–480, 2003 (ISPD 2002).
18. K. K. Wong and E. F. Young. Fast buffer planning and congestion optimization in interconnect-driven floorplanning. In Proceedings of IEEE/ACM Asia South Pacific Design Automation Conference, Kitakyushu, Japan, pp. 411–416, 2003.
19. C. Albrecht, A. B. Kahng, I. Mandoiu, and A. Zelikovsky. Floorplan evaluation with timing-driven global wireplanning, pin assignment and buffer/wire sizing. In Proceedings of IEEE/ACM Asia South Pacific Design Automation Conference, Bangalore, India, pp. 580–591, 2002.
20. Y. Ma, X. Hong, S. Dong, S. Chen, C.-K. Cheng, and J. Gu. Buffer planning as an integral part of floorplanning with consideration of routing congestion. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(4):609–621, 2005 (ISPD 2003, ASPDAC 2004).
21. H. Xiang, X. Tang, and D. F. Wong. An algorithm for integrated pin assignment and buffer planning. ACM Transactions on Design Automation of Electronic Systems, 10(3):561–572, 2005 (DAC 2002).
22. P. Sarkar and C.-K. Koh. Repeater block planning under simultaneous delay and transition time constraints. In Proceedings of IEEE/ACM Design, Automation and Test in Europe Conference, pp. 540–545, Piscataway, NJ, 2001. IEEE Press.
23. S.-M. Li, Y.-H. Cherng, and Y.-W. Chang. Noise-aware buffer planning for interconnect-driven floorplanning. In Proceedings of IEEE/ACM Asia South Pacific Design Automation Conference, Kitakyushu, Japan, pp. 423–426, 2003.
24. A. Devgan. Efficient coupled noise estimation for on-chip interconnects. In Proceedings of IEEE/ACM International Conference on Computer Aided Design, San Jose, CA, pp. 147–153, 1997.
25. D. Matzke. Will physical scalability sabotage performance gains? IEEE Computers, 8:37–39, September 1997.
26. P. Cocchini. A methodology for optimal repeater insertion in pipelined interconnects. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(12):1613–1624, 2003 (ICCAD 2002).
27. R. Lu, G. Zhong, C.-K. Koh, and K.-Y. Chao. Flip-flop and repeater insertion for early interconnect planning. In Proceedings of IEEE/ACM Design, Automation and Test in Europe Conference, Paris, France, pp. 690–695, March 2002.
28. C. E. Leiserson and J. B. Saxe. Retiming synchronous circuitry. Algorithmica, 6:5–35, 1991.
29. R. Lu and C.-K. Koh. Interconnect planning with local area constrained retiming. In Proceedings of IEEE/ACM Design, Automation and Test in Europe Conference, Messe Munich, Germany, pp. 442–447, March 2003.
30. C. C. Chu, E. F. Young, D. K. Tong, and S. Dechu. Retiming with interconnect delay. In Proceedings of IEEE/ACM International Conference on Computer Aided Design, San Jose, CA, pp. 221–226, 2003.
31. C. Lin and H. Zhou. Retiming for wire pipelining in system-on-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(9):1338–1345, 2004 (ICCAD 2003).