Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C026 Finals Page 552 29-9-2008 #19 552 Handbook of Algorithms for Physical Design Automation

FIGURE 26.14 If $\alpha_1$ and $\alpha_2$ satisfy the condition in Definition 1 at $v_1$, $\alpha_2$ is redundant. (From Shi, W. and Li, Z., IEEE Trans Computer-Aided Design, 24, 879, 2005. With permission.)

node being processed. However, all candidates at this node must be propagated further upstream toward the source. This means the load seen at this node must be driven by some minimal amount of upstream wire or gate resistance. By anticipating the upstream resistance ahead of time, one can prune more potentially inferior candidates earlier rather than later, which reduces the total number of candidates generated. More specifically, assume that each candidate must be driven by an upstream resistance of at least $R_{\min}$. Pruning based on this anticipated upstream resistance is called predictive pruning.

Definition 1 Predictive Pruning. Let $\alpha_1$ and $\alpha_2$ be two nonredundant candidates of $T(v)$ such that $C(\alpha_1) < C(\alpha_2)$ and $Q(\alpha_1) < Q(\alpha_2)$. If $Q(\alpha_2) - R_{\min} \cdot C(\alpha_2) \le Q(\alpha_1) - R_{\min} \cdot C(\alpha_1)$, then $\alpha_2$ is pruned.

Predictive pruning preserves optimality. The general situation is shown in Figure 26.14. Let $\alpha_1$ and $\alpha_2$ be candidates of $T(v_1)$ that satisfy the condition in Definition 1. Using $\alpha_1$ instead of $\alpha_2$ will not increase the delay from $v$ to the sinks in $v_2, \ldots, v_k$. It is easy to see that $C(v, \alpha_1) < C(v, \alpha_2)$. If $Q$ at $v$ is determined by $T(v_1)$, we have

$$Q(v, \alpha_1) - Q(v, \alpha_2) = Q(v_1, \alpha_1) - Q(v_1, \alpha_2) - R_{\min} \cdot [C(v_1, \alpha_1) - C(v_1, \alpha_2)] \ge 0$$

Therefore, $\alpha_2$ is redundant.

The predictive pruning technique prunes more redundant solutions while guaranteeing optimality. It is one of four key techniques of the fast algorithms proposed in Ref. [39]. In Ref. [42], significant speedup is achieved by simply extending the predictive pruning technique to buffer cost.
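As an illustration of Definition 1, the following minimal Python sketch applies predictive pruning to a candidate list. It assumes the candidates are given as hypothetical `(q, c)` pairs that are already nondominated and sorted by increasing capacitance (hence also increasing slack); the function name and data layout are illustrative and are not taken from Refs. [39,42]. Because every kept candidate has a strictly larger value of $Q - R_{\min} \cdot C$ than the one kept before it, comparing against the last kept candidate is enough.

```python
def predictive_prune(cands, r_min):
    """Apply Definition 1 to a sorted list of nondominated (q, c) candidates.

    cands must be sorted by increasing c (and therefore increasing q).
    r_min is the assumed lower bound on the upstream resistance.
    """
    kept = []
    for q, c in cands:
        if kept:
            q1, c1 = kept[-1]
            # Prune (q, c) if the previous kept candidate is at least as good
            # once the anticipated upstream resistance is accounted for.
            if q - r_min * c <= q1 - r_min * c1:
                continue
        kept.append((q, c))
    return kept


candidates = [(10.0, 1.0), (12.0, 2.0), (12.5, 4.0), (20.0, 5.0)]
pruned = predictive_prune(candidates, r_min=1.0)
```

With `r_min = 0` the test reduces to the ordinary slack comparison, so a nondominated list is returned unchanged; larger values of `r_min` remove more candidates.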
Aggressive predictive pruning, which uses a resistance larger than $R_{\min}$ to prune candidates, is proposed in Ref. [43] to achieve further speedup with a little degradation of solution quality.

26.5.3 CONVEX PRUNING

The basic data structure of van Ginneken's algorithm is a sorted list of nondominated candidates. Both the pruning in van Ginneken's algorithm and predictive pruning are performed by comparing two neighboring candidates at a time. However, more potentially inferior candidates can be pruned by comparing three neighboring candidate solutions simultaneously. For three solutions in the sorted list, the middle one may be pruned according to convex pruning.

Definition 2 Convex Pruning. Let $\alpha_1$, $\alpha_2$, and $\alpha_3$ be three nonredundant candidates of $T(v)$ such that $C(\alpha_1) < C(\alpha_2) < C(\alpha_3)$ and $Q(\alpha_1) < Q(\alpha_2) < Q(\alpha_3)$. If

$$\frac{Q(\alpha_2) - Q(\alpha_1)}{C(\alpha_2) - C(\alpha_1)} < \frac{Q(\alpha_3) - Q(\alpha_2)}{C(\alpha_3) - C(\alpha_2)} \tag{26.25}$$

then we call $\alpha_2$ nonconvex, and prune it.

Convex pruning can be explained by Figure 26.15. Consider $Q$ as the Y-axis and $C$ as the X-axis. Then candidates are points in the two-dimensional plane. It is easy to see that the set of nonredundant candidates $N(v)$ is a monotonically increasing sequence. Candidate $\alpha_2 = (Q_2, C_2)$ in the above definition is shown in Figure 26.15a, and is pruned in Figure 26.15b. The set of nonredundant candidates after convex pruning, $M(v)$, is a convex hull.

FIGURE 26.15 (a) Nonredundant candidates $N(v)$ and (b) nonredundant candidates $M(v)$ after convex pruning. (From Li, Z. and Shi, W., IEEE Trans Computer-Aided Design, 25, 484, 2006. With permission.)

For two-pin nets, convex pruning preserves optimality. Let $\alpha_1$, $\alpha_2$, and $\alpha_3$ be candidates of $T(v)$ that satisfy the condition in Definition 2.
In Figure 26.15, let the slope between $\alpha_1$ and $\alpha_2$ be $\rho_{1,2}$, and the slope between $\alpha_2$ and $\alpha_3$ be $\rho_{2,3}$. If candidate $\alpha_2$ is not on the convex hull of the solution set, then $\rho_{1,2} < \rho_{2,3}$. These candidates must have a certain upstream resistance $R$, including wire resistance and buffer/driver resistance. If $R < \rho_{2,3}$, $\alpha_2$ must become inferior to $\alpha_3$ when both candidates are propagated to the upstream node. Otherwise, $R \ge \rho_{2,3}$, which implies $R > \rho_{1,2}$, and therefore $\alpha_2$ must become inferior to $\alpha_1$. In other words, if a candidate is not on the convex hull, it will be pruned either by the solution ahead of it or by the solution behind it. Note that this conclusion only applies to two-pin nets. For multipin nets, where the upstream node could be a merging vertex, nonredundant candidates that are pruned by convex pruning could still be useful.

Convex pruning of a list of nonredundant candidates sorted in increasing $(Q, C)$ order can be performed in linear time by Graham's scan. Furthermore, when a new candidate is inserted into the list, we only need to check its neighbors to decide whether any candidate should be pruned under convex pruning. The time is $O(1)$, amortized over all candidates.

In Refs. [40,41], convex pruning is used to form the convex hull of nonredundant candidates, which is the key component of the $O(bn^2)$ algorithm and the $O(mn)$ algorithm. In Ref. [43], convex pruning (called squeeze pruning) is performed on both two-pin and multipin nets to prune more solutions with a little degradation of solution quality.

26.5.4 EFFICIENT WAY TO FIND BEST CANDIDATES

Assume $v$ is a buffer position, and we have computed the set of nonredundant candidates $N'(v)$ for $T(v)$, where $N'(v)$ does not include candidates with buffers inserted at $v$.
Now we want to insert buffers at $v$ and compute $N(v)$. Define $P_i(v, \alpha)$ as the slack at $v$ if we add a buffer of type $B_i$ for any candidate $\alpha$:

$$P_i(v, \alpha) = Q(v, \alpha) - R(B_i) \cdot C(v, \alpha) - K(B_i) \tag{26.26}$$

If we do not insert any buffer, then every candidate in $N'(v)$ is a candidate in $N(v)$. If we insert a buffer, then for every buffer type $B_i$, $i = 1, 2, \ldots, b$, there will be a new candidate $\beta_i$:

$$Q(v, \beta_i) = \max_{\alpha \in N'(v)} \{P_i(v, \alpha)\}, \qquad C(v, \beta_i) = C(B_i)$$

Define the best candidate for $B_i$ as the candidate $\alpha \in N'(v)$ that maximizes $P_i(v, \alpha)$ among all candidates in $N'(v)$. If there are multiple $\alpha$'s that maximize $P_i(v, \alpha)$, choose the one with minimum $C$. In van Ginneken's algorithm, it takes $O(bn)$ to find one best candidate at each buffer position.

According to convex pruning, it is easy to see that all best candidates are on the convex hull. The following lemma says that if we sort candidates in increasing $Q$ and $C$ order from left to right, then as we add wires to the candidates, we always move to the left to find the best candidates.

Lemma 1 For any $T(v)$, let the nonredundant candidates after convex pruning be $\alpha_1, \alpha_2, \ldots, \alpha_k$, in increasing $Q$ and $C$ order. Now add wire $e$ to each candidate $\alpha_j$ and denote it as $\alpha_j + e$. For any buffer type $B_i$, if $\alpha_j$ gives the maximum $P_i(\alpha_j)$ and $\alpha_k$ gives the maximum $P_i(\alpha_k + e)$, then $k \le j$.

The following lemma says that the best candidate can be found by local search if all candidates are convex.

Lemma 2 For any $T(v)$, let the nonredundant candidates after convex pruning be $\alpha_1, \alpha_2, \ldots, \alpha_k$, in increasing $Q$ and $C$ order.
If $P_i(\alpha_{j-1}) \le P_i(\alpha_j)$ and $P_i(\alpha_j) \ge P_i(\alpha_{j+1})$, then $\alpha_j$ is the best candidate for buffer type $B_i$ and

$$P_i(\alpha_1) \le \cdots \le P_i(\alpha_{j-1}) \le P_i(\alpha_j)$$
$$P_i(\alpha_j) \ge P_i(\alpha_{j+1}) \ge \cdots \ge P_i(\alpha_k)$$

With the above two lemmas and convex pruning, one best candidate is found in amortized $O(n)$ time in Ref. [40] and $O(b)$ time in Ref. [41],∗ which are more efficient than van Ginneken's algorithm.

26.5.5 IMPLICIT REPRESENTATION

Van Ginneken's algorithm uses an explicit representation to store slack and capacitance values, and therefore it takes $O(bn)$ time when adding a wire. It is possible to use an implicit representation to avoid explicit updating of candidates.

In the implicit representation, $C(v, \alpha)$ and $Q(v, \alpha)$ are not explicitly stored for each candidate. Instead, each candidate contains five fields: $q$, $c$, $qa$, $ca$, and $ra$.† When $qa$, $ca$, and $ra$ are all 0, $q$ and $c$ give $Q(v, \alpha)$ and $C(v, \alpha)$, respectively. When a wire is added, only the $qa$, $ca$, and $ra$ stored in the root of the tree [39], or kept as global variables [41], are updated. Intuitively, $qa$ represents extra wire delay, $ca$ represents extra wire capacitance, and $ra$ represents extra wire resistance.

It takes only $O(1)$ time to add a wire with the implicit representation [39,41]. For example, in Ref. [41], when we reach an edge $e$ with resistance $R(e)$ and capacitance $C(e)$, $qa$, $ra$, and $ca$ are updated to reflect the new values of $Q$ and $C$ of all previous candidates in $O(1)$ time, without actually touching any candidate:

$$qa = qa + R(e) \cdot C(e)/2 + R(e) \cdot ca$$
$$ca = ca + C(e)$$
$$ra = ra + R(e)$$

∗ In Ref. [40], Lemma 1 is presented differently. It says that if all buffers are sorted in decreasing order of driving resistance, then the best candidates for each buffer type in that order run from left to right.

† In Ref. [41], only two fields, $q$ and $c$, are necessary for each candidate; $qa$, $ca$, and $ra$ are global variables for each two-pin segment.
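The bookkeeping can be checked with a small Python sketch. It keeps the three offsets as shared fields, in the spirit of the global-variable variant attributed to Ref. [41], and decodes a candidate with the rule stated in Equation 26.27. The class and function names are hypothetical, and the explicit reference update assumes the usual Elmore delay of a wire segment, $R(e) \cdot (C(e)/2 + C)$; neither is taken from the cited papers.

```python
class ImplicitCandidateList:
    """Candidates stored as (q, c) pairs plus shared offsets qa, ca, ra."""

    def __init__(self, candidates):
        self.items = list(candidates)  # stored (q, c) fields, never modified by add_wire
        self.qa = 0.0                  # accumulated extra wire delay
        self.ca = 0.0                  # accumulated extra wire capacitance
        self.ra = 0.0                  # accumulated extra wire resistance

    def add_wire(self, r_e, c_e):
        """O(1) wire addition; qa is updated first so that it uses the old ca."""
        self.qa += r_e * c_e / 2 + r_e * self.ca
        self.ca += c_e
        self.ra += r_e

    def actual(self, i):
        """Decode candidate i: Q = q - qa - ra*c and C = c + ca (Equation 26.27)."""
        q, c = self.items[i]
        return (q - self.qa - self.ra * c, c + self.ca)


def explicit_add_wire(candidates, r_e, c_e):
    """Reference O(n) update that touches every candidate (assumed Elmore wire delay)."""
    return [(q - r_e * (c_e / 2 + c), c + c_e) for q, c in candidates]


stored = [(10.0, 1.0), (15.0, 3.0)]
implicit = ImplicitCandidateList(stored)
explicit = stored
for r_e, c_e in [(2.0, 1.0), (0.5, 4.0)]:
    implicit.add_wire(r_e, c_e)
    explicit = explicit_add_wire(explicit, r_e, c_e)
decoded = [implicit.actual(i) for i in range(len(stored))]
```

After both wires are added, `decoded` agrees with `explicit`, while the stored `(q, c)` pairs are untouched. The order of the three updates matters: $qa$ has to be computed with the value of $ca$ from before the new wire is added.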
The actual values of $Q$ and $C$ of each candidate $\alpha$ are decided as follows:

$$Q(\alpha) = q - qa - ra \cdot c, \qquad C(\alpha) = c + ca \tag{26.27}$$

The implicit representation is applied to a balanced tree in Ref. [39], where the operation of adding a wire takes $O(b \log n)$ time. It is applied to a sorted linked list in Ref. [41], where the operation of adding a wire takes $O(1)$ time.

REFERENCES

1. J. Cong. An interconnect-centric design flow for nanometer technologies. Proceedings of IEEE, 89(4): 505–528, April 2001.
2. J. A. Davis, R. Venkatesan, A. Kaloyeros, M. Beylansky, S. J. Souri, K. Banerjee, K. C. Saraswat, A. Rahman, R. Reif, and J. D. Meindl. Interconnect limits on gigascale integration (GSI) in the 21st century. Proceedings of IEEE, 89(3): 305–324, March 2001.
3. R. Ho, K. W. Mai, and M. A. Horowitz. The future of wires. Proceedings of IEEE, 89(4): 490–504, April 2001.
4. A. B. Kahng and G. Robins. On Optimal Interconnections for VLSI. Kluwer Academic Publishers, Boston, MA, 1995.
5. J. Cong, L. He, C.-K. Koh, and P. H. Madden. Performance optimization of VLSI interconnect layout. Integration: The VLSI Journal, 21: 1–94, 1996.
6. P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick. Repeater scaling and its impact on CAD. IEEE Transactions on Computer-Aided Design, 23(4): 451–463, April 2004.
7. J. Cong. Challenges and opportunities for design innovations in nanometer technologies. SRC Design Sciences Concept Paper, 1997.
8. M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. John Wiley & Sons, NY, 1993.
9. C. J. Alpert and A. Devgan. Wire segmenting for improved buffer insertion. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 588–593, 1997.
10. C. C. N. Chu and D. F. Wong. Closed form solution to simultaneous buffer insertion/sizing and wire sizing.
ACM Transactions on Design Automation of Electronic Systems, 6(3): 343–371, July 2001.
11. L. P. P. P. van Ginneken. Buffer placement in distributed RC-tree networks for minimal Elmore delay. In Proceedings of the IEEE International Symposium on Circuits and Systems, New Orleans, LA, pp. 865–868, 1990.
12. J. Lillis, C. K. Cheng, and T. Y. Lin. Optimal wire sizing and buffer insertion for low power and a generalized delay model. IEEE Journal of Solid-State Circuits, 31(3): 437–447, March 1996.
13. N. Menezes and C.-P. Chen. Spec-based repeater insertion and wire sizing for on-chip interconnect. In Proceedings of the International Conference on VLSI Design, Goa, India, pp. 476–483, 1999.
14. L.-D. Huang, M. Lai, D. F. Wong, and Y. Gao. Maze routing with buffer insertion under transition time constraints. IEEE Transactions on Computer-Aided Design, 22(1): 91–95, January 2003.
15. C. J. Alpert, A. B. Kahng, B. Liu, I. I. Mandoiu, and A. Z. Zelikovsky. Minimum buffered routing with bounded capacitive load for slew rate and reliability control. IEEE Transactions on Computer-Aided Design, 22(3): 241–253, March 2003.
16. C. Kashyap, C. J. Alpert, F. Liu, and A. Devgan. Closed form expressions for extending step delay and slew metrics to ramp inputs. In Proceedings of the ACM International Symposium on Physical Design, Monterey, CA, pp. 24–31, 2003.
17. H. B. Bakoglu. Circuits, Interconnections and Packaging for VLSI. Addison-Wesley, Reading, MA, 1990.
18. N. H. E. Weste and K. Eshraghian. Principles of CMOS VLSI Design: A System Perspective. Addison-Wesley Publishing Company, Reading, MA, 1993.
19. S. Hu, C. J. Alpert, J. Hu, S. Karandikar, Z. Li, W. Shi, and C.-N. Sze. Fast algorithms for slew constrained minimum cost buffering. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 308–313, 2006.
20. J. Cong and C. K. Koh. Simultaneous driver and wire sizing for performance and power optimization. IEEE Transactions on VLSI Systems, 2(4): 408–425, December 1994.
21. S. S. Sapatnekar. RC interconnect optimization under the Elmore delay model. In Proceedings of the ACM/IEEE Design Automation Conference, San Diego, CA, pp. 392–396, 1994.
22. J. Cong and K.-S. Leung. Optimal wiresizing under the distributed Elmore delay model. IEEE Transactions on Computer-Aided Design, 14(3): 321–336, March 1995.
23. J. P. Fishburn and C. A. Schevon. Shaping a distributed RC line to minimize Elmore delay. IEEE Transactions on Circuits and Systems, 42(12): 1020–1022, December 1995.
24. C. P. Chen, Y. P. Chen, and D. F. Wong. Optimal wire-sizing formula under the Elmore delay model. In Proceedings of the ACM/IEEE Design Automation Conference, Las Vegas, NV, pp. 487–490, 1996.
25. C. J. Alpert, A. Devgan, J. P. Fishburn, and S. T. Quay. Interconnect synthesis without wire tapering. IEEE Transactions on Computer-Aided Design, 20(1): 90–104, January 2001.
26. A. Devgan. Efficient coupled noise estimation for on-chip interconnects. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 147–151, 1997.
27. C. J. Alpert, A. Devgan, and S. T. Quay. Buffer insertion for noise and delay optimization. IEEE Transactions on Computer-Aided Design, 18(11): 1633–1645, November 1999.
28. C. C. N. Chu and D. F. Wong. A new approach to simultaneous buffer insertion and wire sizing. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 614–621, 1997.
29. W. C. Elmore. The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics, 19: 55–63, January 1948.
30. F. J. Liu, J. Lillis, and C. K. Cheng.
Design and implementation of a global router based on a new layout-driven timing model with three poles. In Proceedings of the IEEE International Symposium on Circuits and Systems, Hong Kong, China, pp. 1548–1551, 1997.
31. J. Qian, S. Pullela, and L. T. Pillage. Modeling the effective capacitance for the RC interconnect of CMOS gates. IEEE Transactions on Computer-Aided Design, 13(12): 1526–1535, December 1994.
32. S. R. Nassif and Z. Li. A more effective C_eff. In Proceedings of the IEEE International Symposium on Quality Electronic Design, San Jose, CA, pp. 648–653, 2005.
33. B. Tutuianu, F. Dartu, and L. Pileggi. Explicit RC-circuit delay approximation based on the first three moments of the impulse response. In Proceedings of the ACM/IEEE Design Automation Conference, Las Vegas, NV, pp. 611–616, 1996.
34. C. J. Alpert, F. Liu, C. V. Kashyap, and A. Devgan. Closed-form delay and slew metrics made easy. IEEE Transactions on Computer-Aided Design, 23(12): 1661–1669, December 2004.
35. C. J. Alpert, A. Devgan, and S. T. Quay. Buffer insertion with accurate gate and interconnect delay computation. In Proceedings of the ACM/IEEE Design Automation Conference, New Orleans, LA, pp. 479–484, 1999.
36. C.-K. Cheng, J. Lillis, S. Lin, and N. Chang. Interconnect Analysis and Synthesis. Wiley Interscience, New York, 2000.
37. S. Hassoun, C. J. Alpert, and M. Thiagarajan. Optimal buffered routing path constructions for single and multiple clock domain systems. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 247–253, 2002.
38. P. Cocchini. A methodology for optimal repeater insertion in pipelined interconnects. IEEE Transactions on Computer-Aided Design, 22(12): 1613–1624, December 2003.
39. W. Shi and Z. Li. A fast algorithm for optimal buffer insertion. IEEE Transactions on Computer-Aided Design, 24(6): 879–891, June 2005.
40. Z. Li and W. Shi. An O(bn^2) time algorithm for buffer insertion with b buffer types.
IEEE Transactions on Computer-Aided Design, 25(3): 484–489, March 2006.
41. Z. Li and W. Shi. An O(mn) time algorithm for optimal buffer insertion of nets with m sinks. In Proceedings of Asia and South Pacific Design Automation Conference, Yokohama, Japan, pp. 320–325, 2006.
42. W. Shi, Z. Li, and C. J. Alpert. Complexity analysis and speedup techniques for optimal buffer insertion with minimum cost. In Proceedings of Asia and South Pacific Design Automation Conference, Yokohama, Japan, pp. 609–614, 2004.
43. Z. Li, C. N. Sze, C. J. Alpert, J. Hu, and W. Shi. Making fast buffer insertion even faster via approximation techniques. In Proceedings of Asia and South Pacific Design Automation Conference, Shanghai, China, pp. 13–18, 2005.

27 Generalized Buffer Insertion

Miloš Hrkić and John Lillis

CONTENTS
27.1 Introduction
27.2 Two-Phase Approach and Buffer-Aware Tree Construction
27.2.1 C-Tree Algorithm
27.2.2 Buffer Tree Topology Generation
27.3 Simultaneous Tree Construction and Buffer Insertion
27.3.1 P-Tree Algorithm
27.3.2 S-Tree Algorithm
27.3.3 SP-Tree Algorithm
27.3.4 Complete Tree Topology Exploration
References

27.1 INTRODUCTION

It has been widely recognized that interconnect is a dominating factor in modern very large scale integration (VLSI) circuit designs. Chapter 26 gave an overview of the challenges that interconnect faces and introduced a technique called repeater insertion that has proven to be very efficient in addressing emerging interconnect issues.

Early work on repeater insertion focused mainly on improving interconnect timing performance. The most influential work is van Ginneken's dynamic programming algorithm [1]. The algorithm performs buffer insertion on a fixed and embedded tree (e.g., as given by a global router) and produces an optimal timing solution under the Elmore delay model [2].
Various generalizations of van Ginneken's algorithm have appeared in the literature, taking into account issues of practical importance such as buffer libraries with inverting and noninverting buffers, simultaneous wire sizing, and slew-based delay models. Additionally, generalizations that address natural constrained optimization variants of the problem (e.g., minimization of area or power consumption subject to timing constraints) have also appeared. Progress has also been made in improving computational complexity as well as practical runtime. Many of these results are presented in Chapter 26.

A significant limitation of van Ginneken's approach is that it requires a fixed and embedded tree that has to be provided in advance. This constraint forces the final buffered solution quality to depend on the input tree. Even though the algorithm provides an optimal timing solution for a given tree, it will produce a poor solution when given a poor tree. A few example scenarios that are very common in practice can be used to illustrate this limitation.

As noted earlier, one of the basic interconnect optimization tasks is delay minimization. Given that sinks may have very different required signal arrival time constraints, a routing solution that focuses only on, for example, minimizing wirelength may not be good enough. In Figure 27.1, sinks F and G are timing critical while the others are not. The configuration in Figure 27.1a has better wirelength, but its buffering cost is very high. On the other hand, the configuration in Figure 27.1b can achieve better timing results with slightly more wirelength but many fewer buffers.
FIGURE 27.1 Buffering example: Sinks F and G are assumed to be critical; tree (a) has slightly smaller wirelength but requires more buffers (and may prevent timing constraints on F and G from being met) than tree (b).

FIGURE 27.2 Buffering example: To meet signal polarity requirements, the number of buffers that is required varies significantly from one topology to another.

In some cases, certain sinks of a net require input signals of inverted polarity. Choices made during route construction can have a large impact on the cost of buffering solutions, as we can see in Figure 27.2. The two solutions in Figure 27.2 have very different buffer and wiring costs.

Figure 27.3 shows a simple example illustrating the issues raised during buffering and routing in the presence of blockages. In the configuration of Figure 27.3a, the route goes over the blockage and cannot be buffered (thus possibly violating timing, load, or slew constraints). If the route completely avoids the blockage, the resulting solution is expensive in terms of wire and buffer costs (Figure 27.3b). Finally, by being aware of different types of blockages, the configuration in Figure 27.3c dominates both in delay and in resource usage and cost.

Recently, some designs have reserved internal areas of macro objects for buffering of external nets (e.g., the whitespace in macros as in Figure 27.4). Any buffer insertion algorithm that has to work on a route that is not aware of the layout specifics will have limited chances of success. Referring to Figure 27.4, and assuming that sink A is critical and the others are not, the two solutions in Figure 27.4 can differ significantly in quality (e.g., cost or timing characteristics).
In other practical formulations, routing or buffering feasibility is not considered a zero or one property (blocked or free). Instead, a complex cost function based on the local and global design densities and congestion should drive routing and buffering algorithms; such formulations can prevent overconstraining the design space, but require incremental interaction with placers and routers. Moreover, overall design closure can suffer because irresponsible use of buffering resources on nets (or portions of nets) that are not critical can prevent other critical nets from meeting their constraints.∗

Given the examples above, routing and buffering algorithms should be able to account for the cost/performance trade-off of the solutions that they produce. Generating the fastest buffering solution may be necessary for some nets, but if applied to all nets, the design would quickly become too expensive (e.g., in area and power usage), or even become impossible to manufacture.

∗ Some of the approaches that are specifically designed to target blockages (routing or placement) as well as design density and congestion are presented in more detail in Chapter 28. However, some of the ideas will be reviewed in this chapter because they are among the core components of some tree synthesis and buffering algorithms.

FIGURE 27.3 Buffering example: Depending on the interaction between routes and blockages, a buffered solution can be (a) infeasible, (b) expensive, or (c) not bad at all.

FIGURE 27.4 Buffering example: With increasing complexity of constraints, the ability of buffering algorithms to handle such constraints is becoming more important. Assuming that sink A is critical, solutions (a) and (b) can have a significant quality difference.
In addition, algorithm complexity and runtime are very important practical factors, given that hundreds of thousands of nets may need to be buffered within a given CPU time budget.

In the following sections, we give an overview of recent research that addresses one or more of the problems mentioned above. This area of research is still very active, and our summary presents only a snapshot of past and current research.

The majority of techniques that address the problems mentioned above can be placed in one of two categories. Several works propose a two-stage sequential method in which a buffer-aware tree is constructed first, followed by van Ginneken style buffer insertion, as in Refs. [3–6]. These techniques have small execution time with some sacrifice in solution quality and predictability. In Section 27.2, we describe the techniques from Refs. [3,6] in more detail.

A more robust and predictable approach performs route construction and buffer insertion simultaneously. An example is the buffered P-Tree class of algorithms [7], which integrates buffer insertion into the P-Tree Steiner tree construction algorithm [8]. The P-Tree algorithm introduced a paradigm of finding an optimal solution in a constrained, but very large, space including topological, embedding, and buffering degrees of freedom, as opposed to applying ad hoc heuristics. Section 27.3 presents methods for simultaneous routing tree construction and buffer insertion from Refs. [7–12].

27.2 TWO-PHASE APPROACH AND BUFFER-AWARE TREE CONSTRUCTION

27.2.1 C-TREE ALGORITHM

The work in Ref. [3] addresses the problem of buffering under timing and polarity constraints.
Given a net with placed pins, timing and polarity requirements at sinks, driver properties, a buffer library, and the technology's interconnect parasitics, the goal is to find a Steiner tree that, after buffer insertion, meets timing constraints while minimizing solution cost (i.e., wire and buffer usage). A two-phase flow is proposed: a buffer-aware Steiner tree construction called C-Tree is followed by a van Ginneken style buffer insertion. It is argued that an optimal buffer insertion on a fixed and routed tree can produce good/optimal results as long as it is given the right Steiner tree. However, in practice, instead of finding the right tree (which is very difficult because the tree construction algorithm is not optimizing the true objective), one can construct a buffer-aware Steiner tree, which tries to anticipate potential buffer locations.

The main idea in C-Tree (clustered tree) is to construct a tree in two stages. First, sinks are clustered based on a distance metric (timing criticality, polarity requirements, physical distance). Then, lower level trees are constructed on each cluster. After determining tapping points for each cluster, the top-level timing-driven tree is constructed, connecting the driver with the cluster tapping points. Merging the top-level tree with the cluster trees yields a final tree for the entire net.

Sink properties used for clustering are spatial (physical location coordinates), temporal (required arrival times), and polarity. The distance metric incorporates all three elements. They are defined separately and then combined using scaling factors into a single distance metric. The spatial distance is given by $\mathrm{sDist}(s_i, s_j) = |x(s_i) - x(s_j)| + |y(s_i) - y(s_j)|$. The polarity distance is defined as $\mathrm{pDist}(s_i, s_j) = |\mathrm{pol}(s_i) - \mathrm{pol}(s_j)|$. As for the temporal distance, Ref. [3] argues that required arrival time is not the only indicator of sink criticality.
For example, if two sinks $s_1$ and $s_2$ have the same required arrival time, and $s_1$ is further away from the driver, then $s_1$ is more critical because it is harder to achieve the same required arrival time over the longer distance. Thus, an estimate of the achievable delay is used to adjust the required arrival time and obtain the achievable slack (AS). It is further argued that the difference in achievable slacks may not yet be good enough. For example, if $\mathrm{AS}(s_1) = -1$ ns, $\mathrm{AS}(s_2) = 1$ ns, and $\mathrm{AS}(s_3) = 10$ ns, sinks $s_1$ and $s_2$ seem closer, although in practice $s_1$ is the only critical sink because $s_2$ and $s_3$ have high positive AS. Thus, the sink criticality is defined as

$$\mathrm{crit}(s_i) = e^{\alpha[\mathrm{mAS} - \mathrm{AS}(s_i)]/(\mathrm{aAS} - \mathrm{mAS})}$$

where mAS and aAS are the minimum and average AS values over all sinks and $\alpha > 0$ is a user parameter. The criticality is a value between 0 and 1, where 1 is the most critical. A sink with average AS gets criticality $e^{-\alpha}$, so the average sink criticality by this formula is closer to noncritical. The temporal distance $\mathrm{tDist}(s_i, s_j)$ is now defined as the difference in sink criticality. Finally, the distance metric is a linear combination of the spatial, temporal, and polarity distances (noting that the spatial distance is normalized by the spatial diameter $\mathrm{sDiam}(N)$, defined as the maximum distance between the sinks):

$$\beta\,[\mathrm{sDist}(s_i, s_j)/\mathrm{sDiam}(N)] + (1 - \beta)\,\mathrm{tDist}(s_i, s_j) + \mathrm{pDist}(s_i, s_j)$$

The clustering itself is done using a K-center heuristic. It is an iterative approach, which identifies sinks that are furthest away and labels them as cluster seeds. The remaining sinks are then clustered around the closest seed. More details can be found in Ref. [3]. Once the clusters are determined, timing-driven Steiner trees are constructed on each cluster and one on the top level using the Prim–Dijkstra algorithm from Ref. [13].
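The distance metric and the clustering step can be illustrated with a short Python sketch. Everything named here is hypothetical: sinks are plain tuples, the temporal distance is taken to be the absolute difference in criticality, the degenerate cases (all slacks equal, all sinks coincident) are handled by arbitrary choices, and the clustering is a generic farthest-first K-center heuristic seeded at the first sink, which may differ in detail from the procedure in Ref. [3].

```python
import math

# A sink is (x, y, polarity, achievable_slack); polarity is 0 or 1.

def criticalities(slacks, alpha):
    """crit = exp(alpha * (mAS - AS) / (aAS - mAS)), a value in (0, 1]."""
    m_as = min(slacks)
    a_as = sum(slacks) / len(slacks)
    if a_as == m_as:  # all slacks equal: treat all sinks alike (assumption)
        return [1.0] * len(slacks)
    return [math.exp(alpha * (m_as - s) / (a_as - m_as)) for s in slacks]


def distance_matrix(sinks, alpha, beta):
    """Combine spatial, temporal, and polarity distances into one metric."""
    n = len(sinks)
    crit = criticalities([s[3] for s in sinks], alpha)
    s_dist = [[abs(a[0] - b[0]) + abs(a[1] - b[1]) for b in sinks] for a in sinks]
    s_diam = max(max(row) for row in s_dist) or 1.0  # guard against coincident sinks
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            t_dist = abs(crit[i] - crit[j])
            p_dist = abs(sinks[i][2] - sinks[j][2])
            d[i][j] = beta * s_dist[i][j] / s_diam + (1 - beta) * t_dist + p_dist
    return d


def k_center(d, k):
    """Farthest-first seeding; returns, for each sink, the index of its seed."""
    seeds = [0]  # arbitrary first seed (assumption)
    while len(seeds) < k:
        seeds.append(max(range(len(d)), key=lambda i: min(d[i][s] for s in seeds)))
    return [min(seeds, key=lambda s: d[i][s]) for i in range(len(d))]


# Achievable slacks of -1, 1, and 10 ns, as in the example in the text.
sinks = [(0, 0, 0, -1.0), (1, 0, 0, 1.0), (10, 0, 0, 10.0)]
d = distance_matrix(sinks, alpha=5.0, beta=0.0)  # temporal distance only
labels = k_center(d, k=2)
```

With these illustrative parameters, the exponential pushes the two sinks with positive slack toward zero criticality, so they end up in one cluster and the sink with negative slack forms its own cluster.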
The experimental results show that this technique often exhibits a good trade-off between runtime and quality of results (i.e., it provides good solutions on average in terms of both cost and delay while keeping runtime low). In addition, this method is not very complicated to implement. One should be aware that this algorithm is not designed to handle obstacles and design congestion in general, so results may not be very predictable in those scenarios.

27.2.2 BUFFER TREE TOPOLOGY GENERATION

A more recent work [6] also recognizes the problem of buffering fixed trees, together with the growing problem of design size, where millions of nets have to be optimized in a reasonable amount of time. This work presents a new algorithm for the generation of tree topologies that are buffer-friendly. The algorithm balances meeting the required signal arrival time constraints against minimizing wirelength.

Let us first explain the notion of the tree topology in this work (we will refer to it as a partially embedded tree topology). Figure 27.5a shows a partially embedded tree topology. It is a directed tree structure where each node except the root has only one input edge, each internal node has exactly two output edges, and the root has only one output edge. In addition, each node has an assigned placement location (placement overlap is allowed). However, the embeddings of the edges (i.e., routes) as well as the number of buffers and the buffer placements are not specified. An example of a completely embedded and buffered tree topology is given in Figure 27.5b.

Once the partially embedded topology tree is constructed, many of the known techniques can be used to perform two-pin routing and buffer insertion between the tree nodes (e.g., Refs. [14–16]). As opposed to the approach in Ref.
[3], subtree parities (i.e., signal polarities) are resolved locally because inverters are being used for buffering.

The algorithm proceeds in the following manner. First, sinks are ordered based on criticality (the most critical first). In a manner similar to Ref. [3], criticality estimation is based on estimated slack rather than relying only on the sink required arrival time. To estimate the delay from the driver to the sinks, a linear delay model is used (similar to Ref. [5]), augmented by estimated buffer intrinsic delays and loads. The assumption is that these paths are going to be buffered eventually, so the algorithm accounts for the delay that the path is going to have after buffering. In Ref. [6], some additional experiments are performed to justify this assumption, and the results show good correlation between estimates and final results.

When the ordering is complete, sinks are added to the topology one at a time (the initial topology consists of the driver and the most critical sink only). A single sink insertion is performed by examining all edges in the current topology and finding the closest tapping point within the bounding box of the edge terminals (note that the topology is partially embedded and all nodes have fixed placement locations). The edge for which the overall slack has the best value is chosen, and sink insertion is performed by breaking that edge and inserting a new internal node into the tree. The parent of the new node is the source of the chosen edge, and its children are the newly inserted sink and the destination node of the chosen edge. By keeping the arrival times at each topology node, a single sink insertion can be performed in linear time, giving an overall quadratic algorithm complexity (note that each operation is fairly simple, which leads to a very small execution time).

In addition, Ref.
[6] proves theoretical lower bounds on slack and wirelength in two extreme cases: sinks close to the driver and sinks having large noncritical required arrival times. Among the

FIGURE 27.5 (a) Partially embedded routing tree topology and (b) completely embedded and buffered tree.