Handbook of Algorithms for Physical Design Automation (part 92)

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C042 Finals Page 892 23-9-2008 #13

FIGURE 42.11 Clock hazards and timing constraints.

$FF_i$ and $FF_j$ as shown in Figure 42.11. Let $t_i$ and $t_j$ be the clock delays from the clock source to $FF_i$ and $FF_j$, respectively. Let $D_{ij}$ be the set of all combinational path delays from $FF_i$ to $FF_j$. Let $t_i^{clk2q}$ be the clock-to-Q delay for $FF_i$. Let $t_j^{setup}$ and $t_j^{hold}$ be the setup time and hold time for $FF_j$, respectively. Let $P$ be the clock period. The setup time and hold time constraints can be expressed as

\[ t_i + t_i^{clk2q} + \mathrm{MAX}(D_{ij}) + t_j^{setup} \le t_j + P \tag{42.2} \]

\[ t_i + t_i^{clk2q} + \mathrm{MIN}(D_{ij}) \ge t_j + t_j^{hold} \tag{42.3} \]

A clock schedule is a set of delays from the clock source to all registers in the synchronous system. The clock scheduling problem is to find a clock schedule $\{t_1, \ldots, t_N\}$ for all registers $FF_1, \ldots, FF_N$ to minimize the clock period $P$ while satisfying the constraints in Equations 42.2 and 42.3. This problem can be formulated as a linear program as follows [20]:

LP_SPEED: Minimize $P$ subject to

\[ t_j - t_i \ge t_j^{setup} + t_i^{clk2q} + \mathrm{MAX}(D_{ij}) - P \quad \text{for } i, j = 1, \ldots, N \]
\[ t_i - t_j \ge t_j^{hold} - t_i^{clk2q} - \mathrm{MIN}(D_{ij}) \quad \text{for } i, j = 1, \ldots, N \]
\[ t_i \ge \mathrm{MIN\_DELAY} \quad \text{for } i = 1, \ldots, N \]

Alternatively, we can find a clock schedule to maximize the minimum safety margin $M$ for a given clock period $P$. This problem can be formulated as a linear program as follows:

LP_SAFETY: Maximize $M$ subject to

\[ t_j - t_i \ge t_j^{setup} + t_i^{clk2q} + \mathrm{MAX}(D_{ij}) - P + M \quad \text{for } i, j = 1, \ldots, N \]
\[ t_i - t_j \ge t_j^{hold} - t_i^{clk2q} - \mathrm{MIN}(D_{ij}) + M \quad \text{for } i, j = 1, \ldots, N \]
\[ t_i \ge \mathrm{MIN\_DELAY} \quad \text{for } i = 1, \ldots, N \]

In both formulations, $\mathrm{MAX}(D_{ij}) = -\infty$ and $\mathrm{MIN}(D_{ij}) = \infty$ if there is no combinational path from $FF_i$ to $FF_j$.

After the clock schedule $S = \{t_1, \ldots, t_N\}$ is computed, the next step is to construct a clock network to realize the obtained schedule.
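As an illustration of LP_SPEED, the following is a minimal sketch that does not call an LP solver. It relies on the observation that, for a fixed $P$, Equations 42.2 and 42.3 are difference constraints on the $t_i$, so feasibility can be tested with Bellman-Ford and the minimum period found by binary search on $P$. This is a swapped-in technique, not necessarily the method of Ref. [20]. The function names, the `paths` dictionary layout, and the two-register example are all hypothetical. The MIN_DELAY bound is met by shifting the whole schedule, which leaves every difference unchanged.

```python
def feasible_schedule(n, paths, clk2q, setup, hold, period, min_delay=0.0):
    """Return clock delays meeting Eqs. 42.2 and 42.3 for `period`, or None.

    `paths` maps (i, j) to (MIN(D_ij), MAX(D_ij)); absent pairs have no path.
    """
    edges = []  # (u, v, w) encodes the difference constraint t[v] - t[u] <= w
    for (i, j), (d_min, d_max) in paths.items():
        edges.append((j, i, period - clk2q[i] - d_max - setup[j]))  # setup
        edges.append((i, j, clk2q[i] + d_min - hold[j]))            # hold
    t = [0.0] * n  # distances from a virtual source tied to every register
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if t[u] + w < t[v] - 1e-12:
                t[v] = t[u] + w
                changed = True
        if not changed:
            base = min(t)
            return [x - base + min_delay for x in t]
    return None  # still relaxing after n passes: negative cycle, infeasible


def min_period(n, paths, clk2q, setup, hold, iters=60):
    """Binary search for the smallest feasible period (None if none exists)."""
    worst_setup = max(clk2q[i] + d_max + setup[j]
                      for (i, j), (_, d_max) in paths.items())
    worst_hold = max(0.0, max(hold[j] - clk2q[i] - d_min
                              for (i, j), (d_min, _) in paths.items()))
    # At this period no cycle containing a setup edge can be negative,
    # so infeasibility here means the hold constraints alone conflict.
    lo, hi = 0.0, worst_setup + n * worst_hold
    best = feasible_schedule(n, paths, clk2q, setup, hold, hi)
    if best is None:
        return None
    for _ in range(iters):
        mid = (lo + hi) / 2
        t = feasible_schedule(n, paths, clk2q, setup, hold, mid)
        if t is None:
            lo = mid
        else:
            hi, best = mid, t
    return hi, best


# Hypothetical example: two registers in a loop with path delays 10 and 4.
example_paths = {(0, 1): (10.0, 10.0), (1, 0): (4.0, 4.0)}
zeros = [0.0, 0.0]
period, schedule = min_period(2, example_paths, zeros, zeros, zeros)
```

In this example a zero-skew design needs a period of 10, whereas delaying the clock of register 1 by 3 relative to register 0 lets both paths fit in a period of 7, the average delay around the loop.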
The DME algorithm in Section 42.2.4 can be easily extended to construct a clock network that realizes a given clock schedule. We only need to construct the merging segments to achieve the given skews instead of zero skews in the bottom-up phase of the DME algorithm. However, the solutions of the linear programs may not be unique. Each clock delay $t_i$ could be a range rather than a fixed value. In this case, the clock routing problem becomes the bounded-skew routing tree (BST) problem. In Ref. [21], Cong et al. proposed two algorithms, BME (boundary merging and embedding) and IME (interior merging and embedding), to handle this problem. These two algorithms extend the DME algorithm by finding a polygonal region based on the skew bounds, rather than a merging segment, to represent all possible locations for each tapping point.

Apart from the original formulations of clock scheduling, there are some other extensions. Neves and Friedman [22] formulated the process variation tolerant optimal clock skew scheduling problem. To better control the effects of process variations, they find the permissible range (i.e., the range of the clock skew without timing violation) for each local path, select a clock skew value that allows a maximum variation of skew within the permissible range, and finally determine the clock delay to each register. Recently, Ravindran et al. [23] discussed the multidomain clock skew scheduling problem. For a given number of clocking domains $n$ and a maximum permissible within-domain latency $\delta$, the multidomain range constraints require that all clock latencies must fit into $n$ value ranges $(l(d_i), l(d_i) + \delta)$ for $i = 1, \ldots, n$. The objective of multidomain clock skew scheduling is to determine domain phase shifts $l(d_i)$ and register latencies that satisfy the clock domain constraints and minimize the clock period.
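The permissible range mentioned above for a single local path follows from rearranging Equations 42.2 and 42.3 in terms of the skew $s_{ij} = t_i - t_j$. The sketch below computes that range and picks its midpoint, which leaves equal slack on the setup and hold sides. Choosing the midpoint is an illustrative assumption, not necessarily the exact selection rule of Ref. [22], and the sketch ignores the coupling between paths that share registers. All function names and numeric values are hypothetical.

```python
def permissible_range(period, d_min, d_max, clk2q_i, setup_j, hold_j):
    """Bounds on the skew s_ij = t_i - t_j implied by Eqs. 42.2 and 42.3.

    Returns (lower, upper), or None if no skew satisfies both constraints.
    """
    lower = hold_j - clk2q_i - d_min             # from the hold constraint (42.3)
    upper = period - clk2q_i - d_max - setup_j   # from the setup constraint (42.2)
    return (lower, upper) if lower <= upper else None


def centered_skew(bounds):
    """Midpoint of the range and the skew variation it tolerates on each side."""
    lower, upper = bounds
    return (lower + upper) / 2, (upper - lower) / 2


# Hypothetical local path: period 10, path delays between 2 and 7,
# clock-to-Q 1, setup 1, hold 0.5 (all in the same time unit).
bounds = permissible_range(10.0, 2.0, 7.0, 1.0, 1.0, 0.5)
skew, margin = centered_skew(bounds)
```

For these numbers the range is $[-2.5, 1]$, so the centered skew is $-0.75$ and the skew may drift by up to $1.75$ in either direction before a timing violation occurs.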
Finally, we briefly discuss two similar sequential optimization techniques, clock scheduling and retiming. They are, respectively, continuous and discrete optimizations with the same effect on minimizing the clock period [20]. The equivalence of the two techniques was studied in Ref. [24]. It is proved that there exists a retiming R to achieve clock period P if and only if there exists a clock schedule S with the same clock period. However, the practical use of retiming is limited for two reasons. First, retiming has an adverse impact on the verification methodology. Second, using retiming for maximum performance often causes a steep increase in the number of registers. Clock scheduling does not have these two limitations. Another advantage of clock scheduling is that, because retiming can only move registers across discrete amounts of logic delay, the resulting system after retiming can still benefit from clock scheduling.

42.5 HANDLING VARIABILITY

In minimizing skew sensitivity to process variations, two guiding principles are that the network should be as symmetrical and as fast as possible. In a clock network designed and laid out symmetrically, chipwide process (or environmental) variations should affect all clock paths identically. An additional advantage is that any systematic skew caused by modeling errors is eliminated by symmetry. In a fast network, as the clock phase delay is small, any fractional variations in delay lead only to a modest amount of skew. In addition, a clock network with optimal delay is the most tolerant to process variations. At the optimal delay point with respect to a certain parameter, the delay sensitivity over that parameter (i.e., the slope of the delay function) should be zero. However, it is not trivial to apply these two principles in practice. Because of uneven load distribution and routing/buffer obstacles, it is usually impossible to construct a completely symmetrical network.
Moreover, minimizing the network delay may conflict with the optimization of other metrics (e.g., skew, area, and power). Several important works on reliable clock network design under process variations are discussed below.

The concept of delay sensitivity is very useful in considering process variations. Pullela et al. [25] first made use of delay sensitivity with respect to wire width variations to improve the delay, skew, and skew sensitivity of a given clock tree by wire width optimization. The Elmore delay model and the L-type RC model for each branch are used in the paper, but the concept can be generalized to other models. Let $R_j$ be the resistance, $C_j$ be the capacitance, and $C_j^d$ be the downstream capacitance of branch $j$. Let $U(i)$ be the set of all branches on the path from sink $i$ to the root. Then the Elmore delay from the root to sink $i$ is

\[ T_i^d = \sum_{j \in U(i)} R_j C_j^d . \]

Therefore, the sensitivities of the Elmore delay of sink $i$ with respect to circuit parameters $C_j$ and $R_j$ are

\[ \frac{\partial T_i^d}{\partial C_j} = R_{ij}^c \tag{42.4} \]

\[ \frac{\partial T_i^d}{\partial R_j} = \begin{cases} C_j^d & \text{if } j \in U(i) \\ 0 & \text{otherwise} \end{cases} \tag{42.5} \]

where $R_{ij}^c$ is the total resistance along the common path from sink $i$ to the root and branch $j$ to the root. $R_j$ and $C_j$ can be expressed as functions of the width $w_j$ of branch $j$ as $R_j = R_0 L_j / w_j$ and $C_j = C_a L_j w_j + C_f L_j$, where $R_0$, $C_a$, and $C_f$ are technology parameters independent of $w_j$, and $L_j$ is the length of branch $j$. Therefore, the delay sensitivity of sink $i$ to width $w_j$ is

\[ \frac{\partial T_i^d}{\partial w_j} = \frac{\partial T_i^d}{\partial C_j} \frac{\partial C_j}{\partial w_j} + \frac{\partial T_i^d}{\partial R_j} \frac{\partial R_j}{\partial w_j} = \frac{\partial T_i^d}{\partial C_j} C_a L_j - \frac{\partial T_i^d}{\partial R_j} \frac{R_0 L_j}{w_j^2} \]

By incremental computation as described in Ref. [26], Equations 42.4 and 42.5 for all $i$ and $j$ can be computed in $O(n^2)$ time for a tree with $n$ sinks.
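The sensitivity expressions above can be checked numerically on a small tree. The following is a minimal sketch, assuming the L-type model places each branch capacitance $C_j$ at the downstream end of its resistance $R_j$, so that $C_j^d$ includes $C_j$ itself. The tree encoding (a `parent` array with node 0 as the root), the function names, and all numeric values are hypothetical. The sketch evaluates one sensitivity at a time and does not implement the incremental $O(n^2)$ scheme of Ref. [26].

```python
def branch_rc(parent, length, width, load, r0, ca, cf):
    """Return (res, down): branch resistances R_j and downstream caps C^d_j.

    Branch j (j >= 1) joins node j to parent[j]; node 0 is the root.
    """
    n = len(parent)
    res = [0.0] * n
    down = list(load)  # start from the sink load attached at each node
    for j in range(1, n):
        res[j] = r0 * length[j] / width[j]
        down[j] += ca * length[j] * width[j] + cf * length[j]  # add C_j
    for j in range(n - 1, 0, -1):  # children are indexed after their parents
        down[parent[j]] += down[j]
    return res, down


def path_to_root(parent, k):
    """Branches on the path from node k up to the root (node 0)."""
    path = []
    while k != 0:
        path.append(k)
        k = parent[k]
    return path


def elmore_delay(parent, length, width, load, sink, r0, ca, cf):
    """Elmore delay from the root to `sink`: sum of R_j * C^d_j over U(sink)."""
    res, down = branch_rc(parent, length, width, load, r0, ca, cf)
    return sum(res[j] * down[j] for j in path_to_root(parent, sink))


def width_sensitivity(parent, length, width, load, sink, j, r0, ca, cf):
    """dT_sink/dw_j via Eqs. 42.4 and 42.5 and the chain rule."""
    res, down = branch_rc(parent, length, width, load, r0, ca, cf)
    u_sink = set(path_to_root(parent, sink))
    # Eq. 42.4: resistance shared by the root-to-sink and root-to-branch-j paths.
    dt_dc = sum(res[k] for k in path_to_root(parent, j) if k in u_sink)
    # Eq. 42.5: downstream capacitance of j if j lies on the sink's path.
    dt_dr = down[j] if j in u_sink else 0.0
    return dt_dc * ca * length[j] - dt_dr * r0 * length[j] / width[j] ** 2


# Hypothetical tree: root 0 -> node 1 -> sinks 2 and 3 (arbitrary units).
parent = [0, 0, 1, 1]
length = [0.0, 100.0, 50.0, 80.0]
width = [1.0, 2.0, 1.0, 1.0]
load = [0.0, 0.0, 5.0, 7.0]
R0, CA, CF = 0.1, 0.02, 0.01
s_trunk = width_sensitivity(parent, length, width, load, 2, 1, R0, CA, CF)
s_sibling = width_sensitivity(parent, length, width, load, 2, 3, R0, CA, CF)
```

In this example, widening the shared trunk (branch 1) reduces the delay to sink 2, so `s_trunk` is negative, while widening the sibling branch 3 only adds capacitance seen through the trunk, so `s_sibling` is positive.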
The delay sensitivities for all sinks to all branch widths can therefore also be found in $O(n^2)$ time.* In Ref. [25], a greedy heuristic is proposed to iteratively increase the widths to improve delay, skew, and skew sensitivity. The selection of the branch to widen in each step is based on the delay sensitivities, which give the delay change of each sink when widening a branch. In particular, they argued that wire widening is a better method for delay balancing than wire elongating, as widening generally reduces skew sensitivity but elongating increases it.

Lu et al. [27] formulated the minimizing skew violation (MinSV) problem to construct a clock tree considering wire width variation due to process variations. Given the range of permissible skew for each pair of clock sinks, they tried to find a clock routing tree such that the maximum skew violation among all pairs of sinks is minimized under wire width variation. The way they construct the tree follows the framework of the DME algorithm. Because of wire width variation, the skew between a sink pair becomes a range rather than a unique value. To maximize the safety margin due to process variations, in the bottom-up stage, they chose the merging segment for the tapping point such that the center of the skew range of the most critical sink pair coincides with the center of the permissible range for this sink pair. Besides improving process variation tolerance, they also proposed an algorithm to minimize wirelength when there is no skew violation under wire width variation.

Recently, Rajaram et al. [28] proposed to insert cross links in a given clock tree to improve its skew sensitivity. Like the grid and the spine structures, the cross links equalize the delay of different points by connecting them together. Such an approach can tolerate both process and environmental variations.
Moreover, because the cross links are selectively inserted based on the trade-off between skew sensitivity reduction and extra wire usage, this approach can achieve significant skew sensitivity reduction with little increase in wirelength. The link insertion algorithm is improved in Ref. [29].

* In [25], an $O(n^2 \log n)$ algorithm by adjoint analysis is proposed.

REFERENCES

1. J. M. Rabaey, A. Chandrakasan, and B. Nikolić. Digital Integrated Circuits: A Design Perspective, 2nd edn. Prentice Hall, 2003.
2. H. Veendrick. Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits. IEEE Journal of Solid-State Circuits, SC-19:468–473, August 1984.
3. P. J. Restle, T. G. McNamara, D. A. Webber, P. J. Camporese, K. F. Eng, K. A. Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petronick, B. L. Krauter, and B. D. McCredie. A clock distribution network for microprocessors. IEEE Journal of Solid-State Circuits, 36(5):792–799, May 2001.
4. M. Gowan, L. Biro, and D. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 433–439, 1998.
5. D. R. Gonzales. Micro-RISC architecture for the wireless market. IEEE Micro, 19(4):30–37, 1999.
6. D. E. Duarte, N. Vijaykrishnan, and M. J. Irwin. A clock power model to evaluate impact of architectural and technology optimizations. IEEE Transactions on Very Large Scale Integration Systems, 10(6):844–855, December 2002.
7. M. A. B. Jackson, A. Srinivasan, and E. S. Kuh. Clock routing for high-performance ICs. In Proceedings of the ACM/IEEE Design Automation Conference, Orlando, FL, pp. 573–579, 1990.
8. J. Cong, A. B. Kahng, and G. Robins. Matching-based methods for high-performance clock routing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(8):1157–1169, August 1993 (DAC 1991).
9. R.-S. Tsay. An exact zero-skew clock routing algorithm. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(2):242–249, February 1993 (ICCAD 1991).
10. W. C. Elmore. The transient response of damped linear network with particular regard to wideband amplifiers. Journal of Applied Physics, 19:55–63, 1948.
11. M. Edahiro. Minimum skew and minimum path length routing in VLSI layout design. NEC Research and Development, 32(4):569–575, 1991.
12. T.-H. Chao, Y.-C. Hsu, and J.-M. Ho. Zero skew clock net routing. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 518–523, 1992.
13. K. D. Boese and A. B. Kahng. Zero-skew clock routing trees with minimum wirelength. In Proceedings of the IEEE International ASIC Conference, Rochester, NY, pp. 1.1.1–1.1.5, September 1992.
14. J. G. Xi and W. W.-M. Dai. Buffer insertion and sizing under process variations for low power clock distribution. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 491–496, 1995.
15. A. Vittal and M. Marek-Sadowska. Low-power buffered clock tree design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 965–975, September 1997 (DAC 1995).
16. S. Pullela, N. Menezes, J. Omar, and L. T. Pillage. Skew and delay optimization for reliable buffered clock trees. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 556–562, 1993.
17. B. J. Benschneider, A. J. Black, W. J. Bowhill, S. M. Britton, D. E. Dever, D. R. Donchin, R. J. Dupcak, R. M. Fromm, M. K. Gowan, P. E. Gronowski, M. Kantrowitz, M. E. Lamere, S. Mehta, J. E. Meyer, R. O. Mueller, A. Olesin, R. P. Preston, D. A. Priore, S. Santhanam, M. J. Smith, and G. M. Wolrich. A 300-MHz 64-b quad-issue CMOS RISC microprocessor. IEEE Journal of Solid-State Circuits, 30(11):1203–1214, November 1995 (ISSCC 1995).
18. Shen Lin and C. K. Wong. Process-variation-tolerant clock skew minimization. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 284–288, 1994.
19. H. Su and S. S. Sapatnekar. Hybrid structured clock network construction. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 333–336, 2001.
20. J. P. Fishburn. Clock skew optimization. IEEE Transactions on Computers, 39(7):945–951, July 1990.
21. J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. A. Tsao. Bounded-skew clock and Steiner routing. ACM Transactions on Design Automation of Electronics Systems, 3(3):341–388, 1998 (ICCAD 1995).
22. J. L. Neves and E. G. Friedman. Optimal clock skew scheduling tolerant to process variations. In Proceedings of the ACM/IEEE Design Automation Conference, Las Vegas, NV, pp. 623–628, 1996.
23. K. Ravindran, A. Kuehlmann, and E. Sentovich. Multi-domain clock skew scheduling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 801–808, 2003.
24. L.-F. Chao and E. H.-M. Sha. Retiming and clock skew for synchronous systems. In Proceedings of the IEEE International Symposium on Circuits and Systems, London, England, pp. 283–286, 1994.
25. S. Pullela, N. Menezes, and L. T. Pillage. Reliable non-zero skew clock trees using wire width optimization. In Proceedings of the ACM/IEEE Design Automation Conference, Dallas, TX, pp. 165–170, 1993.
26. C.-P. Chen and D. F. Wong. A fast algorithm for optimal wire-sizing under Elmore delay model. In Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 4, Atlanta, GA, pp. 412–415, 1996.
27. B. Lu, J. Hu, G. Ellis, and H. Su. Process variation aware clock tree routing. In Proceedings of the International Symposium on Physical Design, Monterey, CA, pp. 174–181, 2003.
28. A. Rajaram, J. Hu, and R. Mahapatra. Reducing clock skew variability via cross links. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 18–23, 2004.
29. A. Rajaram, D. Z. Pan, and J. Hu. Improved algorithms for link-based non-tree clock networks for skew variability reduction. In Proceedings of the International Symposium on Physical Design, San Francisco, CA, pp. 55–62, 2005.

43 Practical Issues in Clock Network Design

Chris Chu and Min Pan

CONTENTS
43.1 IBM S/390
43.2 IBM Power4
43.3 Alpha 21264
43.4 Intel Pentium II
43.5 Intel Pentium III
43.6 Intel Pentium 4
43.7 Intel Itanium
43.8 Intel Itanium 2
References

In this chapter, we present the clock network designs of several high-performance microprocessors to illustrate how the basic techniques presented in Chapter 42 are applied in practice. We focus on the clock network design of high-performance microprocessors as the stringent slew requirements make the design most challenging. Some useful discussions on practical issues in clock network design can also be found in Bindal and Friedman [1], Zhu [2], and Rusu [3].
| Section | Processor | Year/Reference | Process (nm) | Clock Frequency (MHz) | Area (mm^2) | Number of Transistors (M) | Main Clock Topology | Deskew | Skew (ps) |
|---|---|---|---|---|---|---|---|---|---|
| 43.1 | IBM S/390 | 1997 [4] | 200 (L_eff) | 400 | 300 | 7.8 | Tree | No | 30 |
| 43.2 | IBM Power4 | 2002 [5] | 180 SOI | 1300 | | 174 | Tree driving single grid | No | 25 |
| 43.3 | Alpha 21264 | 1998 [6] | 350 | 600 | | 15.2 | Hierarchical grids | No | 65 |
| 43.4 | Pentium II | 1997 [7] | 350 | 300 | 203 | 7.5 | 1 spine | No | 140 |
| 43.5 | Pentium III | 1999 [8] | 250 | 650 | 123 | 9.5 | 2 spines | Active | 15 |
| 43.6 | Pentium 4 | 2001 [9] | 180 | 2000 | 217 | 42 | 3 spines | Active | 16 |
| | | 2003 [10] | 90 | Up to 5000 | 109 | | 8 spines | | 10 |
| 43.7 | Itanium | 2000 [11] | 180 | 800 | | 25.4 | Tree driving grids | Active | 28 |
| 43.8 | Itanium 2 | 2002 [12] | 180 | 1000 | 421 | 25 | Tree driving trees | No | 62 |
| | | 2003 [13] | 130 | 1500 | 374 | 410 | Tree driving trees | Fuse based | 24 |
| | (dual core) | 2005 [14] | 90 | 100–2500 | 596 | 1720 | Hierarchical trees | Active | 10 |

The processors discussed in this chapter are summarized in the table above. (Some entries are left blank because the corresponding information cannot be found.)

43.1 IBM S/390

The design of a 400-MHz microprocessor for the IBM S/390 Enterprise Server Generation-4 system is described in Ref. [4]. The chip is fabricated in a 0.2-µm L_eff CMOS technology with five layers of metal and tungsten local interconnect. The power supply is 2.5 V. The chip size is 17.35 mm × 17.30 mm with about 7.8 million transistors.

The clock distribution network uses a balanced tree design, which is suitable for the relatively low clock frequency. A single-phase clock is distributed from a phase-locked loop (PLL)/central clock buffer located near the center of the chip to all the latches inside the macros in three levels of hierarchy. The first two levels of clock distribution are in the form of balanced H-like trees, using primarily the top two metal layers.
The first-level tree routes the global clock from the central clock buffer to nine sector buffers, as shown in Figure 43.1. The sector buffers repower the clock to all macros inside the sectors. There are 580 macro clock pins in the whole design.

FIGURE 43.1 First-level tree of the IBM S/390 clock distribution network. (From Welb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.)

The clock propagation delay along the tree is balanced against macro input capacitance and RLC characteristics of the tree wires. Horizontal wiring of each tree is in low-resistance Metal 5 (M5) (with 4.8-µm pitch). At various places along the tree, inductive coupling is reduced and the return path is improved by using power wires for shielding. Decoupling capacitors are incorporated into central and sector buffers to reduce delta-I noise.

A clock wiring methodology was developed with custom routing and timing computer-aided design (CAD) tools. The detailed routing as well as the widths of all clock wires were optimized to minimize skew, mean delay, power, wiring tracks, and sensitivity to process variations. Three-dimensional (3D) modeling was performed using a full-wave electromagnetic field solver [15], and distributed RLC modeling was used for virtually every wire in all the trees during the design and tuning/optimization process [16].
A number of cases were analyzed, and the results were used to generate a combination of analytic models and lookup tables containing distributed RLC parameters for all clock geometries used. Each wire segment was represented by an equivalent circuit consisting of up to six RLC π-segments. Extensive simulations and wire width tuning [17] were done to guarantee low clock skew at macro pins. Typical simulated RLC delay of the first-level tree is 300 ps with 20 ps skew at the sector buffers. The sector buffer delay is 230 ps. Typical simulated RLC delay within sectors is 210 ps with 30 ps skew at the macros.

The last level of clock distribution is local to each macro. Figure 43.2 shows the clocking scheme within macros. From the macro pin, the clocks are wired to clock blocks. The overall target skew for this wire is under 20 ps. For large area macros, multiple clock pins were used to reduce wirelength to clock blocks. The clock block generates local clocks that drive latches. The target skew for local clocks is under 50 ps. All macrolevel wiring is done by hand for custom macros or with a place and route tool for synthesized macros. For synthesized macros that had many latches, and therefore multiple clock blocks, a clock optimization tool was used that reassigned latches to clock blocks based on cell placement. This resulted in clock blocks driving latches that were placed closest to them. Macro layouts were extracted for R and C parasitics, and the extracted netlists were used to time the macros. This means that any skew in the last level of clock distribution was captured in that macro's timing abstraction.

Figure 43.3 shows the measured waveforms of the central clock buffer output and clocks at ten points of the 580 macro pin locations (marked on Figure 43.1) driven by the second-level clock tree. The measurement was performed using a novel electron-beam prober with a 20-ps time resolution on the top wiring layer.
Because the chip was powered using a standard cantilever probe card in the electron-beam prober, the chip clock was run at low frequency to reduce power supply noise. Power supply noise during these measurements was measured to be less than 100 mV. The results indicate a mean delay of 740 ps and less than 30 ps skew from the central clock buffer to the macro pins.

FIGURE 43.2 Last/macrolevel clock distribution of IBM S/390. (From Welb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.)

FIGURE 43.3 Electron beam measured clock waveforms at macro pin locations marked on Figure 43.1. (From Welb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.)

43.2 IBM POWER4

The clock distribution of a 1.3-GHz Power4 microprocessor is described in Refs. [5,18]. The chip is fabricated in the IBM 0.18-µm CMOS 8S3 SOI (silicon-on-insulator) technology with seven levels of copper wiring. It has 174 million transistors. The power supply is 1.6 V.

The microprocessor uses a single chip-wide clock domain, with no active or programmable skew-reduction circuitry. Having multiple domains would allow active/programmable deskewing and coarse clock gating, and could result in lower skew within each small domain. Inevitably, however, with multiple domains there is increased skew and uncertainty between domains. In addition, multiple clock domains complicate early- and late-mode timings, and degrade critical paths that cross multiple domain boundaries.
Extensive simulations of the Power4 chip and test-chip hardware measurements support the simplifying decision to maintain a single-domain global clock grid for the entire chip, with no programmable or active deskewing.

The global clock distribution strategy is based on a topology using a number of tuned trees driving a single full-chip clock grid [19]. This strategy is developed with the goal of being applicable to a variety of high-performance server microprocessors. It has been previously used in three S/390 chips and three PowerPC chips [19]. The trees-driving-grid topology combines many of the advantages of both trees and grids. Trees have low latency, low power, minimal wiring track usage, and the potential for very low skew. However, without the grid, trees must often be rerouted whenever the locations of clock pins change, or when the load capacitance values change significantly. The grid provides a constant structure so that the trees and the grids they are driving can be designed early to distribute the clock near every location where it may be needed. The regular grid also allows simple regular tree structures. This is important as it facilitates the design of carefully designed transmission line structures with well-controlled capacitance and inductance. The grid reduces local skew by connecting nearby points directly. The tree wires are then tuned to minimize skew over longer distances.

The global clock distribution network of the 1.3-GHz Power4 chip is illustrated in Figure 43.4 using a 3D visualization showing all wire and buffer delays. In the network, a PLL near the center of the chip drives buffered H-trees, which are designed as symmetrically as possible. The H-trees drive the final set of 64 carefully placed sector buffers. Each sector buffer drives a tunable sector tree network, designed for minimum delay without length matching. These sector trees are tuned primarily by wire-width tuning. Then they all drive a single full-chip clock grid at 1024 evenly spaced points. From the global clock grid, a hierarchy of short clock routes completed the connection from the grid down to the individual local clock buffer inputs in the macros. There are 15,200 global clock pins.

FIGURE 43.4 3D visualization of the Power4 global clock distribution, plotting delay (ps) over chip position (X, Y) for buffer levels 1 to 3, the level-4 sector buffers, the tuned sector trees, and the grid. (From Restle, P.J. et al., Proc. IEEE Intl. Solid-State Circuits Conf., 2002, pp. 144–145. With permission.)

It is reported in Ref. [5] that the maximum skew measured at 19 places with picoprobes is 25 ps, and the maximum skew by picosecond imaging for circuit analysis (PICA) measurements from nine sector buffers is less than 18 ps.

43.3 ALPHA 21264

The clocking design of a 600-MHz Alpha 21264 microprocessor is presented in Ref. [6]. The chip is fabricated in a 0.35-µm CMOS process with six metal layers. Four metal layers (called M1 to M4) are for signals, one (between M2 and M3) is for a V_SS reference plane, and one (above M4) is for a V_DD reference plane. It has 15.2 million transistors.

This microprocessor employs a hierarchical clock distribution scheme as illustrated in Figure 43.5. At the top level, there is a global clock grid called GCLK, which covers the entire die. Next, there are six major clock grids over certain execution units. At the bottom level, local clocks are generated as needed from any clock (global clock, major clocks, or other local clocks).
Previous Alpha microprocessors use a single grid to distribute the global clock signal [20,21]. The hierarchical scheme is chosen for this microprocessor because of tighter skew constraints, the importance of clock power minimization, and the need for a flexible clocking methodology to solve local timing problems. The drawback is that skew management becomes much more complicated. State elements and clocking points exist from 0 to 8 stages past GCLK. The clock distribution network needs to be carefully designed based on rigorous and thorough timing verification.

The GCLK grid is driven by a global clock distribution network as shown in Figure 43.6. The network connects a PLL located in a corner of the chip to 16 distributed global clock drivers. The arrangement of global clock drivers, which resembles four windowpanes, achieves low skew by dividing the chip into regions, thus reducing the maximum distance from the drivers to the farthest loads. A windowpane arrangement also reduces sensitivity to process variation because each grid pane is redundantly driven from four sides. In general, distributing the drivers widely across the chip
