Handbook of algorithms for physical design automation part 93 pps

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C043 Finals Page 902 24-9-2008 #7 902 Handbook of Algorithms for Physical Design Automation
Practical Issues in Clock Network Design

FIGURE 43.5 Alpha 21264 clock hierarchy. [Diagram labels: external clock, PLL, GCLK grid, major clock grid, local clocks, conditioned local clocks, state elements.] (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)

FIGURE 43.6 Global clock distribution network of Alpha 21264. [Diagram label: PLL.] (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)

also helps power-supply and heat-dissipation problems. The GCLK grid is shown in Figure 43.7. It traverses the entire die and uses 3 percent of M3 and M4. All clock interconnect is laterally shielded with either VDD or VSS. All clock wires and all lateral shields are manually placed. The measured GCLK skew is 65 ps running at 0°C ambient and 2.2 V.

FIGURE 43.7 GCLK of Alpha 21264. (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)

The six major clocks are two gain stages past GCLK with grids juxtaposed with GCLK, but shielded from it. The major clock grids are shown in Figure 43.8. Because of the wide variation of clock loads, the grid density varies widely between major clocks, and sometimes even for a single major clock. The densest areas use up to 6 percent of M3 and M4. Major clocks driven by a gridded global clock substantially reduce power because major clock drivers are localized to the clock loads and major clock grids are locally sized to meet the skew targets. A gridded global clock without major clocks would require larger drivers and a denser grid to deliver the same clock skew and edges.
Major clocks are designed so that the delay from GCLK is centered at 300 ps. The target specification for skew is ±50 ps. The target specification for 10–90 percent rise and fall times is less than 320 ps. All major clocks easily meet both sets of objectives.

FIGURE 43.8 Six major clock grids of Alpha 21264. [Panel labels: PCLK, ECLK, JCLK, CCLK, MCLK, FCLK.] (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)

Local clocks are generally neither gridded nor shielded. There are no strict limits on the number, size, or logic function of local-clock buffers, and there is no duty-cycle requirement, although timing path constraints must always be met. Local clocks have permitted ranges for clock rise and fall times, but with only this restriction there is considerable design freedom. This freedom facilitates the implementation of clock gating to reduce power and of clock skew scheduling to improve performance.

Because rather dense grid structures are required to meet the aggressive skew targets, the clock power consumption is very significant. At 600 MHz and 2.2 V, typical power usage for the processor is 72 W. The complete distribution network that drives GCLK uses 5.8 W, and GCLK uses 10.2 W. The major clocks use 14.0 W. Local unconditional clocks use 7.6 W, and local conditional clocks use a maximum of 15.6 W, assuming they switch every cycle.

The clock distribution network design for a 1.2-GHz Alpha 21364 microprocessor can be found in Ref. [22]. We choose not to include the details here as Compaq, which acquired DEC in 1998, decided to phase out Alpha in 2001.

43.4 INTEL PENTIUM II

The clock distribution network design for a 300-MHz Intel Pentium II microprocessor is presented in Refs. [7,23]. The chip is fabricated in a 0.35-µm CMOS process with four metal layers.
The power supply is 2.8 V. The chip has 7.5 million transistors and the die area is 203 mm². This processor uses a single-spine scheme to distribute the global clock, as shown in Figure 43.9. The spine is driven by a balanced tree with five levels of buffers. The global clock is distributed to all units in M4. The measured skew is also shown in Figure 43.9. The skew across the M4 global distribution is 140 ps. The low skew is achieved by balancing the load of each global clock tap and adjusting the global clock track length.

FIGURE 43.9 Global clock distribution network of Pentium II with electron-beam-measured skew. SK is the skew relative to the feedback point from the local buffer. [Diagram labels: five-level driver for 500 pF load with M4 metal strapping ring; input point to local buffers with clock gating; measured SK = −564, −476, −488, −592, −460, −424, and −548 ps.] (From Young, I.A., Mar, M.F., and Bushan, B., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 330–331, 1997. With permission.)

43.5 INTEL PENTIUM III

The design of an Intel Pentium III microprocessor is presented in Ref. [8]. This chip has an operating voltage of 1.4–2.2 V and runs at up to 650 MHz. It is fabricated in a 0.25-µm CMOS process with five metal layers. It has 9.5 million transistors and the chip size is 10.17 mm × 12.10 mm.

This processor uses a two-spine scheme for global clock distribution. A two-spine clock block diagram is shown in Figure 43.10. The two spines are shielded so that they are not affected by fringing fields from any interconnects associated with the core or the I/O sides.

The two-spine scheme has many benefits over a single-spine approach. First, the serpentine wires can be shortened, and hence power consumption can be reduced.
Second, power distribution to the clock subsystem becomes easier as the clock power demand is more spread out. Third, shielding of the clock network is also easier as shields are more readily available on the sides than in the center. Fourth, routing congestion can be improved because there is no center spine running through the center of the chip, which is typically the most congested area.

Skew minimization between the two spines is a major challenge. Because of the lengthy left and right clock spines with multiple tap points, it was very difficult to match the delays with good accuracy. In addition to precision capacitance matching techniques on the global clock tree, an adaptive digital deskewing technique based on a delay-locked loop (DLL) was employed [24]. The deskewing circuit is composed of delay lines to both spines, a phase detection circuit, and a controller (Figure 43.10). The phase detection circuit determines the phase relationship between the two spines and generates an output accordingly. The controller takes the phase detection information and makes a discrete adjustment to one of the delay lines. The digital delay line is implemented with two inverters in series. Each inverter has a bank of eight capacitive loads connected to its output. The addition or removal of the capacitive loads is controlled by the delay shift register. This allows 17 monotonic discrete steps of delay (from zero to 16 loads connected). Latency from sampling the clocks to making an adjustment to the delay lines is just over three cycles. Note that this DLL-based deskewing scheme compensates not only for interconnect/device mismatch but also for process, voltage, and temperature variations. Adaptive deskewing helped to reduce the left-to-right clock spine skew from 100 to 15 ps.

43.6 INTEL PENTIUM 4

The clocking scheme of a 2-GHz Intel Pentium 4 microprocessor is presented in Ref. [9]. The chip is fabricated in a 0.18-µm CMOS process with six metal layers.
The chip has 42 million transistors and the die area is 217 mm².

FIGURE 43.10 Block diagram for two-spine global clock distribution of Pentium III. [Diagram labels: core, PD, clk_Gen, delay lines, delay SRs, deskew ctl, X clk, FB clk, left spine, right spine.] (From Senthinathan, R., Fischer, S., Rangchi, H., and Yazdanmehr, H., IEEE J. Solid-State Circuits, 34, 1454, 1999. With permission.)

FIGURE 43.11 Three spines in a 0.18-µm Pentium 4. (From Kurd, N., Barkatullah, J., and Dizon, R., IEEE J. Solid-State Circuits, 36, 1647, 2001. With permission.)

To cover the large Pentium 4 die, its global clock distribution uses three spines, as shown in Figure 43.11. A modified buffered binary tree is used to distribute the global clock from the clock generator to the spines. Then 47 domain buffers are driven, producing 47 independent clock domains (Figure 43.12). Domain buffers can be disabled to power down large functional units to save power.

The clock distribution network includes a static skew optimization capability to correct systematic skew (caused by asymmetric layout or within-die process variation) as well as to provide intentional skew. Each domain buffer consists of a programmable delay stage controlled by a 5-bit domain deskew register (DDR) that determines the edge timing of the domain clock. The values of the DDRs can be set according to phase information obtained by a phase-detector network of 46 phase detectors.

FIGURE 43.12 Global clock distribution in Pentium 4. [Diagram label: from PLL.] (From Kurd, N., Barkatullah, J., and Dizon, R., IEEE J. Solid-State Circuits, 36, 1647, 2001. With permission.)
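
The static deskew described above can be illustrated with a short Python sketch. It is only a toy model under stated assumptions, not the Pentium 4 procedure: the delay added per DDR code (`STEP_PS`), the function names, and the sample arrival times are all hypothetical. The source states only that each domain buffer has a 5-bit DDR and that phase detectors supply the phase information.

```python
"""Toy model: pick 5-bit domain deskew register (DDR) codes from measured arrivals."""

DDR_BITS = 5                     # from the text: 5-bit DDR per domain buffer
MAX_CODE = (1 << DDR_BITS) - 1   # largest code, 31
STEP_PS = 8.0                    # ASSUMPTION: delay per code step in ps (not given in the text)


def choose_ddr_codes(arrival_ps, intentional_skew_ps=None):
    """Return one DDR code per domain.

    arrival_ps: clock arrival time of each domain in ps with its DDR at 0
                (hypothetical stand-in for the phase-detector readings).
    intentional_skew_ps: optional per-domain offset in ps (positive = later edge).
    """
    if intentional_skew_ps is None:
        intentional_skew_ps = [0.0] * len(arrival_ps)
    # The delay stage can only add delay, so align everything to the latest
    # arrival after subtracting the requested intentional skew.
    adjusted = [a - s for a, s in zip(arrival_ps, intentional_skew_ps)]
    latest = max(adjusted)
    codes = []
    for x in adjusted:
        code = round((latest - x) / STEP_PS)        # nearest discrete step
        codes.append(min(max(code, 0), MAX_CODE))   # clamp to the 5-bit range
    return codes


def residual_skew(arrival_ps, codes):
    """Max-minus-min arrival time in ps after applying the DDR codes."""
    after = [a + c * STEP_PS for a, c in zip(arrival_ps, codes)]
    return max(after) - min(after)


if __name__ == "__main__":
    arrivals = [0.0, 18.0, 41.0, 64.0]   # hypothetical measurements, ps
    codes = choose_ddr_codes(arrivals)
    print("codes:", codes)
    print("skew before:", max(arrivals) - min(arrivals), "ps")
    print("skew after: ", residual_skew(arrivals, codes), "ps")
```

With these made-up numbers the residual skew is bounded by the quantization of the delay step, which is why the achievable skew depends on the DDR resolution as well as on the accuracy of the phase detectors.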
This deskewing scheme can reduce interdomain skew from 64 to about 16 ps. A major component of the clock distribution jitter is due to supply noise from logic switching. To reduce supply-noise-induced jitter, an RC-filtered power supply is used for the global clock drivers.

FIGURE 43.13 Eight stripes (i.e., spines) in a 90-nm Pentium 4. [Diagram labels: clock stripes; die dimensions 10.2 mm and 10.7 mm.] (From Bindal, N. et al., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 346–498, 2003.)

The clock distribution design for a next-generation Pentium 4 microprocessor that scales to 5 GHz is described in Ref. [10]. The chip is implemented in a 1.2-V, 90-nm dual-Vt process with seven metal layers. The die size is 10.2 mm × 10.7 mm. The clock network consists of a pre-global clock network (PGCN), a global clock grid (GCG), and local clocking. The PGCN comprises 12 inversion stages from the PLL to the die center, and 15 stages to the input of more than 1400 GCG drivers. It has a tree structure with strategic shorting of inputs to adjacent receivers within a stage to eliminate skew accumulation over multiple stages because of random variations. Shorting of adjacent receivers provides a very gradual clock skew gradient at the input to adjacent GCG drivers. The GCG consists of eight spines spaced roughly 1200 µm apart, as shown in Figure 43.13. The local scheme consists of two stages of gated buffering. The first stage is used for reducing power consumption through clock gating. The second stage is reserved for functional gating. The design achieves less than 10 ps of global clock skew. The final grid stage and its driver dissipate 1.75 W/GHz in addition to 0.75 W/GHz in the PGCN. Overall die area allocation ranges from 0.25 percent for devices and lower metals, to less than 2, 3, and 5 percent for M5, M6, and M7 layers, respectively.
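
The per-gigahertz power figures and the spine spacing quoted above lend themselves to a quick back-of-the-envelope check. The Python sketch below adds the two quoted figures and scales them with frequency, and computes the distance covered by eight spines at the quoted pitch. The linear scaling with frequency (fixed supply voltage and switched capacitance) and the function names are assumptions made for illustration, not results reported in Ref. [10].

```python
"""Back-of-the-envelope numbers for the 90-nm Pentium 4 global clock network."""

GRID_W_PER_GHZ = 1.75    # final grid stage and its driver (from the text)
PGCN_W_PER_GHZ = 0.75    # pre-global clock network (from the text)

NUM_SPINES = 8           # from the text
SPINE_PITCH_UM = 1200.0  # "roughly 1200 µm" in the text


def global_clock_power_w(freq_ghz):
    """Global clock power in watts, ASSUMING it scales linearly with frequency."""
    return (GRID_W_PER_GHZ + PGCN_W_PER_GHZ) * freq_ghz


def spine_span_mm():
    """Distance from the first to the last spine in mm (7 gaps between 8 spines)."""
    return (NUM_SPINES - 1) * SPINE_PITCH_UM / 1000.0


if __name__ == "__main__":
    for f_ghz in (3.0, 4.0, 5.0):
        print(f"{f_ghz:.1f} GHz -> {global_clock_power_w(f_ghz):.2f} W")
    print(f"first-to-last spine span: {spine_span_mm():.1f} mm")
```

Under the linear-scaling assumption the combined figure is 2.5 W/GHz, or 12.5 W at the 5-GHz scaling target, and the first-to-last spine span of about 8.4 mm fits within either of the quoted die dimensions.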
43.7 INTEL ITANIUM

The clock design of an 800-MHz Itanium microprocessor is presented in Ref. [11]. The microprocessor is the first implementation of Intel’s IA-64 architecture. Its core contains 25.4 million transistors and is fabricated on a 0.18-µm, six-layer-metal CMOS process. The high level of integration requires significant silicon real estate and high clock loading. The large die size and the small feature size result in prominent within-die process variation. Hence, the Itanium processor uses an active deskewing scheme in conjunction with a combined balanced clock tree and clock grid to distribute the clock over the die. The design also provides enough flexibility for the local clock implementation to support intentional clock skew and time borrowing.

FIGURE 43.14 Clock distribution topology of the Itanium microprocessor. [Diagram labels: PLL, reference clock, main clock, CLKP, CLKN, VCC/2, GCLK, DSK, RCD, DLCLK, OTB; global distribution, regional distribution, local distribution.] (From Tam, S. et al., IEEE J. Solid-State Circuits, 35, 1545, 2000. With permission.)

FIGURE 43.15 Global core H-tree of the Itanium microprocessor. [Diagram labels: PLL and eight DSK clusters.] (From Tam, S. et al., IEEE J. Solid-State Circuits, 35, 1545, 2000. With permission.)

The clock system architecture is shown in Figure 43.14. The clock topology is partitioned into global distribution, regional distribution, and local distribution. In the global distribution, a core clock and a reference clock are routed from a PLL clock generator to eight deskew clusters via two identical and balanced H-trees. A schematic drawing of the global core clock tree is shown in Figure 43.15. The global clock tree is implemented exclusively in the two highest-level metal layers.
To reduce capacitive noise coupling and to ensure a good inductive return path, the tree is fully shielded laterally with VDD/VSS. In addition, inductive reflections at the branch points are minimized by properly sizing the metal widths for impedance matching.

The regional clock distribution encompasses the deskew buffer, the regional clock driver (RCD), and the regional clock grid. There are 30 separate clock regions, each consisting of these three elements. The 30 regional clocks are illustrated in Figure 43.16. Each of the eight deskew clusters consists of four distinct deskew buffers. Because 32 deskew buffers are available, two of them are unused. The deskew buffer is connected to the RCDs by a binary distribution network, which uses top-layer metals with complete lateral shielding. The RCDs are located at the top and bottom of the regional clock grid. The grid is implemented using M4 and M5. As with the global clock network, it contains full lateral shielding to ensure low capacitive coupling and good inductive return paths. The regional clock grid utilizes up to 3.5 percent of the available M5 and up to 4.1 percent of the available M4 routing over a region.

FIGURE 43.16 Thirty regional clocks of the Itanium microprocessor. [Legend: DSK = cluster of four deskew buffers; CDC = central deskew controller.] (From Tam, S. et al., IEEE J. Solid-State Circuits, 35, 1545, 2000. With permission.)

The deskew buffer architecture is shown in Figure 43.17. It is a digitally controlled DLL structure. A phase detector residing within the local controller of the deskew buffer analyzes the phase difference between the reference clock and a local feedback clock sampled from the regional clock grid. Then the core clock delay is adjusted through a digitally controlled analog delay line. Experimental skew measurements show that the total skew is 28 ps with deskewing and 110 ps without deskewing.

FIGURE 43.17 Deskew buffer architecture of the Itanium microprocessor. [Diagram labels: global clock, ref. clock, deskew buffer, delay circuit, local controller, TAP I/F, RCDs, regional clock grid.] (From Tam, S. et al., IEEE J. Solid-State Circuits, 35, 1545, 2000. With permission.)

The local clock distribution consists of local clock buffers (LCBs) and local clock routings that are embedded within a functional unit. The LCBs receive their input directly from the regional clock grid and then drive the clocked sequential elements.

43.8 INTEL ITANIUM 2

The clock distribution of the 1-GHz Itanium 2 processor is described in Refs. [12,25]. The chip is fabricated on a 180-nm CMOS process with six layers of aluminum interconnects. The processor has 25 million logic transistors and 221 million total transistors. The die size is 21.6 mm × 19.5 mm.

FIGURE 43.18 Clock distribution of the Itanium 2 microprocessor. [Diagram labels: core primary driver, L1R, repeaters, SLCBs, L2R, gaters, to PLL, FSB clock, from latch pipe; points A, B, C.] (From Anderson, F.E., et al., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 146–147, 2002. With permission.)

The clock network of this processor is shown in Figure 43.18. Similar to that of the Itanium processor in Section 43.7, it can be partitioned into global distribution (L1R), regional distribution (L2R driven by second-level clock buffers [SLCBs]), and local distribution (driven by gaters). However, it has three significant differences from that of Itanium. First, the global clock network, which is also implemented as a balanced H-tree, applies differential routing to reduce jitter from supply noise, injected common mode noise, and signal slew rates. It is also heavily shielded to reduce jitter because of coupled noise.
Second, instead of grids, the regional distribution makes use of width- and length-balanced side-shielded H-trees. Third, no deskewing technique is used. The skew is minimized by precisely tuning the delay of the H-trees. It achieves a skew of 62 ps.

The clock distribution of a more advanced Itanium 2 processor is presented in Ref. [13]. This chip is fabricated on a 130-nm CMOS process with six layers of copper interconnects. It operates at 1.5 GHz at 1.3 V. It has a total of 410 million transistors with a die size of 374 mm². The main difference from its 180-nm predecessor is that this design implements a fuse-based deskewing technique to address the clock skew issue and to increase the frequency of operation. There are 23 regional clocks in the core. The SLCB associated with each region contains a 5-bit register that stores the deskew setting. The register controls the delay of the SLCB. On-chip electrically programmable fuses are incorporated to set the register values. To reduce the area required for the fuses, only three of the five deskew setting bits can be addressed with fuses. When the device is under test, all five deskew bits can be accessed using SCAN for finer resolution. The fuse-based deskew can remove unintentional clock skew caused by on-die process variations and clock network design mismatches. It can also inject intentional skew to improve the critical timing paths. A fuse-based deskew scheme is selected over an active scheme because of the deterministic nature of the fuse-based algorithm and its simple implementation. The intrinsic skew without using any deskew technique is 71 ps. The skew reduces to 24 ps when operating with the 3-bit resolution fuse-based deskew. It further reduces to 7 ps when the 5-bit resolution SCAN-based deskew is applied.

The clock distribution of a dual-core Itanium 2 processor, code-named Montecito, is described in Ref. [14].
The chip is fabricated on a 90-nm CMOS process with seven layers of copper interconnect and it has 1.72 billion transistors with a die size of 21.5 mm × 27.7 mm [26]. It implements a dynamically variable-frequency clock system to support a power management scheme, which maximizes processor performance within a configured power envelope [27]. Its clock distribution delivers a variable-frequency clock from 100 MHz to 2.5 GHz over a clock network more than 28 mm long.

FIGURE 43.19 Clock distribution of the dual-core Itanium 2 microprocessor. [Diagram labels: bus clock, PLL, repeaters, L0 route, L1 route, L2 route, postgater route, Core0, Core1, Foxton, IOs, bus logic, DFD, CVD, SLCB, RAD, gaters, latches. Legend: differential, single ended, variable-frequency full rail transitions, fixed frequency low-voltage swings.] (From Mahoney, P., et al., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 292–293, 2005. With permission.)

The clock network consists of four stages, as shown in Figure 43.19. The first stage is the level-0 (L0) route, which connects the PLL to 14 digital frequency dividers (DFDs). The L0 route is the only stage that does not adjust supplies and frequencies during normal operation. The L0 route is 20 mm long, consisting of four 5-mm segments that are 400-mV low-voltage-swing differential routes. Each segment is resistively terminated at the receiver and is tapered to optimize RLC flight time and reduce power consumption. All route segments are matched in composition in both layer and length. The second stage, the level-1 (L1) route, connects the DFD to 6–10 SLCBs. The DFD output varies in frequency and it operates on a varying core supply voltage.
A half-frequency distribution using differential 0° and 90° clocks is used. The third stage, the level-2 (L2) route, connects the SLCB to LCBs. A typical SLCB drives 400 LCBs at 200 different locations across 3 mm with a skew of less than 6 ps between locations. For this stage, instead of using a grid-based network as in many contemporary designs, a skew-matched RLC tree network technique is employed to reduce metal resources and power. An in-house tool is utilized to route the trees and to match route RLC delays using width and space. The resulting clock route is adaptable to changes in the design, and uses far less metal resources and power than a grid-based design while achieving skews that are nearly as low as in grid-based designs. The LCBs, called clock vernier devices, can add 70 ps of delay to any clock in 8 ps increments and are controlled via scan operations. They can facilitate postsilicon debug and remove skew not found in presilicon analysis. The fourth stage, the postgater route, is in the hands of the individual circuit designers. Clock gaters are designed by the clock team into the library in a variety of sizes. With hundreds of latches per gater, routes up to 2 mm long must be engineered for delay, shielding, and load matching.

Montecito implements an active deskewing system that runs continuously to null out offsets caused by process, temperature, and voltage variations across the die. The system relies on a hierarchical collection of phase comparators between the ends of different L2 routes (i.e., only the first three stages are corrected by deskewing). Each SLCB has a 128-bit delay line with 1-ps resolution. With active deskewing and scan-chain adjustments, the total clock-network skew is reduced to less than 10 ps.

REFERENCES

1. N. Bindal and E. Friedman. Challenges in clock distribution networks. In Proc. Intl. Symp. on Phys. Des., Monterey, CA, p. 2, 1999.
2. Q. K. Zhu. High-Speed Clock Network Design. Kluwer Academic, Boston, 2003.