Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C039 Finals Page 822 29-9-2008 #11

FIGURE 39.5 Inverter processing.
FIGURE 39.6 Cell expansion.
FIGURE 39.7 Off-path resizing.
FIGURE 39.8 Shattering.

cells generally present lower pin capacitances, and so may improve the delay on a timing-critical net, though this could hurt the delay of another path. The timing analyzer and the optimization metric are the arbiters of whether an optimization suggestion is accepted by PDS. When correcting hold violations (short paths), the off-path cells can be powered up to present higher pin capacitance and slow down a path.

• Shattering: Similar to cell expansion, a larger fan-in cell can be decomposed into a tree of smaller cells. This may allow the most critical path to move ahead through a faster, smaller cell. Figure 39.8 shows how the delay through pins A and B of a five-input AND gate can be reduced by shattering the gate into three NAND gates, so that A and B only need to propagate through a cell with less complexity. Merging, the opposite of shattering, can also be an effective timing optimization. A rule of thumb is that merging is good when the slacks at the inputs of a tree are similar, and shattering is good when there is a wider distribution of slacks.

Note that the optimizations are atomic actions and are synergistic. Optimizations can call other optimizations. For example, a box could be shattered, then pin-swapped, and finally resized, and the new solution could then be accepted or rejected based on the combined effect of these optimizations.
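The shattering rule of thumb above can be illustrated with a toy delay model. This is a minimal sketch, assuming gate delay simply equals fan-in; the arrival times and tree shapes are illustrative, not PDS internals.

```python
# Sketch: why shattering helps late-arriving inputs.
# Delay model (an assumption, not from the chapter): gate delay = fan-in.

def output_arrival(tree, arrivals):
    """tree: ('gate', [children]) where a child is a subtree or an input name."""
    kind, kids = tree
    kid_times = [arrivals[k] if isinstance(k, str) else output_arrival(k, arrivals)
                 for k in kids]
    return max(kid_times) + len(kids)  # gate delay = fan-in

arr = {'A': 5, 'B': 5, 'C': 0, 'D': 0, 'E': 0}  # A and B arrive late (critical)

flat = ('AND', ['A', 'B', 'C', 'D', 'E'])
# Shattered: the late inputs A and B pass through only small gates.
shattered = ('AND', [('AND', ['A', 'B']), ('AND', ['C', 'D', 'E'])])

print(output_arrival(flat, arr))       # 5 + 5 = 10
print(output_arrival(shattered, arr))  # max(5 + 2, 0 + 3) + 2 = 9
```

With similar input slacks the shattered tree is deeper and would be slower, which is the merging side of the rule of thumb.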
39.4.4 ADVANCED SYNTHESIS TECHNIQUES

The descriptions of some of the above incremental synthesis optimizations are deceptively simple. For example, Figure 39.6 shows an XOR decomposed as two inverters and three NAND gates. It could also be implemented as two inverters, two ANDs, and an OR; two inverters, one OR, and two NANDs; or three inverters, an AND, and two NANDs; etc. An optimization like cell expansion examines several decompositions based on rules of thumb, but does not explore the expansion possibilities in any systematic way.

Another way of accomplishing cell expansion and many of the other optimizations is through logic restructuring [9], which provides a systematic way of looking at functional implementations. In this method, seed cells are chosen and a fan-in- and depth-limited cone is examined for reimplementation to achieve timing or area goals. The seed box and its restricted cone of logic are represented as a binary decision diagram (BDD) [10]. This provides a canonical form from which different logic structures can be implicitly enumerated and evaluated. When a new structure is chosen, it is implemented based on the primitives available in the library, and the new cells are placed and sized. The restructuring process can be thought of as a new technology mapping of the selected cone.

Advanced synthesis techniques can be computationally intensive because, for a large cone, the number of potential implementations can be huge. The fan-in and depth constraints must be chosen so as to balance design quality with runtime. However, these techniques are quite effective and are especially useful for high-performance microprocessor blocks, which typically are small yet have very aggressive timing constraints.

39.4.5 FIXING EARLY PATHS

Timing closure consists of correcting both long (late mode) and short (early mode) paths.
The delay of long paths must be decreased because the signal is arriving at the register a cycle too late, while the delay of short paths must be increased because the signal is arriving a cycle too early. The strategy we use in PDS is to correct the long paths without consideration of the short paths, then do short-path correction as a postprocess in such a way as to lengthen the paths without causing a long path to become more critical. This can be tricky because it is possible that all the boxes along a short path are intertwined with a long path.

Doing short-path correction requires that there be (at least) two timing models active: early-mode timing tests are done with a slow clock and fast data, while late-mode tests are done with fast clocks and slow data. The presence of two timing models enables correction of the early-mode paths while minimizing or reducing any adverse effects on the late-mode paths. In PDS, short-path correction is done very late in the process, after routing and with SPICE extraction.

The premier way of correcting short paths is by adding delay pads (similar to buffers) along the path to slow it down. In some cases, short-path nets can be reconnected to existing buffers (added for electrical violations or long-path correction) to slow down the path. This can correct the path without incurring the area overhead of a new pad. As noted above, resizing to a slower cell or powering up side-path cells can also be used for early-mode correction.

39.4.6 DRIVERS FOR MULTIPLE OBJECTIVES

The previous discussion covered transforms primarily in the context of improving timing. However, other objectives like wirelength, routing congestion, or placement congestion can be addressed by the same set of optimizations or transforms.
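The early-mode pad insertion of Section 39.4.5 can be sketched as follows. This is a minimal sketch under assumed numbers: the pad delay, hold time, and setup time are illustrative, and real correction would use full dual-mode timing analysis rather than a single path delay.

```python
# Sketch of early-mode (hold) fixing with two timing checks.
# All constants are illustrative assumptions, not PDS values.

PAD_DELAY = 3  # delay contributed by one hypothetical delay pad

def slacks(path_delay, clock_period, hold_time=2, setup_time=4):
    late_slack = clock_period - setup_time - path_delay   # setup test (late mode)
    early_slack = path_delay - hold_time                  # hold test (early mode)
    return early_slack, late_slack

def fix_hold(path_delay, clock_period):
    """Add delay pads until the hold test passes, never breaking setup."""
    pads = 0
    early, late = slacks(path_delay, clock_period)
    while early < 0 and late - PAD_DELAY >= 0:
        path_delay += PAD_DELAY       # lengthen the short path
        pads += 1
        early, late = slacks(path_delay, clock_period)
    return pads, early, late

print(fix_hold(path_delay=1, clock_period=20))  # -> (1, 2, 12)
```

The guard on the late-mode slack captures the requirement that short-path correction must not make a long path more critical.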
To facilitate the use of transforms for multiple objectives, PDS employs a driver-transform paradigm. Programs are split into local transforms and drivers. The transforms are responsible for the actual manipulation of the logic, for example, adding a buffer, moving or resizing a cell, etc. The driver is responsible for determining the sections of logic that need to be optimized. If the optimization goal is electrical correction, the driver will pick cells that violate slew or capacitance limits; if the goal is timing, it will pick cells that lie on the critical paths; if the goal is area reduction, the driver will choose cells in the noncritical region, where slack can be sacrificed for area. The transforms understand their goals (e.g., whether they should be trying to save area or time) and adjust their actions accordingly.

The drivers are also responsible for determining which transforms should be applied in what order. Given a list of available optimizations, the driver may ask for evaluations of the characteristics of applying each transform, then choose the order of application based on a cost/benefit analysis (in terms of timing, area, power, etc.). A driver may also leave it to the transform to decide when to apply, in which case the order of the transform list given to the driver becomes quite important. There are a variety of drivers available in PDS.

• The most commonly used one is the critical driver, which picks a group of pins with negative slack and sends the pins to the transforms for evaluation. Because transforms can interact, the critical driver iterates both on its current list and, when no more can be done on that list, on sets of lists. To conserve runtime, it "remembers" the transforms that have been tried and does not retry failed attempts.
• The correction driver is used to filter nets that violate their capacitance or slew constraints, which can then be used with a transform designed to fix these violations.
• Levelized drivers present the design in "input to output" or "output to input" order, and are useful in areas like global resizing, where it is desirable to resize all of the sink cells before considering the driving cell.
• There is a randomized driver that provides pins in a random order so that an optimization that relies on the order of pins may discover alternate solutions.
• The histo driver is used in the compression phase to divide all the failing paths into slack ranges and then work iteratively on each range.
• Of special importance is the list driver, which simply provides a predetermined list of cells or nets for the transform to optimize. This enables the designer to select specific pieces of the design for optimization while in an interactive viewing session. The designer's selection is made into a list of objects for optimization to be processed by the list driver.

In summary, PDS contains a large number of atomic transformations and a variety of drivers that can invoke them. This yields a flexible and robust set of optimizations that can be included in a fixed-sequence script or can be used directly by designers.

39.5 MECHANISMS FOR RECOVERY

During the PDS flow, optimizations may occur that damage the design. Local regions could become overfull, legalization could slide critical cells far away, unnecessary wiring could be introduced, etc. This is inevitable in such a complex system. Thus, a key component of PDS is its ability to gracefully recover from such conditions. We now overview a few recovery techniques.

39.5.1 AREA RECOVERY

The total area used by the design is one of the key metrics in physical synthesis. The process of reducing area is known as area recovery.
The goal of area recovery is to rework the structure of the design so as to use less area without sacrificing timing quality; this contrasts with other work in area reduction, which makes more far-reaching design changes [11] or changes logic cell or IP designs to be more area efficient [12]. Aside from the obvious benefits of allowing the design to fit on the die, or of actually reducing die size, reduction in area also contributes to better routability, lower power, and better yield. It is especially useful as a recovery mechanism because it can create placement space in congested areas that other optimizations can then exploit. Recall the previously discussed SPI bin model: for a bin of size 1000, if area recovery can reduce the used cell area from 930 to 800, this increases the free space available for other cells (such as buffers) from 70 to 200.

When a design comes into physical synthesis from logic synthesis, the design has normally been optimized with a simplified timing model (e.g., constant delay or wireload). Once the design is placed, routes can be modeled using Steiner estimates or actual global or detailed routes. As more accurate information is known about the real delays due to wires, logic can be restructured to reduce area without impacting timing closure or other design goals.

For example, for most designs, a plurality of the nets will be two-pin nets. A wireload model will give the same delay for every two-pin net. Obviously, this is a very gross estimate, as some such nets may be only a few tracks long, while others could span the entire chip. Paths with shorter-than-average nets may have seemed critical during logic synthesis but, once the design is placed and routed, are noncritical, while paths with longer-than-average nets may be more critical than predicted during logic synthesis.
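The mismatch between wireload and placed delay estimates can be sketched as follows. The constants are illustrative assumptions; a real flow would use Steiner estimates or extracted routes rather than a linear per-track model.

```python
# Sketch: how placement information reclassifies two-pin net criticality.
# A wireload model charges every two-pin net the same delay; after
# placement, delay can be estimated from actual net length.
# Both constants are illustrative assumptions, not PDS values.

WIRELOAD_DELAY = 10.0    # flat per-net delay assumed by logic synthesis
DELAY_PER_TRACK = 0.5    # placed-delay estimate per routing track

def placed_delay(length_tracks):
    return DELAY_PER_TRACK * length_tracks

def surprise(length_tracks):
    """Positive: net is more critical than logic synthesis predicted."""
    return placed_delay(length_tracks) - WIRELOAD_DELAY

for net, length in [('short_net', 4), ('long_net', 60)]:
    print(net, surprise(length))   # short_net -8.0, long_net 20.0
```

Paths dominated by nets with negative "surprise" are the natural targets for area recovery, since their placed timing is better than logic synthesis assumed.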
PDS timing optimizations can also create a need for area recovery when there are multiple intersecting critical paths. For example, in Figure 39.9, PDS will first optimize B because its slack is more critical than A's. Using gate sizing, PDS may change B to a larger B′ and thereby improve its slack from −15 to +20. Next it will optimize A, improving its slack from −10 to +10 and also taking more area. This may improve the slack at B′ to +30. Area reduction might then be applied to change B′ to B″, reducing both area and slack.

FIGURE 39.9 Physical synthesis creates opportunities for area recovery.

A good strategy is to have area recovery work on the nontiming-critical paths of the design and give up delay to reduce area. Normally, the noncritical regions constitute a huge percentage (80 percent or more) of the gates in the design, so care must be taken to use very efficient algorithms in this domain. In addition to being careful about timing, area reduction optimizations must take care not to disturb other design goals, such as electrical correctness and signal integrity.

By far, the most effective method of reducing area is sizing down cells. Aside from being extremely effective, this method has the advantages of being nondestructive to placement (because the new cell will fit where the old one was) and minimally disruptive to wiring (the pins might move slightly). Care must be taken when reducing the size of drivers of nets so that they do not become crosstalk victims (Chapter 34). When optimizing without coupling information, noise can be captured to the first order by a slew constraint. As a rule of thumb, Ref.
[13] recommends that nets in the slack range of zero to 15 percent of the clock period not have the size of their drivers reduced.

Because timing optimizations nearly always add area, a good rule of thumb for area reduction techniques is that they are the reverse of timing optimizations. So area reduction can remove buffers, inverter pairs, or hold-violation pads so long as timing and electrical correctness goals are preserved [14]. It can force the use of multilevel cells (such as XORs and AOs), which are normally smaller and slower than equivalent single-level implementations. If appropriate library functions are available, it can manipulate inverters to, for example, change NANDs to ANDs, ORs, or NORs, if the new configuration is smaller and maintains other design goals. These types of optimizations may be applied locally in a pattern-matching kind of paradigm. For example, each noncritical buffer in a design could be examined to determine whether it can be removed. Another, more general, approach would be to simultaneously apply area-reduction techniques by re-covering a section of logic to produce a different selection of technology cells. In this context, re-covering involves a new technology mapping for the selected logic with an emphasis on area, and is more frequently used in logic synthesis, as the placement and routing aspects of physical synthesis make this technique extremely complex. Some success at re-covering small, fairly shallow (two to four levels) sections of logic has been reported [15].

A useful adjunct to area reduction is yield optimization. Overall critical-area-analysis (CAA) yield scores [16] can be reduced by considering individual CAA scores for the library cells and using this as part of the area reduction scheme. For example, suppose a transform wants to reduce the size of a particular cell. Two functionally identical cells may be of the same size, and either could be used in the context of the cell to be downsized.
However, one may have a better CAA score than the other (though slightly different auxiliary characteristics like delay and input capacitance), so the better-scoring cell should be used. Of course, area reduction generally improves CAA scores by reducing the active area of the design.

In some cases, it is desirable to apply area reduction even in the critical-slack range. When a design, or part of a design, is placement congested, it is sometimes a good strategy to sacrifice some negative-slack paths by making them slower but smaller to create room to improve paths with even worse slack. Again, resizing is a good example. Suppose a path has a slack of −50 and it would be desirable to upsize a cell on the path, but there is no room to do so. Downsizing a cell in the neighborhood, degrading its slack from −2 to −4, may make sense as long as the loss from the downsizing is less than the gain from the upsizing. Typically, this kind of trade-off is made early in the physical synthesis process.

The effectiveness of area recovery is very dependent on the characteristics of the design, on the logic synthesis tool used to create it, and on the options used for the tool. Reductions in area of around 5 percent are typical, but reductions in excess of 20 percent have been observed.

39.5.2 ROUTING RECOVERY

Total wirelength and routing congestion can also be recovered. Damage to wirelength can be caused by legalization, buffering, or timing-driven placement. For example, when one first buffers a net, it may use a timing-driven Steiner topology (Chapter 25). Later, when one discovers that this net is not critical and meets its timing constraint, it can be rebuffered with a minimum Steiner tree (Chapter 24) to reduce the overall wirelength. PDS has a function that rebuilds all trees with positive slack and sufficiently high windage, defined as follows. A net with k − 1 buffers is divided by them into k trees. Let T_k be the sum of the minimum Steiner wirelengths of these k trees.
Let T_0 be the wirelength of the minimum Steiner tree with all the buffers removed. Windage is the value of T_k − T_0. Nets with high windage are potentially good candidates for wirelength reduction through alternative buffer placement.

One can also deploy techniques to mitigate routing congestion. A buffer tree that goes through a routing-congested region likely cannot be rerouted easily unless one also replaces the buffers. Smaller spacing between buffers reduces the flexibility of routing, so these problems must be handled before routing is required. PDS has a function that identifies buffer trees in routing-congested regions and rebuilds them so that they avoid the congested routing resources, using algorithms described in Chapter 28. Routing congestion can also be mitigated by spreading the placement using diffusion [17]. Of course, wiring congestion can occur independently of buffers. As noted earlier, PDS has programs that will reduce wirelength by moving boxes (also using the windage model) and by pin swapping within fan-in trees.

39.5.3 VT RECOVERY

As explained previously, for multi-vt libraries, trade-offs among vt levels can have significant impacts on leakage power and delay. In some instances, low-vt cells may have been used to speed up the design, but subsequent optimization may have made the use of low-vt unnecessary. In terms of Figure 39.9, it could be that the change from B to B′ was actually a vt assignment, in which B was a high-vt cell while B′ was low-vt. Once A has been changed to further improve timing, it may be possible to change B′ back to a higher-vt cell to reduce power. In fact, a reasonable strategy for timing closure is to use low-vt cells very aggressively to close on timing, even though this likely will completely explode the power budget.
Then, vt recovery techniques can attempt to reduce power as much as possible while maintaining timing closure.

39.6 OTHER CONSIDERATIONS

This chapter focuses primarily on physical synthesis in the context of a typical flat ASIC design style. However, PDS is also used to drive timing closure for hierarchical designs and for designing the sub-blocks of high-performance microprocessors. We now discuss some issues and the special handling required to drive physical synthesis in these regimes.

39.6.1 HIERARCHICAL DESIGN

Engineers have been employing hierarchical design since the advent of hardware description languages. What has changed over the years is the degree to which the hierarchy is maintained throughout the design automation flow. The global nature of optimizations like placement, buffering, and timing means it is certainly simpler for PDS to handle a flat design. However, PDS is just one consideration for designers in terms of whether they design flat or hierarchically. Despite the simplicity of flat design, as of this writing, hierarchical design is becoming more prevalent. There are several reasons for this:

• Design size: The available memory in hardware may be insufficient to model the entire design properly. Although hardware performance may also be an issue, it can often be mitigated through various thread-parallel techniques.
• Schedule flexibility: The design begins naturally partitioned along functional boundaries. A large project, employing several engineers, will not be finished all at once. Hierarchical design allows for disparate schedules among the various partitions and design teams. This is especially true for microprocessor designs.
• Managing risk: Engineers cannot afford to generate a great deal of VHDL and then simply walk away. In some cases, the logic design process is highly interactive.
The design automation tools must successfully cope with an ever-changing netlist in which logic changes may arrive very late in the schedule. By partitioning the design, it is possible to limit the impact of these changes, protecting the large investment required to reach the current design state.
• Design reuse: It is common to see the same logic function replicated many times across the design. In a fully automated methodology, one can uniquely construct and optimize each instance of this logic. If the uses are expanded uniquely, each use can be optimized in the context in which it is used. If the physical implementation is reused, then the block must be optimized so that it works in all of its contexts simultaneously, which is a more challenging task. However, common practice shows that even the so-called fully automated methodologies require a fair amount of human intervention. Although reuse does present more complexity, there is a point (number of instances) for every design at which the benefit of implementing the logic just once outweighs the added complexity.

After choosing a hierarchical design automation methodology, the single most important decision impacting physical synthesis is the manner in which the design is partitioned. One may work within the boundaries implied by the logic design, or instead one may completely or partially flatten the design and allow the tools to draw their own boundaries. The first choice, working within the confines of the logic design, is still the most common use of hierarchy.

Hierarchical processing based on logical partitioning involves getting a leaf set of logical partitions (perhaps using node reduction as described below) and then using those partitions as physical entities, which are floorplanned.
In this sense, the quality of the logical partitioning is defined by the quality of the corresponding physical and timing partitioning, which in turn directly affects the difficulty of the problem presented to PDS. But this is a source of conflict in developing the design. From a functional point of view, for example, the designer might develop a random logic block that describes the control flow for some large section of dataflow logic. This is a good functional decomposition and is probably good for simulation, but it may not be good physically, because in reality one would not want the control to be segregated in a predefined area by itself, but would want it interspersed among the dataflow blocks.

The distribution of function within the logical hierarchy may make it impossible to successfully execute physical synthesis. Attributes of an optimal physical partitioning include a partition placement and boundary pin assignment that construct relatively short paths between the partitions. Attributes of an optimal timing partitioning include paths that do not go in and out of several partitions before being captured by a sequential element, with signals being launched or captured logically close to the hierarchical boundaries. An effective partitioning also includes a distribution of chip resources.

The first step is to reduce the number of hierarchical nodes by collapsing the design hierarchy. Collapsing the design hierarchy removes hierarchical boundaries that would constrain PDS. In practice, this node reduction is limited only by the performance of the available tools. It is possible (even probable) that some logic functions get promoted all the way to the top level if they interface with multiple partitions. In our earlier example, the control flow logic partition would be a good candidate to promote to the top level so its logic could be distributed as needed.
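The node-reduction step above can be sketched as a simple filter over candidate partitions. This is an illustrative sketch: the block names, the size threshold, and the interface-count rule are assumptions, not the actual PDS heuristics.

```python
# Sketch: node reduction on a design hierarchy. A block below a size
# threshold is collapsed (promoted into its parent); a block wired to
# several sibling partitions is promoted to the top level so its logic
# can be distributed. Thresholds and names are illustrative assumptions.

def reduce_nodes(blocks, min_cells=1000):
    """blocks: name -> (cell_count, set of partitions interfaced with)."""
    leaf_partitions, promoted = [], []
    for name, (cells, interfaces) in blocks.items():
        if len(interfaces) > 1:          # e.g., control flow for many blocks
            promoted.append(name)        # flatten into the top level
        elif cells >= min_cells:
            leaf_partitions.append(name) # keep as a physical partition
        else:
            promoted.append(name)        # too small to floorplan separately
    return leaf_partitions, promoted

blocks = {
    'dataflow_a': (50_000, {'top'}),
    'dataflow_b': (40_000, {'top'}),
    'ctl':        (2_000,  {'dataflow_a', 'dataflow_b'}),  # control flow
    'glue':       (300,    {'top'}),
}
print(reduce_nodes(blocks))
# ctl interfaces with two partitions and glue is tiny: both move to top level
```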
As noted above, one of the motivating factors for doing hierarchical design is to manage risk by limiting the impact on the design of logic changes to a logical partition. Collapsing nodes can reduce this advantage of hierarchy, so there is again a conflict between obtaining a good physical representation and maintaining the logic hierarchy for engineering changes.

The next step, floorplanning, is to assign space on the chip image to each partition while reserving some space for top-level logic. These two steps, although guided by automated analysis, usually require a fair amount of human intervention.

To run PDS on a partition out of the context of the rest of the design hierarchy, sufficient detail regarding the hierarchical boundaries must be provided. The floorplanning step specifies the outline of the hierarchical boundary. What remains is the determination of the location of the pins on the hierarchical boundary and their timing characteristics. These details are best determined by viewing the design hierarchy as "virtually flat" and performing placement and timing analysis. A virtually flat placement simultaneously places all partitions, allowing hierarchical boundary pins to float, while constraining the contents of each partition to adhere to the floorplan. The hierarchical boundary pins are then placed at the intersection of the hierarchical net route with the outline of the partition. The timing details for hierarchical boundary pins can be calculated by constructing a flat timing graph for the hierarchical design. Once the hierarchical boundary paths have been timed, the arrival and required arrival times should be adjusted by apportioning the slack.
This process of slack apportionment involves examining a timing path that crosses hierarchical boundaries and determining what portion of that path may be improved through physical synthesis. To perfectly solve this problem, the slack apportionment algorithm would have to encompass the entire knowledge base of the optimization suite. Because this is impractical, one must rely upon simple heuristics. The elastic delay of a particular element in a hierarchical path can be modeled as a simple weight applied against the actual delay. If it is known that a portion of the design will not be changing much, one would assert a very low elasticity. In the case of a static random access memory (SRAM) or core, a zero elasticity would be used. Once the elastic delay along the hierarchical path is determined, the slack is apportioned between the partitions based upon the relative amount of elastic delay contained within each partition. In addition to timing, capacitance and slew values are apportioned to the hierarchical pins. The result is a set of hierarchical boundary pin placements and timing assertions that allow physical synthesis to be executed on each partition individually.

Once all of the blocks have been processed out of context, all of the sequentially terminated paths within a block have been fully optimized, but there still may be some improvement needed on cross-hierarchy paths. In Figure 39.10, consider the path between sequential elements S1 and S2. Two cells on the path are in block 1 and three cells are in block 2. There is a global net between them going from block pin 1 to block pin 2. There are timing and other assertions on BP1 and BP2 that were developed during the apportionment phase. Out-of-context optimization on block 1 and block 2 may have made these assertions incorrect. At this point, one wants to reoptimize this path in a virtually flat way by traversing the path hierarchically and applying optimization along it with accurate (nonapportioned) timing.
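The elastic-delay heuristic for slack apportionment described above can be sketched as follows. The delays, weights, and partition names are illustrative assumptions; only the proportional-to-elastic-delay rule comes from the text.

```python
# Sketch of slack apportionment by elastic delay. Each path element has a
# delay and an elasticity weight (zero for parts that cannot change, such
# as an SRAM access). The path slack is divided among partitions in
# proportion to the elastic delay each one contains.

def apportion_slack(path, slack):
    """path: list of (partition, delay, elasticity) tuples."""
    elastic = {}
    for part, delay, elasticity in path:
        elastic[part] = elastic.get(part, 0.0) + delay * elasticity
    total = sum(elastic.values())
    return {part: slack * e / total for part, e in elastic.items()}

path = [
    ('block1', 10, 1.0),
    ('block1', 5,  1.0),
    ('top',    8,  0.5),   # global net, only partially improvable
    ('block2', 6,  0.0),   # SRAM access: zero elasticity
    ('block2', 16, 1.0),
]
print(apportion_slack(path, slack=-70))
# -> {'block1': -30.0, 'top': -8.0, 'block2': -32.0}
```

Each partition's share of the negative slack becomes the timing assertion against which it is optimized out of context.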
Note that no additional optimization needs to be done on the logic cloud between sequentials S0 and S1, because no timing approximation was needed there during out-of-context optimization.

FIGURE 39.10 Hierarchical design example.

Further, when the hierarchical optimization is done on the S1-to-S2 path, no timing information is needed for the logic between S0 and S1. Eliding the timing on such paths reduces the CPU time and the memory footprint needed for hierarchical processing. Again referring to Figure 39.10, top-level optimization may be performed to buffer the net from BP1 to BP2.

39.6.2 HIGH-PERFORMANCE CLOCKING

In microprocessor designs, clock frequencies are significantly higher than for ASICs and the transistor counts are large as well. Thus, the global clock distribution can contribute up to 50 percent of the total active power in high-performance multigigahertz designs. In a well-designed balanced clock tree, most of the power is consumed at the last level of the tree, that is, the final stage of the tree that drives the latches. The overall clock power can be significantly reduced by constraining each latch to be as physically close as possible to the local clock buffer (LCB) that drives it. Figure 39.11 shows this clustering of latches around the LCB. One might think that constraining latches in this manner could hurt performance because latches may not be ideally placed. However, generally there is an LCB fairly close to a latch's ideal location, which means the latch does not have to be moved far to be placed next to an LCB. Further, there can be a positive timing effect, because skew is reduced when all the latches are clustered around local clock buffers (as shown in Figure 39.12).
Savings in power are obtained as a result of the reduction in the wire load being driven by the clock buffer. We have found empirically that clustering latches in this manner reduces the capacitive load on the LCB by up to 40 percent compared to unconstrained latch placement; this directly translates into power savings for the local clock buffer.

FIGURE 39.11 Latch clustering around LCBs in a high-performance block.
FIGURE 39.12 Cluster of latches around a single LCB.

39.6.3 POWER GATING TO REDUCE LEAKAGE POWER

The exponential increase in leakage power has been one of the most challenging issues in sub-90 nm CMOS technologies. Power gating is one of the most effective techniques to reduce both subthreshold leakage and gate leakage, as it cuts off the path to the supply [18,19]. Conceptually, it is a straightforward technique; however, the implementation can be quite tricky in high-performance designs, where the performance trade-off is constrained to less than 2 percent frequency loss due to power-gate (footer/header switch) insertion. Figure 39.13 shows a simple schematic of a logic block that has been power gated by a header switch (PFET) or a footer switch (NFET). Obviously, footer switches are preferred due to the better drive capability of NFETs. Operationally, if the logic block is not active, the SLEEP signal can turn off the NFET (footer switch) and the virtual ground (drain of the NFET) will float toward Vdd (the supply voltage), thereby reducing the leakage by orders of magnitude. Introducing a series transistor (footer/header) in the logic path results in a performance penalty. This penalty can be mitigated by making the footer/header larger so as to reduce its series resistance. However, the leakage benefit reduces with increasing size of the power gate.
Practically, in low-power applications, over 2000 times leakage saving can be obtained at the expense of an 8 to 10 percent reduction in performance. However, in high-performance designs, this is a relatively large performance penalty, so larger power gate sizes are chosen (approximately 6 to 8 percent of logic area) to achieve less than 2 percent performance penalty with over 20 times leakage reduction.

In general, power gating can be physically implemented using block-based coarse-grained power gating or intrablock fine-grained power gating (similar to multiple supply voltages). In a block-based implementation, the footer (or header) switches surround the boundary of the block, as shown in Figure 39.14. This physical implementation is easier because it does not disturb the internal layout of the block. However, it has a potential drawback in terms of a larger IR drop on the virtual ground supply. For IP blocks, this is the preferred implementation technique for power gating.

FIGURE 39.13 Power gating using header/footer switches.
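The sizing trade-off described above can be sketched as a choice between the two operating points quoted in the text. The selection rule and the exact area numbers are illustrative assumptions; only the two (penalty, leakage-reduction) pairs are taken from the discussion.

```python
# Sketch: choosing a footer-switch size against a performance budget.
# Only the two operating points quoted in the text are anchored; the
# gate-area figures and the selection rule are illustrative assumptions.

design_points = [
    # (gate area as % of logic area, performance penalty %, leakage reduction x)
    (1.0, 9.0, 2000),   # low-power style: small gate, large leakage win
    (7.0, 2.0, 20),     # high-performance style: large gate, small penalty
]

def choose_gate(max_penalty_pct):
    """Pick the smallest-area point whose performance penalty fits the budget."""
    feasible = [p for p in design_points if p[1] <= max_penalty_pct]
    return min(feasible) if feasible else None  # min() sorts by area first

print(choose_gate(2.0))   # high-performance budget
print(choose_gate(10.0))  # relaxed low-power budget
```

With the tight 2 percent budget, only the large gate qualifies, reproducing the high-performance choice in the text; relaxing the budget makes the small gate, with its far larger leakage saving, the preferred point.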