Handbook of Algorithms for Physical Design Automation, Part 98

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C045 Finals Page 952 24-9-2008 #13

FIGURE 45.15 Memory/logic interconnect block. (Diagram labels: To logic, Memory block.)

interconnect blocks; crosses indicate programmable connections. The flexibility of the connection block, F_m, can be defined as the number of programmable connections available between each horizontal pin and the adjacent vertical channel. In Figure 45.14, F_m = 4. In Ref. [Wilton99], it is shown that a value of F_m between 4 and 7 works well.

To increase routability, the architecture in Figure 45.14 includes dedicated tracks for memory-to-memory connections. These tracks are used when multiple memory arrays are cascaded together to form larger user arrays, and are more efficient for such memory-to-memory connections. EMBs can also be used to implement logic by configuring them as large ROMs [Cong98] [Wilton00].

45.5.2 DISTRIBUTED MEMORY

Commercial FPGAs such as Xilinx's Virtex-4, Virtex-II, and Spartan-3 devices allow the 4-input LUTs in their logic blocks to be configured as 16 × 1-bit memories [Xilinx05a]. These memories have synchronous inputs. Their outputs can be made synchronous through the use of the LUT's associated register. These 16 × 1-bit memories can also be cascaded to implement deeper or wider memory arrays through specialized logic resources.

Another method for supporting distributed memory is proposed in Ref. [Oldridge05]. This architecture allows the configuration memory in the interconnect switch blocks to be used as user memory and is very efficient for wide, shallow memories.
45.6 EMBEDDED COMPUTATION BLOCKS

45.6.1 MULTIPLIERS AND DSP BLOCKS

To address the performance requirements of digital signal processing (DSP) applications, FPGA manufacturers typically include dedicated hardware multipliers in their devices. Altera Cyclone II and Xilinx Virtex-II/-II Pro devices include embedded 18 × 18-bit multipliers, which can be split into 9 × 9-bit multipliers [Xilinx05a]. The Virtex-II/-II Pro devices are further optimized with direct connections to the Xilinx block RAM resources for fast access to input operands.

As manufacturers moved toward high-performance platform FPGAs, they began to include more complex dedicated hardware blocks, referred to as DSP blocks, which are optimized for a wider range of DSP applications. Altera's Stratix and Stratix II DSP blocks support pipelining and shift registers, and can be configured to implement 9 × 9-bit, 18 × 18-bit, or 36 × 36-bit multipliers that can optionally feed a dedicated adder/subtractor or accumulator [Altera05]. Xilinx Virtex-4 XtremeDSP slices contain a dedicated 18 × 18-bit 2's complement signed multiplier, adder logic, a 48-bit accumulator, and pipeline registers. They also have dedicated connections for cascading DSP slices, with an optional wire-shift, without having to use the slower general routing fabric [Xilinx05a].

This inclusion of dedicated multipliers or DSP blocks to complement the general logic resources results in a heterogeneous FPGA architecture. Research has considered what could be gained from tuning FPGA architectures to specific application domains, in particular DSP. The work in Ref. [Leijten03] deliberately avoids creating a heterogeneous architecture because the authors found that DSP applications contain both arithmetic and random logic, but that a suitable ratio between arithmetic and random logic is difficult to determine.
Instead, they develop two mixed-grain logic blocks that are suitable for implementing both arithmetic and random logic by looking at properties of the target arithmetic operations and of the 4-LUT. Their logic blocks are coarse-grained: each block can implement up to 4-bit addition/subtraction, 4 bits of an array multiplier, a 4-bit 2:1 multiplexer, or wide Boolean functions. At the same time, each logic block continues to be able to implement single-bit output random logic functions much like a normal LUT. Their architecture reduces configuration memory requirements by a factor of 4, which is good for embedded systems or those with dynamic reconfiguration, and offers higher flexibility for handling a range of proportions of datapath operations to random logic.

45.6.2 EMBEDDED PROCESSORS

The increase in the capacity of FPGAs has enabled the creation of entire systems on a chip. To support applications involving microcontrollers and microprocessors, FPGA manufacturers offer embedded processors tailored to interface with the FPGA logic fabric. There are two types of FPGA embedded processors: soft and hard.

Soft processors are intellectual property cores that have configurable features, such as caches, register file sizes, RAM/ROM blocks, and custom instructions. They are typically available as hardware description language descriptions and are implemented in the logic blocks of the FPGA. Altera and Xilinx have 32-bit reduced instruction set computer (RISC) processor cores that are optimized for their FPGAs: Nios/Nios II and PicoBlaze/MicroBlaze, respectively. Altera and Xilinx also offer development and debugging tools and other intellectual property cores that interface with their processors. The advantages of soft processors include the option to use and configure features only when they are needed, which reduces area, and the ability to include multiple processors on a single chip.
A Xilinx MicroBlaze requires as few as 923 LUTs [Xilinx05b] and can be used in the creation of multiprocessor systems. Because soft processors are implemented using logic resources, they are slower and consume more power than off-the-shelf processors.

Hard processors are dedicated hardware embedded on the FPGA. Altera Excalibur devices include the ARM 32-bit RISC processor, and Xilinx Virtex-4 and Virtex-II Pro devices include up to two IBM PowerPC 32-bit RISC processors [Altera02, Xilinx05b].

45.7 SUMMARY

This chapter has described the essential architectural features of contemporary FPGAs. Most commercial FPGAs contain small LUTs, in which logic is implemented. These LUTs are usually arranged in clusters, often with special support for arithmetic circuits (such as carry chains). Signals are transmitted between logic blocks using fixed metal tracks, connected using programmable switches. The topology of these tracks and switches makes up the device's routing architecture. In addition to logic blocks, modern FPGAs contain significant amounts of embedded memory and dedicated arithmetic functional blocks (such as multipliers). This chapter has set the stage for the next chapter, which describes physical design algorithms that target FPGAs.

REFERENCES

[Actel05a] Actel Corp., ProASIC3 Flash Family FPGAs Handbook, 2005. Available at: http://www.actel.com/documents/PA3_HB.pdf
[Actel05b] Actel Corp., Actel Quality and Reliability Guide, 2005. Available at: http://www.actel.com/document/RelGuide.pdf
[Ahmed04] E. Ahmed and J. Rose, The effect of LUT and cluster size on deep-submicron FPGA performance and density, IEEE Transactions on VLSI, 12(3): 288–298, March 2004.
[Altera02] Altera Corp., Excalibur Device Overview, May 2002. Available at: http://www.altera.com/literature/ds/ds_arm.pdf
[Altera05] Altera Corp., Stratix II Device Handbook, 2005. Available at: http://www.altera.com/literature/list_stx2.jsp
[Betz99] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, Norwell, MA, February 1999.
[Brown96] S. Brown, M. Khellah, and G. Lemieux, Segmented routing for speed-performance and routability in field-programmable gate arrays, Journal of VLSI Design, 4(4): 275–291, 1996.
[Chang96] Y.-W. Chang, D. Wong, and C. Wong, Universal switch modules for FPGA design, in ACM Transactions on Design Automation of Electronic Systems, Vol. 1, NY, January 1996, pp. 80–101.
[Cong98] J. Cong and S. Xu, Technology mapping for FPGAs with embedded memory blocks, in Proceedings of the 6th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, 1998, pp. 179–188.
[Dehon05] A. DeHon, Design of programmable interconnect for sublithographic programmable logic arrays, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2005, pp. 127–137.
[Ebeling96] C. Ebeling, D. Cronquist, and P. Franklin, RaPiD - Reconfigurable pipelined datapath, in International Conference on Field-Programmable Logic and Applications, Darmstadt, Germany, 1996, pp. 126–135.
[Ferrera04] S. P. Ferrera and N. Carter, A magnetoelectronic macrocell employing reconfigurable threshold logic, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2004, pp. 143–154.
[Goldstein00] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. Taylor, PipeRench: A reconfigurable architecture and compiler, Computer, 33(4): 70–77, 2000.
[Hauck00] S. Hauck, M. M. Hosler, and T. W. Fry, High-performance carry chains for FPGAs, IEEE Transactions on VLSI Systems, 8(2): 138–147, April 2000.
[Lattice05] Lattice Semiconductor Corp., LatticeXP Datasheet, 2005. Available at: http://www.latticesemi.com/lit/docs/datasheets/fpga/DS1001.pdf
[Leijten03] K. Leijten-Nowak and J. van Meerbergen, An FPGA architecture with enhanced datapath functionality, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2003, pp. 195–204.
[Lemieux04a] G. Lemieux and D. Lewis, Design of Interconnection Networks for Programmable Logic, Kluwer Academic Publishers, Norwell, MA, November 2004.
[Lemieux04b] G. Lemieux, E. Lee, M. Tom, and A. Yu, Directional and single-driver wires in FPGA interconnect, in IEEE International Conference on Field-Programmable Technology, Brisbane, Australia, December 2004, pp. 41–48.
[Lewis05] D. Lewis, E. Ahmed, G. Baeckler, V. Betz, M. Bourgeault, D. Cashman, D. Galloway, M. Hutton, C. Lane, A. Lee, P. Leventis, S. Marquardt, C. McClintock, K. Padalia, B. Pedersen, G. Powell, B. Ratchev, S. Reddy, J. Schleicher, K. Stevens, R. Yuan, R. Cliff, and J. Rose, The Stratix II logic and routing architecture, in ACM/SIGDA International Symposium on FPGAs, Monterey, CA, February 2005, pp. 14–20.
[Lin94] C.-C. Lin, M. Marek-Sadowska, and D. Gatlin, Universal logic gate for FPGA design, in Proceedings of the 1994 IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, November 1994, pp. 164–168.
[Marquardt00] A. Marquardt, V. Betz, and J. Rose, Speed and area trade-offs in cluster-based FPGA architectures, IEEE Transactions on VLSI, 8(1): 84–93, February 2000.
[Masud99] M. I. Masud and S. J. E. Wilton, A new switch block for segmented FPGAs, in International Workshop on Field Programmable Logic and Applications, Glasgow, U.K., August 1999, pp. 274–281.
[Mei03] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix, in International Conference on Field-Programmable Logic and Applications, Lisbon, Portugal, 2003, pp. 61–70.
[Oldridge05] S. W. Oldridge and S. J. E. Wilton, A novel FPGA architecture supporting wide, shallow memories, IEEE Transactions on Very-Large Scale Integration (VLSI) Systems, 13(6): 758–762, June 2005.
[Quick05] Quicklogic, Eclipse II Family Data Sheet, 2005. Available at: http://www.quicklogic.com/images/eclipse2_family_DS.pdf
[Rose90] J. S. Rose, R. J. Francis, D. Lewis, and P. Chow, Architecture of field-programmable gate arrays: The effect of logic block functionality on area efficiency, IEEE Journal of Solid-State Circuits, 25(5): 1217–1225, October 1990.
[Rose93] J. Rose, A. El Gamal, and A. Sangiovanni-Vincentelli, Architecture of field-programmable gate arrays, Proceedings of the IEEE, 81(7): 1013–1029, July 1993.
[Singh92] S. Singh, J. Rose, P. Chow, and D. Lewis, The effect of logic block architecture on FPGA performance, IEEE Journal of Solid-State Circuits, 27(3): 281–287, March 1992.
[Singh00] H. Singh, M.-H. Lee, G. Lu, F. Kurdahi, N. Bagherzadeh, and E. Chaves, MorphoSys: An integrated reconfigurable system for data-parallel and compute-intensive applications, IEEE Transactions on Computers, 49(5): 465–481, 2000.
[Singh01a] A. Singh, A. Mukherjee, and M. Marek-Sadowska, Interconnect pipelining in a throughput-intensive FPGA architecture, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2001, pp. 153–160.
[Singh01b] D. P. Singh and S. D. Brown, The case for registered routing switches in field programmable gate arrays, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2001, pp. 161–172.
[Sivaswamy05] S. Sivaswamy, G. Wang, C. Ababei, K. Bazargan, R. Kastner, and E. Bozorgzadeh, HARP: Hardwired routing pattern FPGAs, in ACM International Symposium on Field Programmable Gate Arrays, Monterey, CA, February 2005, pp. 21–32.
[Trimberger94] S. Trimberger, Field-Programmable Gate Array Technology, Kluwer Academic Publishers, Norwell, MA, 1994.
[Weaver04] N. Weaver, J. Hauser, and J. Wawrzynek, The SFRA: A corner-turn FPGA architecture, in ACM/SIGDA International Symposium on FPGAs, February 2004, pp. 3–12.
[Wilton00] S. J. E. Wilton, Heterogeneous technology mapping for area reduction in FPGAs with embedded memory arrays, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(1): 56–68, 2000.
[Wilton97] S. J. E. Wilton, Architecture and algorithms for field-programmable gate arrays with embedded memory, PhD thesis, University of Toronto, Toronto, Ontario, Canada, 1997.
[Wilton99] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, The memory/logic interface in FPGA's with large embedded memory arrays, IEEE Transactions on Very-Large Scale Integration Systems, 7(1): 80–91, March 1999.
[Xilinx05a] Xilinx Corp., Virtex-4 Users Guide, 2005. Available at: http://www.xilinx.com/support/documentation/user_guides/ug070.pdf
[Xilinx05b] Xilinx Corp., Processor IP Reference Guide, February 2005.
[Ye05] A. G. Ye and J. Rose, Using bus-based connections to improve field-programmable gate array density for implementing datapath circuits, in ACM/SIGDA Symposium on FPGAs, Monterey, CA, February 2005, pp. 3–13.
[Zeidman02] B. Zeidman and R. Zeidman, Designing with FPGAs and CPLDs, CMP Books, Upper Saddle River, NJ, 2002.
Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C046 Finals Page 957 9-10-2008 #2

46 FPGA Technology Mapping, Placement, and Routing

Kia Bazargan

CONTENTS

46.1 Introduction
46.2 Technology Mapping and Clustering
    46.2.1 Technology Mapping
    46.2.2 Clustering
46.3 Floorplanning
    46.3.1 Hierarchical Methods
    46.3.2 Floorplanning on FPGAs with Heterogeneous Resources
    46.3.3 Dynamic Floorplanning
46.4 Placement
    46.4.1 Island-Style FPGA Placement
    46.4.2 Hierarchical FPGA Placement
    46.4.3 Physical Synthesis and Incremental Placement Methods
    46.4.4 Linear Datapath Placement
    46.4.5 Variation-Aware Placement
    46.4.6 Low Power Placement
46.5 Routing
    46.5.1 Hierarchical Routing
    46.5.2 SAT-Based Routing
    46.5.3 Graph-Based Routing
    46.5.4 Low Power Routing
    46.5.5 Other Routing Methods
        46.5.5.1 Pipeline Routing
        46.5.5.2 Congestion-Driven Routing
        46.5.5.3 Statistical Timing Routing
References

This work is supported in part by the National Science Foundation under grant CCF-0347891.

46.1 INTRODUCTION

Computer-aided design (CAD) tools for field-programmable gate arrays (FPGAs) primarily emerged as extensions of their application-specific integrated circuit (ASIC) counterparts in the 1980s because of the relative maturity of the ASIC CAD tools at that time. Traditional logic optimization techniques, simulated-annealing-based placement algorithms, and maze routing methods were common in the FPGA world. But as FPGA architectures developed distinct features in terms of both logic and routing, FPGA CAD tools evolved into today's FPGA design flows, which are highly optimized for specific characteristics of FPGA devices. More specialized timing models, technology mapping solutions, and placement and routing strategies are needed to ensure high-quality mapping of circuits to FPGAs.

FIGURE 46.1 Typical FPGA flow. (Blocks shown: Design entry, RTL synthesis, Technology mapping, Placement, Routing, Configuration bitfile, Verification/simulation, Power analysis, Timing analysis.)

Figure 46.1 shows a common design flow for FPGA designs. The high-level description of the FPGA design is fed to a register transfer level (RTL) synthesis tool that performs technology-independent logic optimization. The synthesis tool might detect opportunities for utilizing special-purpose logic gates within the FPGA logic fabric. Examples are carry chains, high-fanin sum-of-product gates, and embedded multipliers (see Sections 45.3.1.2 and 45.3.2). The functional gates of the technology-independent optimized design are mapped to FPGA lookup tables (LUTs) (see Section 45.3.1), a process called technology mapping. Clustering of the LUTs is performed next (see Section 45.3.2). Placement and routing steps follow clustering. Floorplanning may or may not precede placement.

Each of these steps would use timing and power analysis engines to better optimize the design. Furthermore, the user might simulate or perform formal verification at various steps of the design cycle. If timing or power constraints are not met, the design flow might backtrack to a previous step. For example, if routing fails due to high congestion, then placement might be attempted again with different parameters.

The rest of the chapter is organized into four sections. FPGA-specific technology mapping and clustering algorithms are covered in Section 46.2. Sections 46.3 and 46.4 cover floorplanning and placement algorithms. We conclude the chapter by discussing routing algorithms in Section 46.5.
46.2 TECHNOLOGY MAPPING AND CLUSTERING

Technology mapping converts a logic circuit into a netlist of FPGA K-LUTs and their connections. A K-LUT is usually implemented as a K-input, one-output static random-access memory (SRAM) block. By writing the truth table of a Boolean function into the K-LUT, we can implement any function that has K or fewer inputs, regardless of the complexity of the function.

Neighboring LUTs can be clustered into local groups with dedicated fast routing resources to improve the delay of the circuit. Clustering algorithms are used to group together local LUTs to minimize connection delays. Later in the design flow, these clusters are used as input to the placement step. Some placement algorithms might never touch a cluster, but other placement methods (such as the ones presented in Section 46.4.3) might move individual logic blocks from one cluster to another to improve timing, power, etc.

Given the fact that technology mapping considering area and delay optimization is NP-hard, Cong and Minkovich [1] synthesize benchmarks with known optimal or upper-bound technology mapping solutions and test state-of-the-art FPGA synthesis algorithms to see how far these algorithms are from producing optimal solutions (a preliminary version of their work appeared in the FPGA 2007 conference). They show that current technology mapping solutions are close to optimal (between 3 and 22 percent away, see Table III in Ref. [1]), while logic optimization methods have much room for improvement. Although some argue that the generated benchmarks are artificial and do not reflect characteristics of large industry benchmarks, the work in Ref. [1] nevertheless gives us insights into what needs to be done to improve existing CAD algorithms.
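As a minimal sketch of the K-LUT model described at the start of this section (an SRAM that stores a truth table and is addressed by the K inputs), the following Python fragment programs a LUT from an arbitrary Boolean function and then evaluates it by lookup. The names program_lut and read_lut, and the convention that the first input is the most significant address bit, are assumptions made for illustration; they do not come from any FPGA vendor tool.

```python
from itertools import product


def program_lut(func, k):
    """Return the 2**k SRAM bits (truth table) of a k-input Boolean function.

    Entry i holds the output for the input combination whose binary
    encoding is i, with the first input as the most significant bit
    (an assumed convention).
    """
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]


def read_lut(table, inputs):
    """Evaluate the LUT: the input bits form the SRAM address."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return table[address]


# Example: a 3-input majority function stored in a 3-LUT.
def majority(a, b, c):
    return (a & b) | (a & c) | (b & c)


majority_table = program_lut(majority, 3)
```

The lookup cost does not depend on how complicated func is, which is the property the text relies on: any function of K or fewer inputs fits in one K-LUT.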
Our goal in the next two sections is to introduce basic technology mapping and clustering algorithms so that the reader can better understand placement and routing algorithms for FPGAs. Many great technology mapping algorithms (such as DAOmap [2], ABC [3], and the work by Mishchenko et al. [4]) are not discussed here.

46.2.1 TECHNOLOGY MAPPING

A major breakthrough in FPGA technology mapping came about in 1994 with the introduction of the FlowMap tool [5]. Library-based ASIC technology mapping (which maps a logic network to gates such as AND, OR, etc.) for depth minimization was known to be NP-hard, but Cong et al. proved that K-LUT technology mapping can be done in O(KVE) time, where V and E are the number of nodes (gates) and edges (wires) in the circuit, respectively.

The FlowMap algorithm traverses the circuit graph containing simple gates and their connections in a breadth-first search fashion and determines depth-optimal mappings of the fanin cones of the nodes as it progresses toward primary outputs. The fanin cone of a node is the set of all gates from the circuit primary inputs (input pads) to the node itself. The algorithm uses the notion of K-feasible cuts to find K-LUT mappings of a subcircuit.

Figure 46.2a shows an example subgraph in which a cut separates the nodes into disjoint sets X and X̄, where only three nodes in set X provide inputs to nodes in X̄, that is, the nodes that are drawn using thick lines. Cut (X, X̄) is said to be K-feasible for K ≥ 3. All the nodes in set X̄ can be mapped into one 3-LUT, which gets its input values from the LUTs that implement the three boundary nodes in X and their fanin cones.

The labels on the nodes in Figure 46.2a show the depth of the minimum-depth K-LUT mapping of the input cone of the node. The authors prove that for a node t, the minimum depth is either the maximum label l in X, or l + 1.∗ Consider an example graph for another circuit shown in Figure 46.2b.
FIGURE 46.2 FlowMap mapping steps: (a) three-feasible cut, (b) node labeling, (c) checking the feasibility of a mapping of depth 2, (d) dual graph. (From Cong, J. and Ding, Y., IEEE Trans. Comput. Aided Des. Integrated Circuits Syst., 13, 1, 1994. Copyright IEEE 1994. With permission.)

∗ If the new node t can be packed with the rest of the nodes with label l, then the depth of LUTs used in implementing the circuit up to this point would not increase. Otherwise, a new LUT with depth l + 1 has to be allocated to house the new node t.

In a breadth-first search traversal on subgraph N_t, when we get to node t, the question is whether we can pack t with nodes a, b, c (which have the maximum depth of l) in one K-LUT. We can create an auxiliary graph N′_t (shown in Figure 46.2c; note that nodes with labels correspond to their counterpart nodes in Figure 46.2b with the same labels), which replaces a, b, c, t with one node t′, and see if t′ (and possibly other nodes) can be packed in one K-LUT. Node t′ can be mapped to a K-LUT if we can find a cut (X, X̄) where t′ ∈ X̄ and at most K nodes in X provide input to nodes in X̄.

Network flow algorithms can be used to answer this question. We can model one LUT in the fanin cone as a flow of one unit, and look for a maximum flow of K units at the sink node. If the maximum flow is at most K, it means that we have found a cut with at most K LUTs as inputs, and anything below the cut can be packed into a K-LUT. More details are provided next.
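The feasibility question above can be sketched with a small unit-augmenting maximum-flow search. This is not the FlowMap implementation; it is an illustrative check under stated assumptions: the collapsed network is given as a list of (driver, load) edges, every node other than the sink is given unit capacity by splitting it into an "in" half and an "out" half, and the names has_k_feasible_cut and SRC, as well as the example circuit, are hypothetical.

```python
from collections import defaultdict, deque

INF = float("inf")
SRC = ("SRC", "out")  # artificial super-source (assumed not to clash with a node name)


def has_k_feasible_cut(edges, primary_inputs, sink, k):
    """Return True if at most k nodes need to feed the LUT that contains `sink`.

    Each node except the sink is split into (n, "in") -> (n, "out") with
    capacity 1; wires get infinite capacity. A maximum flow of at most k
    then corresponds to a cut crossed by at most k nodes.
    """
    cap = defaultdict(lambda: defaultdict(int))

    def add_edge(u, v, c):
        cap[u][v] += c
        cap[v][u] += 0  # make sure the residual (reverse) edge exists

    nodes = set(primary_inputs) | {sink}
    for u, v in edges:
        nodes.update((u, v))
    dst = (sink, "in")
    for n in nodes:
        if n != sink:
            add_edge((n, "in"), (n, "out"), 1)
    for u, v in edges:
        add_edge((u, "out"), (v, "in"), INF)
    for p in primary_inputs:
        add_edge(SRC, (p, "in"), INF)

    def find_path():
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {SRC: None}
        queue = deque([SRC])
        while queue:
            u = queue.popleft()
            if u == dst:
                return parent
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return None

    flow = 0
    while flow <= k:
        parent = find_path()
        if parent is None:
            return True  # maximum flow reached and it is at most k
        v = dst
        while parent[v] is not None:  # push one unit along the path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1
    return False  # more than k units fit, so no k-feasible cut exists


# Hypothetical example: t is driven by g1 = f(a, b) and g2 = f(c, d).
example_edges = [("a", "g1"), ("b", "g1"), ("c", "g2"), ("d", "g2"),
                 ("g1", "t"), ("g2", "t")]
example_inputs = ["a", "b", "c", "d"]
```

Stopping as soon as the flow exceeds k keeps the search short, since each augmentation pushes exactly one unit and at most k + 1 augmentations are ever attempted.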
Subgraph N′_t can be transformed to a dual graph N″_t (Figure 46.2d) in which each node y is replaced by two nodes y_i and y_o that are connected by an edge of weight 1. An edge (y, z) in N′_t corresponds to edge (y_o, z_i) in N″_t with an infinite edge weight. If the maximum flow in N″_t is at most K units, then at most K nodes in X provide inputs to nodes in X̄, which means node t in the original N_t graph can indeed be packed with other nodes with the maximum label. The authors introduce variations on the original technology mapping algorithm to minimize area as a secondary objective.

46.2.2 CLUSTERING

Today's FPGAs cluster LUTs into groups and provide fast routing resources for intracluster connections. When two LUTs are assigned to one cluster, their connections can use the fast routing resources within the cluster, and hence reduce the delay on the connection. On the other hand, if two LUTs are in two separate clusters, they have to use intercluster routing resources, which are scarcer and more costly in terms of delay. Placement and routing algorithms are needed to balance the usage of intercluster routing resources (see Sections 46.4 and 46.5).

Many clustering algorithms were introduced in the past decade. Most work by first selecting a seed and then choosing LUTs to cluster with the seed. The difference between various clustering algorithms is in their criteria for choosing the seed node and the way other nodes are chosen to be absorbed by the seed.

The clustering algorithm used in the popular versatile placement and routing (VPR) tool [6] is called T-VPack [7], which is an extension of the earlier packing algorithm VPack. VPack chooses LUTs with a high number of input connections as initial seeds for clusters. The criterion for packing a node B into a cluster C is the attraction of the node, defined as the number of nets that are shared between node B and nodes inside C.
The more sharing there is between nodes within a cluster, the less routing demand is needed to connect clusters. T-VPack is the timing-driven version of VPack and extends the definition of the attraction of a node to include timing criticality of nets connecting the node to those packed into the cluster. Timing criticality of a net i is defined as 1 − [slack(i)/MaxSlack]. If two nodes have equal net criticality values connecting them to nodes packed into a cluster, then the one through which more critical paths pass is chosen to be packed into the cluster first. The results in Ref. [7] show that clusters of size 7–10 provide the best area/delay tradeoff.

Clustering algorithms such as RPack [8] and the work by Singh et al. [9] improve routability of the clustered circuit by introducing absorption costs that try to weigh nodes based on how promising they are in absorbing more nets into the cluster. The authors in Ref. [9] define the connectivity factor (c) of a LUT x as c(x) = separation(x)/degree(x)^2, where the separation of a LUT is the number of LUTs adjacent to it. Figure 46.3a shows node A with a separation value of 18, degree of 4, and connectivity of 1.125. Figure 46.3b shows node B with the same degree as A, but with a smaller separation and hence smaller connectivity. Node A cannot absorb any nets if one node from each net is clustered into the same cluster as A. On the other hand, node B can absorb all the nets shown in Figure 46.3 by including one node from each net in its cluster.

The selection of the seed node in the work of Singh et al. is done by lexicographically sorting nodes by their (degree, −connectivity) values and choosing the ones with highest values as initial seeds (T-VPack used only the degree values).
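The quantities defined above can be written out in a few lines of Python. This is a sketch under stated assumptions rather than the code of T-VPack or of Ref. [9]: the netlist is a plain dictionary from net name to the LUTs on that net, degree is taken to be the number of nets touching a LUT, separation is taken literally from the text as the number of distinct LUTs sharing a net with it, and all function names and the toy netlist are hypothetical.

```python
def criticality(slack, max_slack):
    """Timing criticality of a net: 1 - slack / MaxSlack."""
    return 1.0 - slack / max_slack


def degree(lut, nets):
    """Number of nets that touch the LUT (assumed meaning of degree)."""
    return sum(1 for terminals in nets.values() if lut in terminals)


def separation(lut, nets):
    """Number of distinct LUTs adjacent to the LUT through any net."""
    neighbors = set()
    for terminals in nets.values():
        if lut in terminals:
            neighbors.update(terminals)
    neighbors.discard(lut)
    return len(neighbors)


def connectivity(lut, nets):
    """Connectivity factor c(x) = separation(x) / degree(x)^2."""
    return separation(lut, nets) / degree(lut, nets) ** 2


def seed_order(nets):
    """Sort LUTs by (degree, -connectivity), highest first."""
    luts = {lut for terminals in nets.values() for lut in terminals}
    return sorted(luts,
                  key=lambda x: (degree(x, nets), -connectivity(x, nets)),
                  reverse=True)


# Toy netlist: A sits on four large nets (18 neighbors in total),
# B sits on four two-terminal nets (4 neighbors).
toy_nets = {}
for i, fanout in enumerate([4, 4, 5, 5]):
    toy_nets[f"a{i}"] = ["A"] + [f"x{i}_{j}" for j in range(fanout)]
for i in range(4):
    toy_nets[f"b{i}"] = ["B", f"y{i}"]
```

With this netlist, A reproduces the numbers quoted for Figure 46.3a (separation 18, degree 4, connectivity 1.125), and B, having the same degree but lower connectivity, is ranked ahead of A as a seed.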
FIGURE 46.3 Examples illustrating the usefulness of the connectivity factor: (a) node A, number of nets absorbed = 0; (b) node B, number of nets absorbed = 4. (Based on Singh, A. and Marek-Sadowska, M., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 59–66, 2002. With permission.)

Nodes are greedily packed into seed clusters based on how many nets they absorb, with higher priority given to the nets with fewer terminals. To guarantee spatial uniformity of the clustered netlist, the authors limit the number of available pins to a cluster so that the number of logic blocks inside a cluster and the number of connections to the nodes within the cluster follow Rent's rule. Doing so effectively depopulates clusters to reduce overall intercluster routing demands. Such strategies are in line with what DeHon's study [10] on routing requirements of FPGA circuits suggested. Because interconnect resources (switches and buffers) consume most of the silicon area of an FPGA (80–90 percent), sometimes it is beneficial to underutilize clusters to reduce routing demand in congested regions of the FPGA array.

46.3 FLOORPLANNING

Floorplanning is used on FPGAs to speed up the placement process or to place hard macros with prespecified shapes. The traditional FPGA floorplanning problem is discussed in Section 46.3.1. Another class of floorplanning algorithms for FPGAs deals with heterogeneous resource types. An example of this approach is the work by Cheng and Wong [11], to be covered in Section 46.3.2. A third class of floorplanning for FPGAs addresses dynamically reconfigurable systems in which modules are added or removed at runtime, requiring fast, on-the-fly modification of the floorplan. These approaches are discussed in Section 46.3.3.
46.3.1 HIERARCHICAL METHODS

Sankar and Rose [12] first use a bottom-up clustering method to build larger clusters out of logic blocks (refer to Section 46.2.2). Then they use a hierarchical simulated annealing algorithm to speed up the placement compared to a flat annealing methodology. They show trade-offs between placement runtime and quality.

While clustering the circuit into larger subcircuits, they limit the shape and size of the clusters to prespecified values. The leaves of the clustering tree are the logic blocks, and the first level of the tree consists of nodes that combine exactly two leaves. All level-one nodes will be placed in 1 × 2 regions, that is, on two adjacent clusters in the same row. The next level of hierarchy clusters two level-one clusters, which will be placed as 2 × 2 squares. Figure 46.4 shows the clustering and placement conceptually. Such restrictions on the clustering and placement steps limit the ability of the algorithms to search a larger solution space compared to an unrestricted version of the problem, but on the other hand relieve the algorithm designers of dealing with the sizing problem during the floorplanning process, described in Section 9.4.1.

The work by Emmert and Bhatia [13] also starts by clustering the logic elements into larger subcircuits. The input to their flow is a list of macros, each macro being either a logic block or a set of logic blocks with a list of predefined shapes. An example of a macro is a multiplier with two shape options, one for minimum area, the other for minimum delay.