Handbook of algorithms for physical design automation part 100 docx

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C046 Finals Page 972 9-10-2008 #17

move to other CLBs, overcrowding them instead. This process gradually moves nodes from overcrowded regions to empty regions. They take care not to cause thrashing, in which LUTs are moved back and forth between two clusters. Thrashing can be avoided by keeping a history of the violations of each CLB: if thrashing has been occurring for a few moves, the relative cost of both CLBs involved is increased, so that the extra LUT or register is moved to a third CLB.

46.4.4 LINEAR DATAPATH PLACEMENT

Callahan et al. [29] presented GAMA, a linear-time simultaneous placement and mapping method for LUT-based FPGAs. They focus only on datapaths that are composed of arrays of bitslices. The basic idea is to preserve the datapath structure so that the problem size can be reduced by looking primarily at one bitslice of the datapath. Once a bitslice is mapped and placed, the other bitslices of the datapath can be mapped and placed similarly on rows above or below the initial bitslice.

One of the goals in developing GAMA was to perform mapping and placement with little computational effort. To achieve linear time complexity, the authors limit the search space by considering only a subset of solutions, which means they might not produce an optimal solution. Because optimal mapping of directed acyclic graphs (DAGs) is NP-complete, GAMA first splits the circuit graph into a forest of trees before processing it in the mapping and placement steps. The tree covering algorithm does not directly handle cycles or nodes with multiple fanouts, and might duplicate nodes to reduce the number of trees. Each tree is compared to elements from a preexisting pattern library that contains compound modules such as the one shown in Figure 46.12.
Dynamic programming is used to find the best cover in linear time. After the tree covering process, a postprocessing step looks for opportunities for local optimization at the boundaries of the covered trees. Interested readers are referred to Ref. [29] for more details on the mapping process of GAMA.

FIGURE 46.12 Example of a pattern in the tree covering library. (Figure labels: pattern in library; library pattern found in circuit graph.) (Based on Callahan, T. J. et al., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 123–132, 1998. With permission.)

Because the modules will form a bitslice datapath layout, the placement problem translates into finding a linear ordering of the modules in the datapath. Wirelength minimization is the primary goal during linear placement. The authors assume that the output of every module is available at its right boundary. A tree is placed by recursively placing its left and right subtrees, and then placing the root node to the right of the subtrees. The two subtrees are placed next to each other. Figure 46.13 shows an example of a tree placement. Because subtree t2 is wider, placing it to the right of subtree t1 results in longer wirelength: the output of t1 must then cross the full width of t2 to reach the root. Because the number of fanin nodes to the root of the tree is bounded, an exhaustive search for the right placement order of the subtrees is reasonable and still results in a linear-time algorithm.

FIGURE 46.13 Tree placement example, panels (a) and (b), showing the two orderings of subtrees t1 and t2. (Based on Callahan, T. J. et al., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 123–132, 1998. With permission.)

In addition to the local placement algorithm, Callahan et al. also attempt some global optimizations. The linear placement algorithm arranges modules within a tree, but all trees in the circuit must also be globally placed. A greedy algorithm is used to place trees next to each other so that the length of the critical path in the circuit is minimized. Furthermore, after global and local placement is accomplished, individual modules are moved across tree boundaries to further optimize the placement.

Ababei and Bazargan [30] proposed a linear placement methodology for datapaths in a dynamically reconfigurable system in which datapaths corresponding to different basic blocks∗ in a program are loaded, overwritten, and possibly reloaded on linear strips of an FPGA. They assume that the FPGA chip is divided into strips as shown in Figure 46.14. An expression tree corresponding to the computations in a basic block is placed entirely in one strip, reading its input values from the memory blocks on either side of the strip and writing the output of the expression to one of these memory blocks. Depending on how frequently basic blocks are loaded and reloaded, three placement algorithms are developed:

1. Static placement: This case is similar to the problem considered by Callahan et al. [29], that is, each expression tree is given an empty FPGA strip to be placed on. The solution proposed by Ababei and Bazargan tries to minimize critical path delay, congestion, and wirelength using a matrix bandwidth minimization formulation. The matrix bandwidth minimization algorithm is covered in Section 47.3.2.1.
2. Dynamic placement with no module reuse: In this scenario, we assume that multiple basic blocks can be mapped to the same strip, either because a number of them run in parallel, or because there is a good chance that a mapped basic block will be invoked again in the future. The goal is to place the modules of a new expression in the empty regions between the modules of previous basic blocks, leaving the previously placed modules and their connections intact. As a result, the placement of the new basic block becomes a linear, noncontiguous placement problem in which the modules from previous basic blocks act as blockages.
3. Dynamic placement with module reuse: This scenario is similar to the previous one, except that we try to reuse a few modules and connections left over by previous basic blocks that are no longer active. Doing so saves reconfiguration time and results in better usage of the FPGA real estate. Finding the largest common subgraph between the old and the new expression trees helps maximize the reuse of the modules that are already placed.

FIGURE 46.14 FPGA divided into linear strips. (Figure labels: Strip 1 to Strip 4, I/O, M.)

∗ A basic block is a sequence of code, for example, written in the C language, with no jumps or function calls. A basic block, usually the body of a loop with many iterations, could be mapped to a coprocessor such as an attached FPGA to perform computations faster. Data used by the basic block should be made accessible to the coprocessor, and the output of the computations should in turn be made accessible to the processor. This could be achieved either by streaming data from the processor to the FPGA and vice versa, or by providing direct memory access to the FPGA.

The authors propose a greedy solution for the second problem, that is, dynamic placement without module reuse. The algorithm works directly on expression trees. Modules are rank-ordered based on parameters such as the volume (sum of module widths) of their children subtrees and the latest arrival time on the critical path. The ordering of the nodes determines the linear order in which they should be placed on the noncontiguous space.
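The last step above, placing an already ordered list of modules into the free space left between blockages, can be illustrated with a small sketch. This is not the authors' algorithm: the rank-ordering is taken as given, and the function name `place_in_gaps`, the interval representation, and the first-fit rule are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

def place_in_gaps(modules: List[Tuple[str, int]],
                  gaps: List[Tuple[int, int]]) -> Dict[str, int]:
    """Place modules left to right, in the given order, into free intervals.

    modules: (name, width) pairs, already rank-ordered (hypothetical input).
    gaps:    half-open free intervals [start, end), sorted and non-overlapping;
             everything outside them is occupied by earlier basic blocks.
    Returns the left x-coordinate chosen for each module.
    """
    positions: Dict[str, int] = {}
    cursor = 0  # leftmost x still usable, so the linear order is preserved
    for name, width in modules:
        for start, end in gaps:
            x = max(start, cursor)
            if x + width <= end:  # the module fits entirely inside this gap
                positions[name] = x
                cursor = x + width
                break
        else:
            raise ValueError(f"no free gap can hold module {name!r}")
    return positions

# Hypothetical strip: columns 3-4 and 6-7 are blocked by older modules.
strip_gaps = [(0, 3), (5, 6), (8, 14)]
ordered = [("mul", 2), ("add", 2), ("sub", 1), ("cmp", 3)]
print(place_in_gaps(ordered, strip_gaps))
```

Note that `sub` would fit in the gap at column 5, but the sketch skips it to keep the modules in their ranked left-to-right order; a real placer would weigh such trade-offs against wirelength and delay.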
To solve the third problem, that is, dynamic placement with module reuse, a linear ordering of modules is first obtained using the previous two algorithms to minimize wirelength, congestion, and critical path delay. Then a maximum matching between the existing inactive modules and the linear ordering is sought such that the maximum number of modules is reused while perturbations to the linear ordering are kept to a minimum. The algorithm is then extended to general graphs, not just trees. To achieve better reuse, a maximum common subgraph problem is solved to find the largest subset of modules and connections shared by the expression graphs that are already placed and that of the new basic block.

46.4.5 VARIATION-AWARE PLACEMENT

Hutton et al. proposed the first statistical timing analysis placement method for FPGAs [31]. They consider both inter- and intradie process variations in their modeling, but do not model spatial correlations among within-die variables. In other words, local variations are modeled as independent random variables.∗ In Ref. [31], the delay of a circuit element is modeled as a Gaussian variable that is a function of $V_t$ and $L_{\mathrm{eff}}$, each of which is broken into its global (systematic) and local (random) components. Block-based statistical timing analysis [33] is used to compute the timing criticality of nodes, which is used instead of TVPR's timing-cost component (see Equations 46.3 and 46.5). SSTA (statistical static timing analysis) is performed only once at each annealing temperature, not at every move.

In their experiments, they compare their statistical timing-based placement to TVPR, and consider the effect of guard-banding and speed-binning. Guard-banding is achieved by adding $k\sigma$ to the delay of every element, where $k$ is a user-defined factor such as 3, 4, or 5, and $\sigma$ is the standard deviation of the element's delay.
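To see why per-element guard-banding tends to be pessimistic compared with a statistical treatment, the sketch below compares the two on a single path. It assumes element delays are independent Gaussians (matching the independence assumption on local variations above) and ignores the global component and the statistical maximum over paths; the function names and the example numbers are illustrative only.

```python
import math
from typing import List, Tuple

Element = Tuple[float, float]  # (mean delay, standard deviation)

def guard_banded_delay(path: List[Element], k: float) -> float:
    """Path delay when k*sigma is added to every element separately."""
    return sum(mu + k * sigma for mu, sigma in path)

def statistical_delay(path: List[Element], k: float) -> float:
    """k-sigma point of the path delay for independent Gaussian elements.

    Means add, and variances (not standard deviations) add.
    """
    mean = sum(mu for mu, _ in path)
    sigma = math.sqrt(sum(s * s for _, s in path))
    return mean + k * sigma

# Hypothetical path of four identical elements, evaluated at k = 3.
path = [(1.0, 0.1)] * 4
print(guard_banded_delay(path, 3))  # about 5.2
print(statistical_delay(path, 3))   # about 4.6
```

For the same confidence factor, the guard-banded figure grows with the number of elements while the statistical one grows only with the square root of the summed variances.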
Timing yield considering speed-binning is computed during Monte Carlo simulations by assuming that chips are divided into bins with fast, medium, and slow critical path delays. Their statistical placement shows yield improvements over TVPR in almost all combinations of guard-banding and speed-binning scenarios. In a follow-up work, Lin and He [34] show that combining statistical physical synthesis, statistical placement, and statistical routing results in significant yield improvements (from 50 failed chips per 10,000 chips to 5 failed chips in their experimental setup).

∗ Cheng et al. [32] show that by ignoring spatial correlations, we lose at least 14 percent in the accuracy of the estimated delay. The error in delay estimation accuracy is defined as the integration of the absolute error between the distributions obtained through Monte Carlo simulations and through statistical sum and maximum computations of the circuit delay. See Section 46.3 of Ref. [32] for more details.

Cheng et al. [32] propose a placement method that tailors the placement to individual chips after the variation map for every chip is obtained. This is a preliminary work that tries to answer the following question: given the exact map of FPGA element delays, how much improvement can be gained by adapting the placement to individual chips? They show about 5.3 percent improvement on average in their experimental setup, although they do not address how the device parameter maps can be obtained in practice.

46.4.6 LOW POWER PLACEMENT

Low power FPGA placement and routing methods try to assign noncritical elements to low power resources on the FPGA. There have been many recent works targeting FPGA power minimization.
We will only focus on two efforts: one deals with the placement problem [35], and the other addresses dual voltage assignment to routes [36]. The latter is discussed in Section 46.5.4.

The authors in Ref. [35] consider an architecture that is divided into physical regions, each of which can be independently power gated. To enable leakage power savings, designers must look into two issues carefully:

1. Region granularity: They should determine the best granularity of the power gating regions. Too small a region would have high circuit overheads, both in terms of sleep transistors and the configuration bits that must control them. On the other hand, a finer granularity gives more control over which logic units can be shut down and could potentially harness more leakage savings.
2. Placement strategies: CAD developers should adopt placement strategies that constrain logic blocks with similar activity to the same regions. If all logic blocks placed in one region are going to be inactive for a long period of time, then the whole region can be power gated. However, architectural properties of the FPGA influence the effectiveness of the placement strategy. For example, if the FPGA architecture has carry chains that run in the vertical direction, then the placement algorithm must place modules in regions that are vertically aligned. Not doing so could harm performance significantly.

By constraining the placement of modules with similar power activity, we can achieve two goals: power gate unused logic permanently, and power gate inactive modules for the duration of their inactive period. In their experiments, they consider various sizes of the power gating regions and also look into dynamic versus static powering down of unused/idle regions.

46.5 ROUTING

Versatile placement and routing (VPR) [6] uses Dijkstra's algorithm (i.e., a maze router) to connect the terminals of a net.
Its router is based on the negotiation-based algorithm PathFinder [37]. PathFinder first routes all nets independently using the shortest route for each path. As a result, some routing regions become overcongested. Then, in an iterative process, nets are ripped up and rerouted to alleviate congestion. Nets that are not timing-critical take detours away from the congested regions, and nets that are timing-critical are likely to take the same route as in round one.

There is a possibility that two routing channels show a thrashing effect, that is, nets are ripped up from one channel and rerouted through the other, and then in the next iteration are ripped up from the second and rerouted through the first. To avoid this, VPR uses a history term: it not only penalizes routing through a currently congested region, but also uses the congestion data from the recent history. The congestion of a channel is thus defined as its current resource (over-)usage plus a weighted sum of the congestion values from previous routing iterations.

FIGURE 46.15 Local expansion of the wavefront. (Figure labels: reexpand around new wire; expansion wavefront.) (Based on Betz, V. and Rose, J., Field-Programmable Logic and Applications (W. Luk, P. Y. Cheung, and M. Glesner, eds.), pp. 213–222, Springer-Verlag, Berlin, Germany, 1997. With permission.)

To route a multiterminal net, VPR uses the maze routing algorithm described in Chapter 23. After connecting two terminals of a k-terminal net, VPR's maze router starts a wave from all points on the wire connecting the two terminals. The wave is propagated until the next terminal is reached. The process is repeated k − 1 times.
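The multiterminal procedure just described can be sketched on a small grid. This is a simplified illustration, not VPR's router: it assumes a unit-cost grid and a plain breadth-first wave (VPR uses cost-driven expansion on a routing-resource graph), and it restarts the wave from the whole tree on every iteration. The name `route_net` and the grid model are assumptions.

```python
from collections import deque
from typing import AbstractSet, Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]

def route_net(width: int, height: int, terminals: List[Cell],
              blocked: AbstractSet[Cell] = frozenset()) -> Set[Cell]:
    """Connect all terminals on a unit-cost grid; return the routing-tree cells."""
    tree: Set[Cell] = {terminals[0]}
    remaining = set(terminals[1:]) - tree
    while remaining:
        # Start the wave from every cell that is already part of the tree.
        parent: Dict[Cell, Optional[Cell]] = {c: None for c in tree}
        frontier = deque(tree)
        reached: Optional[Cell] = None
        while frontier and reached is None:
            x, y = frontier.popleft()
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                nx, ny = nxt
                if not (0 <= nx < width and 0 <= ny < height):
                    continue
                if nxt in blocked or nxt in parent:
                    continue
                parent[nxt] = (x, y)
                if nxt in remaining:  # the wave has hit the next terminal
                    reached = nxt
                    break
                frontier.append(nxt)
        if reached is None:
            raise ValueError("a terminal is unreachable")
        # Trace the new branch back until it meets the existing tree.
        cell = reached
        while cell not in tree:
            tree.add(cell)
            cell = parent[cell]
        remaining.discard(reached)
    return tree

# Three-terminal net on a 5x5 grid: the wave loop runs k - 1 = 2 times.
print(sorted(route_net(5, 5, [(0, 0), (4, 0), (2, 3)])))
```

Each pass of the outer loop attaches one more terminal, so a k-terminal net needs k − 1 wave expansions, as stated above.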
When a new terminal is reached, instead of restarting the wave from the new wiring tree from scratch, the maze routing algorithm starts a local wave from the new branch of wire that connected the new terminal to the rest of the tree. When the wavefront of the local wave gets as far out as the previous wavefront, the two waves are merged and expanded until a new terminal is reached. Figure 46.15 illustrates the process.

46.5.1 HIERARCHICAL ROUTING

Chang et al. propose a hierarchical routing method for island-style FPGAs with segmented routing architecture in Ref. [38] (Section 45.4.1). Because nets are routed simultaneously, the net-ordering problem at the detailed routing level is not an issue; in fact, global routing and detailed routing are performed at the same time in this approach. They model timing in their formulation as well, and estimate the delay of a route as the number of programmable switches that it has to go through. This is a reasonable estimate because the delay of the switch points is much larger than that of the routing wires in a typical FPGA architecture.

Each channel is divided into a number of subchannels, each subchannel corresponding to the set of segments of the same length within that channel. After minimum spanning routing trees are generated, delay bounds are assigned to segments of the route, and then the problems of channel assignment and delay bound recalculation are solved hierarchically. Figure 46.16 shows an example of a hierarchical routing step, in which connection i is generated by a minimum spanning tree algorithm. The problem is divided into two subproblems, one containing pin1 and the other containing pin2. The cutline between the two regions contains a number of horizontal subchannels. The algorithm tries to decide on the subchannel through which this net is going to be routed. Once the subchannel is decided (see the right part of Figure 46.16), the routing problem can be broken into two smaller subproblems.
While dividing the problem into smaller subproblems, the algorithm keeps updating the delay bounds on the nets and keeps an eye on the congestion. To decide which subchannel j to use to route a routing segment i, the following cost function is used:

$$C_{ij} = C^{(1)}_{ij} + C^{(2)}_{ij} + C^{(3)}_{ij} \tag{46.6}$$

where $C^{(1)}_{ij}$ is zero if connection i can reach subchannel j, and ∞ otherwise. Reachability can be determined by a breadth-first search on the connectivity graph.

FIGURE 46.16 Delay bound redistribution after a hierarchical routing step. (Figure labels: pin1, pin2, connection i, cutline, Region 1, Region 2, subchannels ch1 to ch4, subchannel j, $l_{i1}$, $l_{i2}$, after assignment.) (Based on Chang, Y.-W. et al., ACM Transactions on Design Automation of Electronic Systems, 5, 433–450, 2000. With permission.)

The second term intends to utilize the routing segments evenly according to the connection length and its delay bound:

$$C^{(2)}_{ij} = a \left| \frac{l_i}{U_i} - L_j \right| \tag{46.7}$$

where
$l_i$ is the Manhattan distance of connection i
$U_i$ is the delay bound of the connection
$L_j$ is the length of the routing segments in subchannel j
$a > 0$ is a constant

This term tries to maximize routing resource efficiency. For example, if a net has a delay bound $U_i = 4$ and Manhattan distance $l_i = 8$, it can be routed through four switches, which means that the ideal routing resource length for this connection is 8/4 = 2. For a subchannel that contains routing segments of length 2, the cost function evaluates to zero; that is, a segment length of 2 is ideal for routing this net.
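The first two cost terms can be written out directly as a quick check of the arithmetic. This is a minimal sketch of Equations 46.6 and 46.7 only: the third term is passed in as a plain number because its shape is given only graphically, the constant a is set to 1 by assumption, and the function names are illustrative.

```python
import math

def length_match_cost(l_i: float, U_i: float, L_j: float, a: float = 1.0) -> float:
    """Second term, Equation 46.7: a * |l_i / U_i - L_j|."""
    return a * abs(l_i / U_i - L_j)

def subchannel_cost(reachable: bool, l_i: float, U_i: float, L_j: float,
                    c3: float = 0.0, a: float = 1.0) -> float:
    """Equation 46.6, with the third term supplied by the caller (assumption)."""
    c1 = 0.0 if reachable else math.inf  # first term: reachability
    return c1 + length_match_cost(l_i, U_i, L_j, a) + c3

# Connection with Manhattan distance 8 and delay bound 4 (ideal length 2).
for L_j in (1, 2, 6):
    print(L_j, subchannel_cost(True, 8, 4, L_j))
# prints: 1 1.0, then 2 0.0, then 6 4.0
```

An unreachable subchannel gets infinite cost regardless of the other terms, so taking the minimum over candidate subchannels never selects it.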
On the other hand, if a subchannel with segment length 6 is considered, the cost function evaluates to 4 (for a = 1), which means that using segments of length 6 might be overkill for this net: its slack is high, and length-6 routing resources need not be wasted on it.

Cost component $C^{(3)}_{ij}$ in Equation 46.6 is shown in Figure 46.17. Figure 46.17b shows a typical nontiming-driven routing cost, and Figure 46.17a shows the cost function used in Ref. [38]. The basic idea is to assign a lower cost to routes that are likely to use fewer bends. For example, in Figure 46.17a, if subchannel s3 is chosen, then chances are that when the subproblem of routing from a pin to s3 is being solved, more bends are introduced between the pin and s3. On the other hand, routing the net through s1 or s5 guarantees that the route from the subchannel to at least one pin uses no bends. Note that the cost of routing outside the bounding box of the net increases linearly to discourage detours, which in turn hurt the delay of a net.

FIGURE 46.17 Cost function. (Figure labels: panels (a) and (b); subchannels s1 to s5; cutline; pins; cost $C^{(3)}_{ij}$ versus the x-coordinate of the subchannel used for routing; typical nontiming cost.) (Based on Chang, Y.-W. et al., ACM Transactions on Design Automation of Electronic Systems, 5, 433–450, 2000. With permission.)

After a net is divided into two subnets, the delay bound of the net is distributed between the two subnets in proportion to their lengths. So, for example, in Figure 46.16, if the original delay bound of connection i was $U_i$, then $U_{i1} = [l_{i1}/(l_{i1} + l_{i2})] \times U_i$ and $U_{i2} = [l_{i2}/(l_{i1} + l_{i2})] \times U_i$.

46.5.2 SAT-BASED ROUTING

Recent advances in SAT (satisfiability problem) solvers have encouraged researchers to formulate various problems as SAT problems and utilize the efficiency of these solvers. Nam et al. [39] formulated detailed routing on a fully segmented routing architecture (i.e., all routing segments are of length 1) as a SAT problem. The basic idea is shown in Figure 46.18. Figure 46.18a shows an instance of a global routing problem that includes three nets, A, B, and C, and an FPGA with a channel width of three tracks. Figure 46.18b shows possible solutions for the routing of net A.

FIGURE 46.18 SAT formulation of a detailed routing problem: (a) global routing example; (b) possible solutions for net A. (From Nam, G.-J., Sakallah, K. A., and Rutenbar, R. A., IEEE Trans. Comput. Aided Des. Integrated Circuits Syst., 21, 674, 2002. With permission.)

In a SAT problem, constraints are written in the form of conjunctive normal form (CNF) clauses. The CNF formulation of the constraints on net A is shown in Equation 46.8, where AH, BH, and CH are integer variables giving the horizontal track numbers assigned to nets A, B, and C, respectively. AV is the vertical track number assigned to net A. The conditions on the first line enforce that a unique track number is assigned to A, the second line ensures that the switchbox constraints are met (here it is assumed that a subset switchbox is used), and the third line enforces that a valid track number is assigned to the vertical segment of net A. These conditions state the connectivity constraints for net A.
$$
\begin{aligned}
\mathrm{Conn}(A) ={} & [(AH \equiv 0) \lor (AH \equiv 1) \lor (AH \equiv 2)] \\
\land{} & [(AH = AV)] \\
\land{} & [(AV \equiv 0) \lor (AV \equiv 1) \lor (AV \equiv 2)]
\end{aligned}
\tag{46.8}
$$

To ensure that different nets do not share the same track number in a channel (exclusivity constraint), conditions like Equation 46.9 must be added to the problem:

$$\mathrm{Excl}(H1) = (AH \neq BH) \land (AH \neq CH) \tag{46.9}$$

where H1 refers to the horizontal channel shown in Figure 46.18a. The routability problem of the example of Figure 46.18a can be formulated as in Equation 46.10:

$$\mathrm{Routable}(X) = \mathrm{Conn}(A) \land \mathrm{Conn}(B) \land \mathrm{Conn}(C) \land \mathrm{Excl}(H1) \tag{46.10}$$

where X is a vector of the track variables AH, BH, CH, AV, BV, and CV. If Routable(X) is satisfiable, then a routing solution exists and can be derived from the values returned by the SAT solver. The authors extend the model so that doglegs can be defined too. Interested readers are referred to Ref. [39] for details.

Even though detailed routing can be elegantly formulated as a SAT problem, in practice its application is limited. If a solution does not exist (i.e., when there are not enough tracks), the SAT solver would take a long time exploring all track assignment possibilities before returning a negative answer, that is, that Routable(X) is not satisfiable. Furthermore, even if a solution exists but the routing instance is difficult (e.g., when there are barely enough routing tracks to route the given problem instance), the SAT solver might take a long time. In practice, the SAT solver can be terminated if the time spent on the problem exceeds a prespecified limit. Such a timeout could mean either that the problem instance is difficult or that no routing solution exists for the given number of tracks.
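To make the formulation concrete, the sketch below checks Routable(X) for the three-net example by exhaustive enumeration rather than with a real SAT solver. Only Conn(A) and Excl(H1) are given explicitly in the text; treating Conn(B) and Conn(C) as having the same form as Conn(A) is an assumption, as is the function name `routable`.

```python
from itertools import product
from typing import Dict, Optional

def routable(num_tracks: int) -> Optional[Dict[str, int]]:
    """Return one satisfying track assignment, or None if none exists.

    Enumerating range(num_tracks) enforces the valid-track-number clauses.
    """
    tracks = range(num_tracks)
    for AH, AV, BH, BV, CH, CV in product(tracks, repeat=6):
        # Subset switchbox: horizontal and vertical segments of a net share
        # a track number. Conn(B) and Conn(C) are assumed analogous to Conn(A).
        conn = (AH == AV) and (BH == BV) and (CH == CV)
        # Exclusivity in horizontal channel H1, as in Equation 46.9.
        excl_h1 = (AH != BH) and (AH != CH)
        if conn and excl_h1:
            return {"AH": AH, "AV": AV, "BH": BH, "BV": BV, "CH": CH, "CV": CV}
    return None

print(routable(3))  # a satisfying assignment exists with three tracks
print(routable(1))  # None: one track cannot separate A from B and C
```

The unsatisfiable case visits every one of the candidate assignments before giving up, which is a small-scale picture of why proving unroutability is the expensive direction.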
46.5.3 GRAPH-BASED ROUTING

The FPGA global routing problem can be modeled as a graph matching problem in which branches of a routing tree are assigned (matched) to sets of routing segments in a multisegment architecture to estimate the number of channels required for detailed routing. Lin et al. propose a graph-based routing method in Ref. [40]. The input to the problem is a set of globally routed nets. The goal is to assign each straight segment of each net to a track in the channel in which it is globally routed, so that a lower bound on the required number of tracks is obtained for each channel. Interactions between channels are ignored in this work; as a result, the bound on the number of tracks needed for each channel is calculated in isolation. The actual number of tracks needed for the whole design might be larger, depending on the switchbox architecture and the way horizontal and vertical channels interact.

They model the track assignment problem within one channel as a weighted matching problem. Straight segments of nets are called subnets (e.g., a net routed in the shape of an "L" is divided into two subnets). Within a channel, subnets belonging to a maximum clique C of overlapping subnets∗ are assigned to tracks from a set of tracks H using a bipartite graph matching problem. Members of set C form the nodes on one side of the bipartite graph used in the matching problem, and the nodes on the other side of the matching graph are the tracks in set H.

∗ Refer to Ref. [41] for more discussion on finding cliques of overlapping net intervals and calculating lower bounds on channel densities.

The weights on the edges from subnets to routing tracks are determined based on the track length utilization. The track utilization $U_r(i_x, t)$ of a subnet $i_x$ on track t is defined as

$$U_r(i_x, t) = \frac{\mathrm{len}(i_x)}{\sum_{1 \le y \le k} \mathrm{len}(s_y) + \alpha k} \tag{46.11}$$
where
$\mathrm{len}(i_x)$ and $\mathrm{len}(s_y)$ are the respective lengths of the subnet $i_x$ and the segment $s_y$
$s_y$ is an FPGA routing segment in the track in which $i_x$ is globally routed
k is the number of segments needed to route the subnet on that track

Note that the first and the last FPGA routing segments used in routing the subnet might be longer than what the subnet needs, and hence some of the track length would go underutilized. The algorithm tries to maximize routing segment utilization by matching a subnet to a track that has segments whose lengths and starting points closely match the span of the subnet. This is achieved by maximizing the sum of track utilizations $U_r(i_x, t)$ over all subnets. Parameter α in the equation above is used to enable simultaneous routability and timing optimization.

They further extend the algorithm to consider timing as well as routability using an iterative process. After an initial routing, they distribute timing slacks to nets, and order channels based on how critical they are. A channel is critical if its density is the highest.

46.5.4 LOW POWER ROUTING

The authors in Ref. [36] assume that all switches and connection boxes in a modified island-style FPGA are Vdd-programmable. An SRAM bit determines whether the driver of a particular switch or connection box operates at high or low Vdd. To avoid adding level converters, they enforce the constraint that no low-Vdd switch can drive a high-Vdd element. As a result, each routing tree can be mapped either fully in high-Vdd, or fully in low-Vdd, or in high-Vdd from the source up to a point in the routing tree and in low-Vdd from that point to the sink. In terms of power consumption, it is desirable to map as many routing resources as possible to low-Vdd, as that would consume less power than high-Vdd.
But because low-Vdd resources are slower, care must be taken not to slow down critical paths in the circuit. They propose a heuristic sensitivity-based algorithm and a linear programming formulation for assigning voltage levels to programmable routing resources (switches and their associated buffers). The sensitivity-based method first calculates the power sensitivity $\Delta P / \Delta V_{dd}$ for each routing resource, which is the power reduction obtained by changing high-Vdd to low-Vdd. The resource with the highest sensitivity is tried with low-Vdd. If the path containing the switch does not violate the timing constraint, then the switch and all its downstream routing resources are locked on low-Vdd. Otherwise, the switch is changed back to high-Vdd. The linear programming method tries to distribute path slacks among route segments such that the number of low-Vdd resources is maximized, subject to the constraint that no low-Vdd switch drives a high-Vdd one.

46.5.5 OTHER ROUTING METHODS

In this subsection, we review miscellaneous routing methods such as pipeline routing, congestion-driven routing, and statistical timing routing.

46.5.5.1 Pipeline Routing

Eguro and Hauck [42] propose a timing-driven pipeline-aware routing algorithm that reduces critical path delay. A pipeline-aware routing problem requires the connection from a source node to a sink node to pass through a certain number of pipeline registers, and each segment of the route (between source, sink, and registers) must satisfy delay constraints. The work by Eguro and Hauck adapts PathFinder [37]. When pipelining is considered, the problem becomes more difficult than a traditional routing problem, because as registers move along a route, the criticality of the routing segments changes. For example, suppose a net is to connect logic block A to logic block B through one register R. In the first routing iteration, R might be placed close to A, which makes the subroute A–R not critical, but R–B would probably be critical. In the next iteration, R might move closer to B, and hence the two subroutes might be considered critical and noncritical in successive iterations.

To address the problem stated above, the authors in Ref. [42] perform simultaneous wave propagation maze routing searches, each assuming that the net has a distinct timing-criticality value. When the sink (or a register) is reached in the search process, the routing wave that best balances congestion and timing criticality is chosen. Interested readers are referred to Ref. [42] for more details.

46.5.5.2 Congestion-Driven Routing

Another work that deals with routability and congestion estimation is fGrep [43]. To estimate congestion, waves are started from a source node, and all possible paths are implicitly enumerated at every step of the wave propagation. The probability that the net passes through a particular routing element is the ratio of the number of paths that pass through that routing element to the total number of paths that can route the net. Routing demand, or congestion, on a routing element is defined as the sum of these probabilities over all nets. Of course, performing full wave propagation for every net would be costly. As a trade-off, the authors trim the wave once it has passed a certain predetermined distance, which speeds up the estimation at the cost of accuracy. Another speedup technique used by the authors is to start waves from all terminals of a net and stop when two waves reach each other.

46.5.5.3 Statistical Timing Routing

Statistical timing analysis has found its way into FPGA CAD tools in recent years. Sivaswamy et al. [44] showed that using SSTA during the routing stage could greatly improve timing yield over traditional static timing analysis methods with guard-banding.
More specifically, in their experimental setup they could reduce the yield loss from about 8 per 10,000 chips to about 1 per 10,000 chips. They considered inter- and intradie variations and modeled spatial correlations in their statistical modeling of device parameters.

Matsumoto et al. [45] proposed a reconfiguration methodology for yield enhancement in which multiple routing solutions are generated for a design and the one that yields the best timing for a particular FPGA chip is loaded on that chip. This can be done by performing at-speed testing of an individual FPGA chip using each of the n configurations that are generated and by picking the one that yields the best clock speed. The advantage of this method compared to a method that requires obtaining the delay map of all elements on the chip (e.g., the work by Cheng et al. [32]) is that extensive tests are not required to determine which configuration yields the best timing results.

In the current version of their method, Matsumoto et al. [45] fix the placement and only explore different routing solutions. In each configuration, they try to avoid routing each critical path through the same regions used by other configurations, which means that, ideally, each configuration routes a critical path through a unique set of routing resources that are spatially far away from the paths in other configurations. As a result, if a critical path in one configuration is slow due to process variations, chances are that other configurations route the same path through regions that are faster, resulting in a faster clock frequency. Figure 46.19 shows three configurations with different routes for a critical path and the delay variation map of the switch matrix. Using the delay map in Figure 46.19, we can calculate the delay of the critical path in the first, second, and third configurations as 4.9, 4.5, and 5.1, respectively.
They ignore spatial correlations in their method; hence, they can analytically calculate the probability that a design fails timing constraints given n configurations. The probability that at least one of the n configurations passes the timing test, that is, one minus the probability that none of them passes, is

$$Y_n(\mathrm{Target}) = 1 - \left(1 - Y_1(\mathrm{Target})\right)^n \tag{46.12}$$
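A short numerical illustration of Equation 46.12 follows. It assumes that $Y_1(\mathrm{Target})$ is the probability that a single configuration meets the target, that this probability is the same for every configuration, and that configurations pass or fail independently (consistent with ignoring spatial correlations). The function name and the sample value 0.9 are illustrative.

```python
def yield_with_n_configs(y1: float, n: int) -> float:
    """Equation 46.12: probability that at least one of n configurations passes."""
    if not 0.0 <= y1 <= 1.0 or n < 1:
        raise ValueError("need 0 <= y1 <= 1 and n >= 1")
    return 1.0 - (1.0 - y1) ** n

# Hypothetical single-configuration yield of 90 percent.
for n in range(1, 5):
    print(n, round(yield_with_n_configs(0.9, n), 6))
```

Under these assumptions each extra configuration multiplies the failure probability by (1 − Y_1), so the yield loss shrinks geometrically with n.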
