Handbook of algorithms for physical design automation part 97 pdf

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C045 Finals Page 942 24-9-2008 #3 942 Handbook of Algorithms for Physical Design Automation

[...] to map a circuit to an application-specific integrated circuit (ASIC). The next chapter will describe the physical design algorithms for FPGAs; this chapter sets the stage by describing the architecture of FPGAs. Section 45.2 describes several programming technologies, Section 45.3 describes logic block architectures, Section 45.4 describes routing architectures, and Sections 45.5 and 45.6 describe embedded memories and embedded computation blocks.

45.2 PROGRAMMING TECHNOLOGIES

The circuit being implemented on an FPGA is stored in the FPGA using a set of configuration bits. These bits can be constructed in various ways; this section describes static random access memory (SRAM), Flash, and antifuse-based configuration bits. These schemes are all used in contemporary commercial FPGAs; many FPGA vendors, such as Xilinx, Altera, and Lattice, use SRAM-based configuration bits to control the programmable switches that configure routing and logic [Altera05,Lattice05,Xilinx05a]. Actel produces both Flash and antifuse FPGA products [Actel05a]. QuickLogic uses antifuse technology in its products [Quick05]. Table 45.1 provides a comparison among these three technologies; details on each are provided below. FPGAs based on emerging technologies have also been described [Ferrera04,Dehon05], but because they are not commercially available yet, they will not be discussed further here.

45.2.1 SRAM-BASED FPGAS

The most popular scheme to implement configuration bits is to use SRAM cells. SRAM technology is fast and allows for reprogrammability. In addition, SRAM bits can be implemented using standard complementary metal-oxide-semiconductor (CMOS) processes, meaning that FPGAs using SRAM can be implemented in leading-edge processes. Figure 45.1 shows a typical six-transistor SRAM memory cell.
It uses the data bit in both the true and complement forms to achieve fast read and write times [Trimberger94]. Although a six-transistor cell is generally more stable because it is resistant to state flipping owing to crosstalk or charge sharing [Betz99], four-transistor and five-transistor SRAM cells are possible. Xilinx uses a five-transistor SRAM cell for its FPGAs [Trimberger94].

The main disadvantage of SRAM is its volatility. Data stored in SRAM cells is lost when the power is turned off. Therefore, additional off-chip memory, such as electrically erasable programmable read-only memory (EEPROM), is necessary to store the configuration bits and program the FPGA at power-up. This potentially causes security concerns, because designs can be copied by capturing the external bit stream [Zeidman02]. To address this, some FPGA vendors, such as Altera and Lattice, use on-chip Flash memory to store the configuration bits, so the SRAM-based FPGA can be programmed without external memory upon power-up.

TABLE 45.1 Comparison among SRAM, Antifuse, and Flash

Features                  SRAM   Flash   Antifuse
Volatile                  Yes    No      No
In-system programmable    Yes    Yes     No
Power consumption         High   Lower   Lower
Density                   High   High    High
IP security               No     Yes     Yes
Soft-error resistance     Low    High    High

[FIGURE 45.1 Six-transistor SRAM cell. Labels: Data; Data_bar; Program line (asserted during the configuration phase); Load line (loads the value for data during the configuration phase); Load line_bar (loads the value for data_bar during the configuration phase).]

A second disadvantage of this technology is that SRAM cells are susceptible to neutron-induced errors, also known as soft errors, which are caused by neutrons, alpha particles, and cosmic or terrestrial radiation. These errors are common in high-radiation environments, such as at high altitude or in space.
Such errors do not permanently damage the FPGA, but they may cause instability and functional failure in the system. The main strategies to overcome these errors in SRAM-based FPGAs are triple redundancy, error-correcting or parity codes, and redundancy in time.

45.2.2 FLASH-BASED FPGAS

Flash cells provide nonvolatile programmability while retaining the ability to reprogram the FPGA. Figure 45.2 illustrates the Flash switch used in Actel’s ProASIC3. In the Flash switch, two transistors share the floating gate, which stores the programming data. The sensing transistor is used for writing and verifying the floating-gate voltage, while the switching transistor is employed to configure routing nets and logic. Flash-based FPGAs are more secure and consume less power than their SRAM counterparts [Actel05a]. However, the manufacturing process for Flash is more complicated than that of SRAM. As a result, Flash technology usually lags one to two process generations behind SRAM technologies. Testing is also lengthy owing to the nature of Flash. Therefore, Flash-based FPGAs have a slower time-to-market than SRAM-based FPGAs.

[FIGURE 45.2 Flash-based switch. Labels: wordline, switch input, switch output, sensing transistor, switching transistor, floating gate, readback, configure.]

45.2.3 ANTIFUSE-BASED FPGAS

Antifuses can also be used to implement configuration bits [Actel05b]. An antifuse is a thin insulating layer between conductors. Applying a high voltage alters the insulating layer, creating a low-resistance path between the conductors. This alteration is irreversible. Like Flash, antifuse technology is nonvolatile. The major disadvantage of an antifuse FPGA is that it can be programmed only once. However, it consumes less power and is more area-efficient than SRAM and Flash.
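To illustrate the triple-redundancy strategy mentioned in Section 45.2.1 for soft errors in SRAM-based FPGAs, the following is a minimal Python sketch of a bitwise majority voter over three redundant copies of a word. The function name `majority_vote` and the 8-bit example word are hypothetical and chosen only for illustration; real devices implement voting in hardware, and the chapter does not prescribe a particular scheme.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical 8-bit configuration word stored three times.
golden = 0b10110010
copies = [golden, golden ^ 0b00001000, golden]  # middle copy has one flipped bit

# The voter masks the single upset because the other two copies still agree.
recovered = majority_vote(*copies)
print(f"{recovered:08b}")
```

A voter of this kind tolerates an upset in any one copy per bit position; two copies flipped in the same bit position would defeat it.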
45.3 LOGIC BLOCK ARCHITECTURES

Programmability is provided in an FPGA in two ways. Logic is implemented in configurable logic blocks; these logic blocks are then connected to each other and to the I/O pads using a configurable routing network [Rose93,Betz99]. This section focuses on logic blocks and the next section focuses on the routing network.

45.3.1 LOOKUP-TABLES

Most FPGAs use lookup-tables (LUTs) as their basic logic element. A K-input LUT (K-LUT) is a memory with 2^K bits, K address lines, and a single output line. Each K-LUT can be configured to implement any function of K inputs by storing the truth table of the desired function in the 2^K storage bits. Figure 45.3 shows an example of a 2-input LUT implemented using SRAM cells (antifuse and Flash memory cells could also be used).

Early research showed that K = 4 works well; this is used in most commercial FPGAs [Rose90,Singh92]. Later work reconfirmed that K = 4 is a good choice for area, but that for performance, K = 7 works well [Ahmed04]. In general, the parameter K has a significant impact on the efficiency of the architecture. If K is too large, it may not be possible to completely fill each logic block, while if K is too small, delay will suffer because more logic blocks will be needed along the critical path of a circuit. Figure 45.4 shows how a 6-input function might be implemented with two 4-LUTs; had a 6-LUT been used, only one LUT would be required.

Variations on the basic LUT architecture have been used. Figure 45.5 shows a logic block that employs a fracturable LUT mask (FLM) [Lewis05]. A k,m-FLM can implement a single k-input function or two functions, each with up to k − 1 inputs, which together use no more than k + m distinct inputs. The architecture in Figure 45.5a is a 6,2-FLM. An extension of the FLM architecture, called a shared LUT mask (SLM) architecture, is shown in Figure 45.5b.
A k,m-SLM can implement two identical functions of k inputs provided that the two functions share k − m inputs. The SLM architecture does this through the sharing of LUT masks (the set of configuration bits that indicate the function implemented by the LUT) so that both functions are the same but can have different inputs. The logic block in the Altera Stratix II FPGA is based on a 6,2-SLM [Altera05].

[FIGURE 45.3 Two-input LUT, unprogrammed and programmed as a two-input AND gate. Labels: inputs, SRAM cells, output.]

[FIGURE 45.4 Implementing a 6-input function using two 4-LUTs (2 logic levels) or one 6-input LUT (1 logic level).]

[FIGURE 45.5 Advanced logic block structures: (a) FLM; (b) SLM.]

Lookup-tables are usually coupled with flip-flops, as shown in Figure 45.6. In this structure, a configuration bit is used to control the state of the output multiplexer. Depending on the value of this configuration bit, the output signal of the LUT can be either registered or unregistered. As in Ref. [Betz99], we refer to the LUT and flip-flop as a basic logic element (BLE).

45.3.1.1 Clusters

To increase speed and reduce area and compile time, larger logic blocks are preferred. However, LUT complexity grows exponentially with the number of inputs [Rose93].
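The exponential growth follows directly from the LUT structure described in Section 45.3.1: a K-LUT stores one truth-table bit for each of the 2^K input combinations. The sketch below is a minimal Python model of that idea, not a description of any vendor's circuit. The class name `LUT`, its methods, and the choice of the first input as the most significant address bit are assumptions made for illustration.

```python
from itertools import product


class LUT:
    """A K-input LUT modeled as a 2**K-bit memory addressed by its inputs."""

    def __init__(self, k: int):
        self.k = k
        self.bits = [0] * (2 ** k)  # unprogrammed: every storage bit holds 0

    def program(self, func) -> None:
        """Store the truth table of func, a Python callable taking k bits."""
        for addr, inputs in enumerate(product((0, 1), repeat=self.k)):
            self.bits[addr] = int(bool(func(*inputs)))

    def evaluate(self, *inputs: int) -> int:
        """Read the storage bit selected by the inputs (first input is the MSB)."""
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs, got {len(inputs)}")
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | (bit & 1)
        return self.bits[addr]


and_gate = LUT(2)
and_gate.program(lambda a, b: a & b)
print(and_gate.bits)            # stored truth table of a two-input AND
print(and_gate.evaluate(1, 1))  # output for inputs 1, 1
print([len(LUT(k).bits) for k in (2, 4, 6)])  # storage bits per LUT size
```

Programming the two-input model as an AND gate stores the bits 0, 0, 0, 1, and each additional input doubles the storage: 16 bits for K = 4 and 64 bits for K = 6.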
Clusters are logic blocks of larger granularity, typically composed of multiple BLEs, internal cluster routing, and possibly specialized internal cluster connections, such as carry and arithmetic chains [Marquardt00]. Within a cluster, BLE inputs are typically connected to the cluster inputs and BLE outputs by a multiplexer-based crossbar. This internal interconnect is generally faster than the general-purpose routing between blocks. Altera refers to clusters as logic array blocks (LABs), while Xilinx refers to clusters as configurable logic blocks (CLBs). Figure 45.7 shows a typical cluster.

[FIGURE 45.6 LUT coupled with a flip-flop (BLE). Labels: inputs, clock, 4-input LUT, D flip-flop, output.]

[FIGURE 45.7 Basic BLE and basic cluster composed of identical BLEs. Labels: I inputs, clock, BLE #1 through BLE #N, N outputs.]

The cluster architecture is described by four parameters: (1) K, the number of inputs to a LUT; (2) N, the number of BLEs in a cluster; (3) I, the number of inputs to the cluster that connect to LUT inputs; and (4) M_clk, the number of clock inputs to a cluster (most studies assume this is 1).

Increasing K or N increases the functionality of the cluster. This reduces the number of blocks needed to implement circuits and the number of blocks on the critical path, but increases the size of the block and makes the local cluster interconnect slower. Research has found that K = 4–6 and N = 3–10 provide the best combined speed and area [Ahmed04]. The value of I is often smaller than K × N, because BLEs often share inputs or use the outputs from BLEs within the cluster. Smaller values of I allow smaller multiplexers in the crossbar, reducing area, but overly small values of I make some BLEs unusable. Research has found that 98 percent utilization can be achieved when I = (K/2) × (N + 1) [Ahmed04].
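As a worked illustration of the rule above, the following Python sketch compares I = (K/2) × (N + 1) with the K × N inputs that would be needed if every LUT input had its own cluster pin. The function names and the sample (K, N) pairs are hypothetical; the formula itself is the one attributed to [Ahmed04] in the text, and no rounding is applied because the source does not specify any.

```python
def cluster_inputs(k: int, n: int) -> float:
    """Cluster inputs from the rule I = (K/2) * (N + 1) reported in [Ahmed04]."""
    return (k / 2) * (n + 1)


def full_inputs(k: int, n: int) -> int:
    """Inputs needed if every LUT input had its own cluster pin (K * N)."""
    return k * n


for k, n in [(4, 4), (4, 10), (6, 8)]:
    print(f"K={k} N={n}: I={cluster_inputs(k, n):g} versus K*N={full_inputs(k, n)}")
```

For K = 4 and N = 10, the rule gives I = 22, roughly half of the 40 inputs that full connectivity would require.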
45.3.1.2 Carry Chains

Carry chains are locally routed connections that aid in the efficient implementation of arithmetic operations. They can also be used to implement logical operations, such as parity and comparison, efficiently. Fast carry chains are important because the critical path for these operations is often through the carry. Each 4-LUT in a BLE can be fractured to implement two 3-LUTs; this is sufficient to implement both the sum and carry, given two input bits (a and b) and a carry input, as shown in Figure 45.8. The carry out signal from one BLE would typically be connected to the carry in of an adjacent BLE using a fast dedicated connection. The Z-input is used to break the carry chain before the first bit of an addition.

[FIGURE 45.8 Carry chain connections to a 4-LUT. Labels: a, b, Z, carry in, carry out, sum out, 3-LUT, 2-LUT, P.]

More complex carry schemes have been described. In Ref. [Hauck00], carry chains based on carry select, variable block, and Brent–Kung schemes are described; the Brent–Kung scheme is shown to be 3.8 times faster than the simple ripple carry adder in Figure 45.8. Support for carry-lookahead adders is included in the Actel Axcelerator device and in the Xilinx Virtex-II, Virtex-II Pro, and Virtex-4 devices. Carry select capabilities are included in the Altera Stratix FPGAs. The Altera Stratix-II contains two dedicated 1-bit adders in each logic block. Because high-fanin arithmetic can cause routing congestion in a small area of the device, both Xilinx and Altera parts support two independent carry chains in each cluster. This allows for narrower fanin logic, which helps reduce routing congestion around the adders.

45.3.2 NON-LUT-BASED LOGIC BLOCKS

Not all FPGAs contain logic blocks based on LUTs.
The Actel ProASIC3 logic blocks contain a set of multiplexers, which allow for the implementation of 3-input combinational or sequential functions in each logic block [Actel05a]. The QuickLogic Eclipse II logic cell contains two 6-input AND gates, four 2-input AND gates, and seven two-to-one multiplexers [Quick05]. The use of universal logic modules as FPGA logic blocks has also been proposed; these blocks can implement any function of their inputs by applying input permutation and negation [Lin94]. Finally, programmable devices using more coarse-grained logic blocks exist; these logic blocks are typically arithmetic/logic units and are suitable for computationally intensive applications [Ebeling96,Goldstein00,Singh00,Mei03].

45.4 ROUTING ARCHITECTURES

Connections between logic blocks are implemented using fixed prefabricated metal tracks. These tracks are arranged in channels; channels typically run vertically and horizontally, forming a grid [Lemieux04a]. Although many academic studies have assumed that all channels contain the same number of tracks [Betz99], many commercial architectures (such as those from Altera) contain more tracks in each horizontal channel than in each vertical channel. Figure 45.9 shows an FPGA with tracks arranged in horizontal and vertical channels.

45.4.1 SEGMENTATION

Tracks within a channel can span one logic block or multiple logic blocks. Typically, not all tracks within a channel will be of the same length. Several studies have investigated the optimum segment length. In Ref. [Brown96], a heterogeneous routing architecture, in which some tracks span three logic blocks, some span two logic blocks, and some span one logic block, is found to work well. In Ref. [Betz99], it is shown that longer wires result in a more efficient architecture; the authors suggest that a homogeneous architecture in which all tracks span either four or eight logic blocks gives the most efficient FPGA.
[FIGURE 45.9 Overall routing architecture. Labels: horizontal channel, vertical channel, routing track, switch block, connection block, logic block.]

45.4.2 PROGRAMMABLE SWITCHES

The tracks are connected to each other and to the logic blocks using programmable switches. These programmable switches can be buffered or unbuffered, as shown in Figure 45.10. Switches in modern FPGAs are typically buffered, because unbuffered switches result in a quadratic increase in delay for long connections. Buffered switches can be bidirectional, as shown in Figure 45.10b, or unidirectional, as shown in Figure 45.10c. Although many academic studies assume bidirectional switches [Betz99], most modern FPGAs contain unidirectional switches [Lemieux04b]; these switches allow for better delay optimization and result in a denser routing fabric.

[FIGURE 45.10 Programmable switches: (a) unbuffered; (b) buffered bidirectional; (c) buffered unidirectional.]

45.4.3 SWITCH BLOCKS AND CONNECTION BLOCKS

Tracks are connected to each other using switch blocks, and to logic blocks using connection blocks. Commercial FPGAs often contain combined switch blocks and connection blocks; however, for clarity, this section describes each separately.

[FIGURE 45.11 Switch block patterns.]

A switch block lies at the intersection of each horizontal and vertical channel, and can connect each incident track to some number of other incident tracks.
Academic work uses the notation F_s to describe the number of outgoing tracks to which each incoming track can be connected [Rose93]. Most physical design algorithm studies assume F_s = 3; in this case, each incoming track can be connected to one track on each of the other three sides of the switch block. The switch pattern determines the F_s tracks to which each incoming track can be connected. Academic work has proposed the three switch patterns in Figure 45.11. The disjoint pattern divides the routing fabric into domains; if there are W tracks in each channel, there are W domains. This simplifies the routing task, and results in an efficient layout. The universal pattern has been shown to support the largest number of simultaneous connections through each switch block [Chang96], while the Wilton block has been shown to result in good overall routability [Wilton97]. An extension of the Wilton block to architectures with different segment lengths is described in Ref. [Masud99]. In Ref. [Sivaswamy05], it is proposed that some of the connections in a switch block should be hard-wired (nonprogrammable); this gives a 30 percent speedup, a slight reduction in area, and an 8 percent reduction in power.

Connection blocks are used to connect logic block pins to the routing tracks. Each logic block pin can be connected to a subset of routing tracks in the neighboring channel. The quantity F_c indicates the proportion of the tracks in each channel to which a pin can be connected. In Ref. [Betz99], it is shown that F_c = 0.25–0.5 (depending on the type of switch block employed) works well.

45.4.4 BUS-BASED ROUTING ARCHITECTURES

FPGAs are often used to implement datapath-intensive circuits, in which many signals are part of wide buses. Because each bit of a bus is connected in the same way, it has been suggested that a datapath routing architecture, in which a single configuration bit controls multiple switches, will lead to an improvement in FPGA density. In Ref. [Ye05], the architecture in Figure 45.12 is presented. In this architecture, some of the tracks (the top four in Figure 45.12) are dedicated bus-based routing tracks, and connections to them are controlled by a bus switch; a bus switch contains one switch for each bit, all controlled by a single configuration cell. In this case, each bus (and each bus switch) is 4 bits wide. The lower tracks are regular bit-based routing tracks, which are connected to each other and to the logic cells using standard connection and switch blocks, as described above. In Ref. [Ye05], it is shown that a bus width of 4 works well, and that 40–50 percent of the tracks should be buses (with the remainder being bit-based routing tracks). It is shown that this results in a density improvement of 9.6 percent compared to a conventional architecture.

[FIGURE 45.12 Bus-based routing architecture. Labels: 4-bit bus, bit-based routing tracks, LUT, P = configuration cell.]

45.4.5 PIPELINED INTERCONNECT ARCHITECTURES

In deep-submicron technologies, the delay of long wires can limit the clock speed of the circuit implemented on an FPGA. To address this, several authors have proposed pipelined interconnect architectures [Singh01a,Singh01b,Weaver04]. In these architectures, some of the interconnect switches contain registers. This results in additional complexity for the router, however, because it must now balance the number of registers on each path.

45.5 MEMORIES

Today, FPGAs are often used to implement entire systems. These systems often require storage. Although it is possible to implement storage off-chip, on-chip storage has a number of advantages. On-chip storage reduces system costs, allows for a wider, faster memory interface, and reduces I/O demands on the FPGA.
There are two ways of implementing memory on FPGAs: embedded memory and distributed memory. Embedded memory solutions offer a number of relatively large, fixed, dedicated memory blocks on the FPGA. Distributed memory, on the other hand, uses small memories spread across the entire FPGA chip, often implemented in unused logic elements.

45.5.1 EMBEDDED MEMORY

Most FPGAs contain embedded memory blocks (EMBs). EMBs are typically arranged in columns or rows to simplify connections to logic and between other EMBs [Wilton99], as shown in Figure 45.13. Altera’s Stratix and Stratix-II devices include three different sized EMBs: 512 bits, 4 Kbits, and 512 Kbits [Altera05]. Xilinx’s Virtex-4, Virtex-II, and Spartan series contain 18-Kbit EMBs [Xilinx05a]. Actel’s ProASIC3 and ProASIC-Plus contain 4-Kbit and 2-Kbit EMBs, respectively [Actel05].

[FIGURE 45.13 Logic and memory in an FPGA. Labels: logic blocks, memory arrays.]

Each EMB has a fixed number of bits, but its aspect ratio can be configured by the user. For example, in the Stratix II architecture, a 4-Kbit EMB may be configured to act as a memory with an aspect ratio of 4096 × 1, 2048 × 2, 1024 × 4, 512 × 8, 256 × 16, or 128 × 32. On many devices, EMBs can be configured to act as a ROM, single-port RAM, or dual-port RAM. In addition, they typically include parity bits and various enable/reset control signals, and have synchronous inputs with synchronous or asynchronous outputs.

Of particular importance is the interface between the memory and the logic. Figure 45.14 shows one published scheme; in this architecture, each EMB connects to the logic through a memory-logic interconnect block [Wilton99]. Figure 45.15 shows the contents of one of these memory-logic [...]
[FIGURE 45.14 Memory/logic interconnect architecture. Labels: logic block, memory block, memory/logic interconnect block.]
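The configurable aspect ratios listed in Section 45.5.1 for a 4-Kbit EMB all keep the product of depth and width constant. The short Python sketch below enumerates them under stated assumptions: 4 Kbits is taken as 4096 bits, widths are powers of two, and the maximum width of 32 is read off the list in the text. The function name `aspect_ratios` is hypothetical, and parity bits are ignored.

```python
def aspect_ratios(total_bits: int, max_width: int) -> list[tuple[int, int]]:
    """List (depth, width) pairs for power-of-two widths up to max_width."""
    ratios = []
    width = 1
    while width <= max_width:
        if total_bits % width == 0:
            ratios.append((total_bits // width, width))
        width *= 2
    return ratios


for depth, width in aspect_ratios(4096, 32):
    print(f"{depth} x {width}")
```

Each doubling of the data width halves the depth, which removes one address bit from the memory interface.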
