MEMORY, MICROPROCESSOR, and ASIC (Part 3)

FIGURE 3.18 Noise-reduction output circuit. (From Izumikawa, M. et al., IEEE J. Solid-State Circuits, 32, 1, 52, 1997. With permission.)

FIGURE 3.19 Waveforms of the noise-reduction output circuit (solid lines) and the conventional output circuit: (a) gate bias, (b) data output, and (c) GND bounce. (From Miyaji, F. et al., IEEE J. Solid-State Circuits, 24, 5, 1213, 1989. With permission.)

…inductance of the GND line. Therefore, the address buffer and the ATD circuit are influenced by the GND bounce, and unnecessary signals are generated.

Figure 3.18 shows a noise-reduction output circuit. The waveforms of the noise-reduction output circuit and the conventional output circuit are shown in Fig. 3.19. In the conventional circuit, nodes A and B are connected directly, as shown in Fig. 3.18; its operation and characteristics are shown by the dotted lines in Fig. 3.19. Due to the high-speed driving of transistor M4, the GND potential goes up, and the valid data is delayed by the output ringing. The new noise-reduction output circuit consists of one PMOS transistor, two NMOS transistors, one NAND gate, and the delay part (its characteristics are shown by the solid lines in Fig. 3.19). The operation of this circuit is explained as follows. In the read operation, the control signals CE and OE are at high level and signal WE is at low level. When a data zero is to be output, a logical high level is transferred to node C; transistor M1 is cut off, and M2 raises node A to the middle level. Therefore, the peak current that flows into the GND line through transistor M4 is reduced to less than one half that of the conventional circuit, because M4 is driven by the middle level. After a 5-ns delay from the beginning of the middle level, transistor M3 raises node A to the VDD level. As a result, the conductance of M4 becomes maximum, but the peak current is small because the output voltage is already low. Therefore, the increase of the GND potential is small, and the output ringing does not appear.

References

1. Bellaouar, A. and Elmasry, M.I., Low-Power Digital VLSI Design: Circuits and Systems, Kluwer Academic Publishers, 1995.
2. Ishibashi, K. et al., “A 1-V TFT-Load SRAM Using a Two-Step Word-Voltage Method,” IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1519–1524, Nov. 1992.
3. Chen, C.W. et al., “A Fast 32K×8 CMOS Static RAM with Address Transition Detection,” IEEE J. Solid-State Circuits, vol. SC-22, no. 4, pp. 533–537, Aug. 1987.
4. Miyaji, F. et al., “A 25-ns 4-Mbit CMOS SRAM with Dynamic Bit-Line Loads,” IEEE J. Solid-State Circuits, vol. 24, no. 5, pp. 1213–1217, Oct. 1989.
5. Matsumiya, M. et al., “A 15-ns 16-Mb CMOS SRAM with Interdigitated Bit-Line Architecture,” IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1497–1502, Nov. 1992.
6. Mizuno, H. and Nagano, T., “Driving Source-Line Cell Architecture for Sub-1V High-Speed Low-Power Applications,” IEEE J. Solid-State Circuits, no. 4, pp. 552–557, Apr. 1996.
7. Morimura, H. and Shibata, N., “A Step-Down Boosted-Wordline Scheme for 1-V Battery-Operated Fast SRAM’s,” IEEE J. Solid-State Circuits, no. 8, pp. 1220–1227, Aug. 1998.
8. Yoshimoto, M. et al., “A Divided Word-Line Structure in the Static RAM and Its Application to a 64K Full CMOS RAM,” IEEE J. Solid-State Circuits, vol. SC-18, no. 5, pp. 479–485, Oct. 1983.
9. Hirose, T. et al., “A 20-ns 4-Mb CMOS SRAM with Hierarchical Word Decoding Architecture,” IEEE J. Solid-State Circuits, vol. 25, no. 5, pp. 1068–1074, Oct. 1990.
10. Itoh, K., Sasaki, K., and Nakagome, Y., “Trends in Low-Power RAM Circuit Technologies,” Proceedings of the IEEE, pp. 524–543, Apr. 1995.
11. Nambu, H. et al., “A 1.8-ns Access, 550-MHz, 4.5-Mb CMOS SRAM,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1650–1657, Nov. 1998.
12. Cararella, J.S., “A Low Voltage SRAM for Embedded Applications,” IEEE J. Solid-State Circuits, vol. 32, no. 3, pp. 428–432, Mar. 1997.
13. Prince, B., Semiconductor Memories: A Handbook of Design, Manufacture, and Application, 2nd edition, John Wiley & Sons, 1991.
14. Minato, O. et al., “A 20-ns 64K CMOS RAM,” in ISSCC Dig. Tech. Papers, pp. 222–223, Feb. 1984.
15. Sasaki, K. et al., “A 9-ns 1-Mbit CMOS SRAM,” IEEE J. Solid-State Circuits, vol. 24, no. 5, pp. 1219–1224, Oct. 1989.
16. Seki, T. et al., “A 6-ns 1-Mb CMOS SRAM with Latched Sense Amplifier,” IEEE J. Solid-State Circuits, vol. 28, no. 4, pp. 478–482, Apr. 1993.
17. Kushiyama, N. et al., “An Experimental 295-MHz CMOS 4K × 256 SRAM Using Bidirectional Read/Write Shared Sense Amps and Self-Timed Pulse Word-Line Drivers,” IEEE J. Solid-State Circuits, vol. 30, no. 11, pp. 1286–1290, Nov. 1995.
18. Izumikawa, M. et al., “A 0.25-µm CMOS 0.9-V 100-MHz DSP Core,” IEEE J. Solid-State Circuits, vol. 32, no. 1, pp. 52–60, Jan. 1997.

4 Embedded Memory
Chung-Yu Wu, National Chiao Tung University

4.1 Introduction
4.2 Merits and Challenges
   On-Chip Memory Interface • System Integration • Memory Size
4.3 Technology Integration and Applications
4.4 Design Methodology and Design Space
   Design Methodology
4.5 Testing and Yield
4.6 Design Examples
   A Flexible Embedded DRAM Design • Embedded Memories in MPEG Environment • Embedded Memory Design for a 64-bit Superscalar RISC Microprocessor

4.1 Introduction

As CMOS technology progresses rapidly toward the deep submicron regime, the integration level, performance, and fabrication cost increase tremendously. Thus, low-integration, low-performance small circuits or system chips designed using deep submicron CMOS technology are not cost-effective. Only high-performance system chips that integrate CPU (central processing unit), DSP (digital signal processing) processors or multimedia processors, memories, logic circuits, analog circuits, etc. can afford the deep submicron technology. Such system chips are called system-on-a-chip (SOC) or system-on-silicon (SOS).1,2 A typical example of SOC chips is shown in Fig. 4.1.

Embedded memory has become a key component of SOC and more practical than ever for at least two reasons:3

1. Deep submicron CMOS technology affords a reasonable trade-off for large memory integration with other circuits. It can afford ULSI (ultra large-scale integration) chips with over 10⁹ elements on a single chip. This scale of integration is large enough to build an SOC system, and this size of circuitry inevitably contains different kinds of circuits and technologies. Data processing and storage are the most primitive and basic components of digital circuits, so the memory implementation on logic chips has the highest priority. Currently, in quarter-micron CMOS technology, chips with up to 128 Mbits of DRAM and 500 Kgates of logic circuit, or 64 Mbits of DRAM and 1 Mgates of logic circuit, are feasible.
2. Memory bandwidth is now one of the most serious bottlenecks to system performance. Memory bandwidth is one of the performance determinants of current von Neumann-type MPU (microprocessing unit) systems. The speed gap between MPUs and memory devices has increased in the past decade.
As shown in Fig. 4.1, the MPU speed has improved by a factor of 4 to 20 in the past decade. On the other hand, in spite of exponential progress in storage capacity, minimum access times for each quadrupled storage capacity have improved only by a factor of two, as shown in Fig. 4.2. This is partly due to the I/O speed limitation and to the fact that major efforts in semiconductor memory development have focused on density and bit cost improvements. This speed gap creates a strong demand for memory integration with the MPU on the same chip. In fact, many MPUs with cycle times better than 60 ns have on-chip memories. The new trend in MPUs (i.e., RISC architecture) is another driving force for embedded memory, especially for cache applications.4 RISC architecture is strongly dependent on memory bandwidth, so that high-performance, non-ECL-based RISC MPUs with more than 25 to 50 MHz operation must be equipped with an embedded cache on the chip.

FIGURE 4.1 An example of system-on-a-chip (SOC).

4.2 Merits and Challenges

The main characteristics of embedded memories can be summarized as follows.5

4.2.1 On-Chip Memory Interface

Advantages include:

1. Replacing off-chip drivers with smaller on-chip drivers can reduce power consumption significantly, as large board-wire capacitive loads are avoided. For instance, consider a system that needs a 4-Gbyte/s bandwidth and a bus width of 256 bits. A memory system built with discrete SDRAMs (16-bit interface at 100 MHz) would require about 10 times the power of an embedded DRAM with an internal 256-bit interface.
2. Embedded memories can achieve a much higher fill frequency6 than discrete memories, where fill frequency is defined as the bandwidth (in Mbit/s) divided by the memory size in Mbit (i.e., the number of times per second a given memory can be completely filled with new data). This is because the on-chip interface can be up to 512 bits wide, whereas discrete memories are limited to 16 to 64 bits. Continuing the above example, it is possible to make a 4-Mbit embedded DRAM with a 256-bit interface. In contrast, it would take 16 discrete 4-Mbit chips (256K × 16) to achieve the same width, so the granularity of such a discrete system is 64 Mbits. But the application may only call for, say, 8 Mbits of memory.
3. As interface wire lengths can be optimized for the application in embedded memories, lower propagation times and thus higher speeds are possible. In addition, noise immunity is enhanced.
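The figures quoted in items 1 and 2 above lend themselves to a quick back-of-envelope check. The short Python sketch below recomputes the fill-frequency and granularity numbers and a rough driver-power ratio; the per-wire load capacitances, supply voltage, clock frequency, and switching-activity factor are illustrative assumptions, not values taken from the chapter.

```python
# Quick checks on the on-chip interface figures quoted above.  The load
# capacitances, supply voltage, clock, and activity factor are assumed
# values for illustration only.

def fill_frequency(bandwidth_mbit_s: float, size_mbit: float) -> float:
    """Fill frequency = bandwidth / memory size: how many times per second
    the whole memory could be rewritten with new data."""
    return bandwidth_mbit_s / size_mbit

bus_width, f_clk = 256, 100e6              # 256-bit on-chip interface, assumed 100 MHz
embedded_bw = bus_width * f_clk / 1e6      # Mbit/s
print(fill_frequency(embedded_bw, 4))      # 4-Mbit eDRAM -> 6400 fills/s

# Discrete alternative: 16 chips of 256K x 16 give the same 256-bit width,
# but the smallest such system is 16 x 4 Mbit = 64 Mbit.
discrete_bw = 16 * 16 * f_clk / 1e6
print(fill_frequency(discrete_bw, 64))     # -> 400 fills/s

# Rough driver-power ratio, P ~= activity * C * V^2 * f per switching line.
C_board, C_onchip = 15e-12, 1.5e-12        # assumed load per off-/on-chip line (F)
V, activity = 3.3, 0.5
p_off = bus_width * activity * C_board  * V**2 * f_clk
p_on  = bus_width * activity * C_onchip * V**2 * f_clk
print(round(p_off / p_on))                 # -> 10
```

With these assumed numbers, the off-chip interface burns roughly ten times the driver power of the on-chip one, and the discrete system's fill frequency is an order of magnitude lower, which is the same qualitative picture as in items 1 and 2.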
Challenges and disadvantages include:

1. Although the power consumption per system decreases, the power consumption per chip may increase. Therefore, the junction temperature may increase and the memory retention time may decrease. However, it should be noted that memories are usually low-power devices.
2. Some sort of minimal external interface is still needed in order to test the embedded memory. The hybrid chip is neither a memory nor a logic chip: should it be tested on a memory tester, on a logic tester, or on both?

4.2.2 System Integration

Advantages include:

1. Higher system integration saves board space, packages, and pins, and yields better form factors.
2. Pad-limited designs may be transformed into non-pad-limited ones by choosing an embedded solution.
3. Better speed scalability, along with CMOS technology scaling.

Challenges and disadvantages include:

1. More expensive packages may be needed. Also, memories and logic circuits require different power supplies. Currently, the DRAM power supply (2.5 V) is less than the logic power supply (3.3 V), but this situation will reverse in the future due to the back-biasing problem in DRAMs.
2. The embedded memory process adds another technology for which libraries must be developed and characterized, macros must be ported, and design flows must be tuned.
3. Memory transistors are optimized for low leakage currents, yielding low transistor performance, whereas logic transistors are optimized for high saturation currents, yielding high leakage currents. If a compromise is not acceptable, expensive extra manufacturing steps must be added.
4. Memory processes have fewer layers of metal than do logic circuit processes. Layers can be added at the expense of fabrication cost.
5. Memory fabs are optimized for large-volume production of identical products, for high capacity utilization, and for high yield. Logic fabs, while sharing these goals, are slanted toward lower batch sizes and faster turnaround time.

4.2.3 Memory Size

The advantage is that:

• Memory size can be customized and the memory architecture can be optimized for dedicated applications.

Challenges and disadvantages include:

• On the other hand, the system designer must know the exact memory requirement at the time of design. Later extensions are not possible, as there is no external memory interface. From the customer's point of view, the memory component goes from a commodity to a highly specialized part that may command premium pricing. As memory fabrication processes are quite different, second-sourcing problems abound.

4.3 Technology Integration and Applications3,5

The memory technologies available for embedded memories have a wide variation, from ROM to RAM, as listed in Table 4.1.3 In choosing among these technologies, one of the most important figures of merit is compatibility with the logic process.

1. Embedded ROM: ROM technology has the highest compatibility with the logic process. However, its application is rather limited. PLA, or ROM-based logic design, is a well-used but rather special case of the embedded ROM category. Other applications are limited to storage for microcode or well-debugged control code. A large-size ROM for table or dictionary applications may be implemented in generic ROM chips with lower bit cost.
2. Embedded EPROM/E²PROM: EPROM/E²PROM technology includes high-voltage devices and/or thin tunneling insulators, which require two to three additional mask and processing steps beyond the logic process. Due to its unique functionality, PROM-embedded MPUs7 are well used. To minimize process overhead, a single-poly E²PROM cell has been developed.8 Counterparts to this approach are piggy-back packaged EPROM/MPUs or battery-backed SRAM/MPUs. However, considering process technology innovation, on-chip PROM implementation is winning the game.
3. Embedded SRAM is one of the memories most frequently embedded in logic chips. Major applications are high-speed on-chip buffers such as the TLB, cache, and register file. Table 4.2 gives a comparison of some approaches to SRAM integration. A six-transistor cell approach may be the most highly compatible with the process, unless special structures used in standard 6-Tr SRAMs are employed; the bit density, however, is not very high. Polysilicon resistor-load 4-Tr cells provide higher bit density at the cost of the process complexity associated with the additional polysilicon-layer resistors.
The process complexity and storage density may be compromised to some extent by using a single layer of polysilicon. In the case of a polysilicon resistor-load SRAM, which may have relaxed specifications with respect to data-holding current, the requirement on the substrate structure to achieve good soft-error immunity is more relaxed than for low-standby generic SRAMs. Therefore, the TFT (thin-film transistor) load cell may not be required for several generations, due to its complexity.

4. Embedded DRAM (eDRAM) is not as widely used as SRAMs. Its high-density features, however, are very attractive. Several different embedded DRAM approaches are listed in Table 4.3. A trench or stacked cell, as used in commodity DRAMs, has the highest density, but the complexity is also high. The cost is seldom attractive when compared to a multi-chip approach using standard DRAM, which is the ultimate in achieving low bit cost. This type of cell is well suited for ASM (application-specific memory), which will be described in the next section. A planar cell with multiple (double) polysilicon structures is also suitable for memory-rich applications.9 A gate-capacitor storage cell approach can be fully compatible with the logic process while providing relatively high density.10 The four-Tr cell (a 4-Tr SRAM cell minus the resistive load) provides the same speed and density as SRAM and full compatibility with the logic process, but requires a refresh operation.11

TABLE 4.1 Embedded Memory Technologies and Applications
TABLE 4.2 Embedded SRAM Options
TABLE 4.3 Embedded DRAM Technology Options

4.4 Design Methodology and Design Space3,5

4.4.1 Design Methodology

The design style of embedded memory should be selected according to the application. This choice is critically important for the best performance and cost balance. Figure 4.2 shows the various design styles used to implement embedded memories.

FIGURE 4.2 Various design styles for embedded memories.

The most primitive semi-custom design style is based on the memory cell. It provides high flexibility in memory architecture and short design TAT (turnaround time). However, the memory density is the lowest among the various approaches. The structured array is a kind of gate array that has a dedicated memory-array region in the master chip, configurable to several variations of memory organization by metal-layer customization. Therefore, it provides relatively high density and short TAT. Configurability and the fixed maximum memory area are the limitations of this approach. The standard-cell design has high flexibility to the extent that the cell library has a variety of embedded memory designs. But in many cases, a new system design requires new memory architectures. The memory performance and density are high, but the mask-to-chip TAT tends to be long. Super-integration is an approach that integrates existing chip designs, including I/O pads, so the design TAT is short and proven designs can be used. However, the availability of memory architectures is limited and the mask-to-chip TAT is long. Hand-craft design (which does not necessarily mean the literal use of human hands, but heavy interactive design) provides the most flexibility, high performance, and high density, but the design TAT is the longest. Thus, the design cost is the highest, so the applications are limited to high-volume and/or high-end systems. Standard memories, well-defined ASMs (such as video memories12 and integrated cache memories13), and high-performance MPU-embedded memories are good examples.
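The density argument behind embedded DRAM (item 4 of Section 4.3 above) can be made concrete with a rough area estimate. The per-bit cell areas below, expressed in F² where F is the minimum feature size, and the 30% array overhead are illustrative assumptions; they are not the figures from Tables 4.2 and 4.3, which are not reproduced in this preview.

```python
# Rough macro-area comparison for the cell options discussed above.
# Per-bit cell areas (in F^2) and the 30% array overhead are assumptions.

def macro_area_mm2(bits: int, cell_area_f2: float, feature_um: float,
                   overhead: float = 0.3) -> float:
    cell_um2 = cell_area_f2 * feature_um ** 2      # one bit cell in um^2
    return bits * cell_um2 * (1.0 + overhead) / 1e6

MBIT = 1 << 20
for name, f2 in [("6-Tr SRAM cell", 120),
                 ("4-Tr resistor-load cell", 80),
                 ("1-Tr/1-C stacked DRAM cell", 25)]:
    print(f"{name:28s} {macro_area_mm2(8 * MBIT, f2, 0.25):6.1f} mm^2")

# With these assumptions, an 8-Mbit macro in a 0.25-um process shrinks from
# roughly 82 mm^2 (6-Tr SRAM) to about 17 mm^2 (stacked DRAM cell).
```

The exact numbers depend entirely on the assumed cell sizes, but the ordering, SRAM largest and 1-Tr/1-C DRAM smallest, is the trade-off the text describes against process complexity.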
An eDRAM (embedded DRAM) designer faces a design space that contains a number of dimensions not found in standard ASICs, some of which we will subsequently review. The designer has to choose from a wide variety of memory cell technologies, which differ in the number of transistors and in performance. Also, both DRAM technology and logic technology can serve as the starting point for embedding DRAM. Choosing a DRAM technology as the base technology will result in high memory densities but suboptimal logic performance. On the other hand, starting with logic technology will result in poor memory densities but fast logic circuits. To some extent, one can therefore trade logic speed against logic area. Finally, it is also possible to develop a process that gives the best of both worlds, most likely at higher expense. Furthermore, the designer can trade logic area for memory area in a way heretofore impossible.

Large memories can be organized in very different ways. Free parameters include the number of memory banks (which allows different pages to be open at the same time), the length of a single page, the word width, and the interface organization. Since eDRAM allows one to integrate SRAMs and DRAMs, the decision between on/off-chip DRAM partitioning and SRAM/DRAM partitioning must be made. In particular, the following problems must be solved at the system level:

• Optimizing the memory allocation
• Optimizing the mapping of the data into memory such that the sustainable memory bandwidth approaches the peak bandwidth
• Optimizing the access scheme to minimize the latency for the memory clients and thus minimize the necessary FIFO depth

These goals are to some extent independent of whether or not the memory is embedded. However, the number of free parameters available to the system designer is much larger in an embedded solution, and the possibility of approaching the optimal solution is thus correspondingly greater. On the other hand, the complexity is also increased. It is therefore incumbent upon eDRAM suppliers to make the trade-offs transparent and to quantize the design space into a set of understandable, if slightly suboptimal, solutions.

4.5 Testing and Yield3,5

Although embedded memory occupies a minor portion of the total chip area, the device density in the embedded memory area is generally overwhelming. Failure distribution is naturally localized at the memory areas. In other words, embedded memory is a determinant of the total chip yield, to the extent that the memory portion has higher device density weighted by its silicon area. For a large memory-embedded VLSI, memory redundancy is helpful in enhancing the chip yield. Therefore, embedded-memory testing, combined with the redundancy scheme, is an important issue. The implementation of means for direct measurement of the embedded memory on wafer, as well as in assembled samples, is necessary.
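Section 4.5 stresses that testing the embedded memory, combined with a redundancy scheme, largely determines chip yield, but no specific test algorithm is spelled out at this point in the chapter. As an illustration only, the sketch below runs a March C- style test over a word-organized array in software; an on-chip BIST engine would step through the same element sequence in hardware, and the failing-address list is what a row/column replacement (redundancy) scheme would consume.

```python
# Illustrative March C- pass over a word-organized memory.  `read`/`write`
# model the memory under test; misbehaving addresses are collected so that
# a redundancy scheme could repair them.

def march_c_minus(read, write, n, zero=0x00, one=0xFF):
    """Elements: up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); down(r0)."""
    fails = set()
    up, down = range(n), range(n - 1, -1, -1)
    for a in up:                         # M0: write background 0
        write(a, zero)
    for a in up:                         # M1: read 0, write 1
        if read(a) != zero: fails.add(a)
        write(a, one)
    for a in up:                         # M2: read 1, write 0
        if read(a) != one: fails.add(a)
        write(a, zero)
    for a in down:                       # M3: read 0, write 1
        if read(a) != zero: fails.add(a)
        write(a, one)
    for a in down:                       # M4: read 1, write 0
        if read(a) != one: fails.add(a)
        write(a, zero)
    for a in down:                       # M5: read 0
        if read(a) != zero: fails.add(a)
    return sorted(fails)

# Toy usage: a 1-K word model with a stuck-at-1 fault in bit 0 of address 100.
ram = [0] * 1024
read = lambda a: ram[a] | (1 if a == 100 else 0)
write = lambda a, d: ram.__setitem__(a, d)
print(march_c_minus(read, write, len(ram)))   # -> [100]
```

A linear-time march test like this scales well with memory size, which matters for the test-time concern raised later in the excerpt; more thorough (and slower) patterns exist, and the choice is part of the design-for-testability trade-off.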
[...] was started from a basic MOS structure. As shown in Fig. 5.1, the insulator in the conventional MOS structure was replaced with a thin oxide layer (I1), an isolated metal layer (M1), and a thick oxide layer (I2). These stacked oxide and metal layers led to the so-called MIMIS structure. In this [...]

FIGURE 5.1 Schematic [...]

FIGURE 5.6 Schematic cross-section of ETOX-type Flash memory cell: (a) the top view of the cell, and (b) the cross-section along the channel length and channel width.

FIGURE 5.7 Schematic cross-section of split-gate Flash memory cell.

[...] and (5.3), where C_FG, C_B, C_D, and C_S are the capacitances between the floating gate and the control gate, well terminal, drain terminal, and source terminal [...]

[...] area and the power that would be needed if the IDCT input buffer were built with flip-flops. In the orthogonal memory, word-lines and bit-lines run both vertically and horizontally to achieve the functionality. The macro size of the orthogonal memory is 420 µm × 760 µm, with a memory cell size of 10.8 µm × 32.0 µm.

FIGURE 4.5 Circuit diagram of orthogonal memory.

FIFOs and [...]

[...] to make a virtual connection, as shown in Fig. 4.18. These works are basically handled by CAD software plus small programming, without editing the layout by hand. Direct testing of large on-chip memory is highly preferable in VLSI because of faster test time and complete test coverage. TFP IU defines cache [...]

FIGURE 4.17 VRAM row decoder.

FIGURE 4.18 RAM layout verification.

[...] and Kohyama, S., “A 72K CMOS Channelless Gate Array with Embedded 1-Mbit Dynamic RAM,” IEEE CICC, Proc. 20.3.1, May 1988. 10. Archer, D., Deverell, D., Fox, F., Gronowski, P., Jain, A., Leary, M., Olesin, A., Persels, S., Rubinfeld, P., Schmacher, D., Supnik, B., and Thrush, T., “A 32b CMOS Microprocessor with On-Chip Instruction and Data Caching and Memory Management,” ISSCC Digest of Technical Papers, pp. 32–33, Feb. 1987. 11. Beyers, J.W., Dohse, L.J., Fucetola, J.P., Kochis, R.L., Lob, C.G., Taylor, G.L., and Zeller, E.R., “A 32b VLSI CPU Chip,” ISSCC Digest of Technical Papers, pp. 104–105, Feb. 1981. 12. Ishimoto, S., Nagami, A., Watanabe, H., Kiyono, J., Hirakawa, N., Okuyama, Y., Hosokawa, F., and Tokushige, K., “256K Dual Port Memory,” ISSCC Digest of Technical Papers, pp. 38–39, Feb. 1985. 13. Sakurai, T., [...]

[...] done about 3 ns before the end of the cycle, as shown in Fig. 4.11. To take advantage of this big address setup time, the address is received by a transparent latch, TLAT_N (transparent while the clock is low), instead of a flip-flop. Thus, decode is started as [...]

TABLE 4.6 Summary of Embedded RAM Features.

FIGURE 4.12 Basic RAM block diagram.

FIGURE 4.13 RAM timing [...]

[...] electrons are injected into and ejected from the floating gate through a high-quality thin oxide region outside the channel region.5 The FLOTOX cell must be isolated by a select transistor to avoid the over-erase issue, and therefore it consists of [...]

FIGURE 5.3 Schematic cross-section of p-channel SAMOS structure.

FIGURE 5.4 Schematic cross-section of FLOTOX structure.
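One of the Flash-memory fragments above cites Eq. (5.3) and defines C_FG, C_B, C_D, and C_S as the capacitances from the floating gate to the control gate, well, drain, and source, but the equation body itself is lost in this preview. For orientation only, the usual capacitive-divider relation for the floating-gate potential of such a cell has the form below; this is a standard-form assumption, not necessarily the book's exact Eq. (5.3):

$$
V_{FG} \;=\; \frac{C_{FG}V_{CG} + C_{D}V_{D} + C_{S}V_{S} + C_{B}V_{B} + Q_{FG}}{C_{FG} + C_{B} + C_{D} + C_{S}},
\qquad
\alpha_{G} \;=\; \frac{C_{FG}}{C_{FG} + C_{B} + C_{D} + C_{S}},
$$

where Q_FG is the charge stored on the floating gate and α_G is the gate coupling ratio, which sets how much of the control-gate voltage actually appears on the floating gate during program and erase.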
[...] TFP is a high-speed and highly concurrent 64-bit superscalar RISC microprocessor, which can issue up to four instructions per cycle.17,18 Very wide bandwidth [...] increase design cost and chip cost. Reading very wide data causes large power dissipation. Test time of the chip could be increased because of the large memory. Therefore, design efficiency, careful power bus design, and careful design for testability are necessary. [...]

FIGURE 4.9 Instruction RAM masterslice for code debugging.
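The TFP fragment above motivates very wide on-chip buses from the four-instructions-per-cycle issue rate. A one-line estimate of the instruction-fetch traffic alone shows why; the clock frequency used here is an assumed figure for illustration, not a specification taken from this excerpt.

```python
# Instruction-fetch bandwidth needed to sustain a 4-issue core.
# The clock frequency is an assumed value for illustration.

issue_width = 4          # instructions per cycle
instr_bytes = 4          # one 32-bit RISC instruction
f_clk = 75e6             # assumed clock, Hz

fetch_bw = issue_width * instr_bytes * f_clk   # bytes per second
print(fetch_bw / 1e6, "MB/s")                  # -> 1200.0 MB/s
# Sustaining on the order of a gigabyte per second of instruction fetch,
# before any data traffic, across package pins is impractical; embedding
# the caches and reading very wide words on-chip avoids that bottleneck,
# at the cost of the design, power, and test issues listed above.
```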
