MEMORY, MICROPROCESSOR, and ASIC: Part 8


…back to memory. The memory system is constructed of basic semiconductor DRAM units called modules or banks. Several properties of memory, including speed, capacity, and cost, play an important role in overall system performance. The speed of the memory system is the key performance parameter in the design of the microprocessor system. The latency (L) of the memory is defined as the time delay from when the processor first requests data from memory until the processor receives the data. Bandwidth (B) is defined as the rate at which information can be transferred from the memory system. Memory bandwidth and latency are related to the number of outstanding requests (R) that the memory system can service:

R = L × B (11.4)

Bandwidth plays an important role in keeping the processor busy with work. However, technology trade-offs to optimize latency and improve bandwidth often conflict with the need to increase the capacity and reduce the cost of the memory system.

Cache Memory

Cache memory, or simply cache, is a small, fast memory constructed using semiconductor SRAM. In modern computer systems, there is usually a hierarchy of cache memories. The top-level cache is closest to the processor and the bottom level is closest to the main memory. Each higher-level cache is about 5 to 10 times faster than the next level. The purpose of a cache hierarchy is to satisfy most of the processor memory accesses in one or a small number of clock cycles. The top-level cache is often split into an instruction cache and a data cache to allow the processor to perform simultaneous accesses for instructions and data. Cache memories were first used in IBM mainframe computers in the 1960s. Since 1985, cache memories have become a standard feature of virtually all microprocessors.

Cache memories exploit the principle of locality of reference. This principle dictates that some memory locations are referenced more frequently than others, based on two program properties. Spatial locality is the property that an access to a memory location increases the probability that nearby memory locations will also be accessed. Spatial locality is predominantly based on sequential access to program code and structured data. Temporal locality is the property that an access to a memory location greatly increases the probability that the same location will be accessed in the near future. Together, the two properties ensure that most memory references will be satisfied by the cache memory.

There are several different cache memory designs: direct-mapped, fully associative, and set-associative. Figure 11.6 illustrates the two basic schemes of cache memory: direct-mapped and set-associative. Direct-mapped cache, shown in Fig. 11.6(a), allows each memory block exactly one place to reside within the cache. Fully associative cache allows a block to be placed anywhere in the cache. Set-associative cache, shown in Fig. 11.6(b), restricts a block to a limited set of places in the cache.

Cache misses are said to occur when the requested data does not reside in any of the possible cache locations. Misses in caches can be classified into three categories: conflict, compulsory, and capacity. Conflict misses are misses that would not occur in a fully associative cache with least recently used (LRU) replacement. Compulsory misses are those incurred when a memory location is referenced for the first time. Capacity misses occur when the cache size is not sufficient to hold data between references.
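As a concrete illustration of the mapping schemes of Fig. 11.6, the sketch below shows how a direct-mapped cache splits an address into tag, index, and offset and decides whether a reference hits or misses. The cache geometry (256 lines of 32 bytes) is an assumption made purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry for illustration: 256 lines x 32-byte blocks = 8 KB, direct-mapped. */
#define BLOCK_BITS 5u
#define INDEX_BITS 8u
#define NUM_LINES  (1u << INDEX_BITS)

typedef struct { bool valid; uint32_t tag; } cache_line_t;
static cache_line_t cache[NUM_LINES];

/* Return true on a hit; on a miss, install the tag (the data fill is omitted). */
bool cache_access(uint32_t addr) {
    uint32_t index = (addr >> BLOCK_BITS) & (NUM_LINES - 1u);  /* selects the one possible line */
    uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);        /* identifies the memory block   */

    if (cache[index].valid && cache[index].tag == tag)
        return true;                                           /* hit                           */

    cache[index].valid = true;                                 /* miss: replace the line        */
    cache[index].tag   = tag;
    return false;
}
```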
Complete cache miss definitions are provided in Ref. 4. Unlike the memory system properties above, the latency of cache memories is not fixed and depends on the delay and frequency of cache misses. A performance metric that accounts for the penalty of cache misses is effective latency. Effective latency depends on the two possible latencies: hit latency (L_HIT), the latency experienced when accessing data residing in the cache, and miss latency (L_MISS), the latency experienced when accessing data not residing in the cache. Effective latency also depends on the hit rate (H), the percentage of memory accesses that hit in the cache, and the miss rate (M, or 1 − H), the percentage of memory accesses that miss in the cache. Effective latency in a cache system is calculated as:

L_EFFECTIVE = H × L_HIT + (1 − H) × L_MISS (11.5)

In addition to the base cache design and size issues, there are several other cache parameters that affect the overall cache performance and miss rate in a system. The main memory update method indicates when the main memory will be updated by store operations. In a write-through cache, each write is immediately reflected to the main memory. In a write-back cache, writes are reflected to the main memory only when the respective cache block is replaced. Cache block allocation is another parameter and designates whether the cache block is allocated on writes or on reads. Last, block replacement algorithms for associative structures can be designed in various ways to extract additional cache performance. These include least recently used (LRU), least frequently used (LFU), random, and first-in, first-out (FIFO). These cache management strategies attempt to exploit the properties of locality: spatial locality is exploited by deciding which memory block is placed in the cache, and temporal locality is exploited by deciding which cache block is replaced. Traditionally, a cache servicing a miss would block all new requests. However, a non-blocking cache can be designed to service multiple miss requests simultaneously, thus alleviating the delay in accessing memory data.

In addition to the multiple levels of cache hierarchy, additional memory buffers can be used to improve cache performance. Two such buffers are a streaming/prefetch buffer and a victim cache [2]. Figure 11.7 illustrates the relation of the streaming buffer and the victim cache to the primary cache of a memory system. A streaming buffer is used as a prefetching mechanism for cache misses. When a cache miss occurs, the streaming buffer begins prefetching successive lines starting at the miss target. A victim cache is typically a small, fully associative cache loaded only with cache lines that are removed from the primary cache. In the case of a miss in the primary cache, the requested data may still reside in the victim cache. The use of a victim cache can improve performance by reducing the number of conflict misses. Figure 11.7 illustrates how cache accesses are processed through the streaming buffer into the primary cache on cache requests, and from the primary cache through the victim cache to the secondary level of memory on cache misses.

Overall, cache memory is constructed to hold the most important portions of memory. Techniques using either hardware or software can be used to select which portions of main memory to store in the cache. However, cache performance is strongly influenced by program behavior and numerous hardware design alternatives.

FIGURE 11.6 Cache memory: (a) direct-mapped design, (b) two-way set-associative design.

FIGURE 11.7 Advanced cache memory system.
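Before leaving cache memory, a short numeric sketch of Eq. 11.5 shows how the parameters combine. The values used (1-cycle hits, 20-cycle misses, 95% hit rate) are assumed for illustration and are not measurements of any particular machine.

```c
#include <stdio.h>

/* Effective latency per Eq. 11.5: a weighted average of hit and miss latency. */
static double effective_latency(double hit_rate, double l_hit, double l_miss) {
    return hit_rate * l_hit + (1.0 - hit_rate) * l_miss;
}

int main(void) {
    /* Assumed example values: H = 0.95, L_HIT = 1 cycle, L_MISS = 20 cycles. */
    printf("effective latency = %.2f cycles\n", effective_latency(0.95, 1.0, 20.0));
    return 0;  /* prints 1.95 cycles */
}
```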
Virtual Memory

Cache memory illustrated the principle that the memory address of data can be separate from a particular storage location. Similar address abstractions exist in the two-level memory hierarchy of main memory and disk storage. An address generated by a program is called a virtual address, which needs to be translated into a physical address, or location, in main memory. Virtual memory management is a mechanism that provides programmers with a simple, uniform method of accessing both main and secondary memories. With virtual memory management, programmers are given a virtual space to hold all of their instructions and data. The virtual space is organized as a linear array of locations, each with an address for convenient access. Instructions and data have to be stored somewhere in the real system, so these virtual space locations must correspond to physical locations in the main and secondary memory. Virtual memory management assigns (or maps) the virtual space locations onto main and secondary memory locations; the programmers are not concerned with the mapping.

The most popular memory management scheme today is demand paging virtual memory management, where each virtual space is divided into pages indexed by the page number (PN). Each page consists of several consecutive locations in the virtual space indexed by the page index (PI). The number of locations in each page is an important system design parameter called the page size. The page size is usually defined as a power of two so that the virtual space can be divided into an integer number of pages. Pages are the basic unit of virtual memory management. If any location in a page is assigned to the main memory, the other locations in that page are also assigned to the main memory. This reduces the size of the mapping information.

The part of the secondary memory that accommodates pages of the virtual space is called the swap space. Both the main memory and the swap space are divided into page frames. Each page frame can host a page of the virtual space. If a page is mapped into the main memory, it is hosted by a page frame in the main memory. The mapping record in the virtual memory management keeps track of the association between pages and page frames.

When a virtual space location is requested, the virtual memory management looks up the mapping record. If the mapping record shows that the page containing the requested virtual space location is in main memory, the management performs the access without any further complication. Otherwise, a secondary memory access has to be performed. Accessing the secondary memory is a complicated task and is usually performed as an operating system service. In order to access a piece of information stored in the secondary memory, an operating system service usually has to be requested to transfer the information into the main memory. This also applies to virtual memory management. When a page is mapped into the secondary memory, the virtual memory management has to request an operating system service to transfer the requested virtual space location into the main memory, update its mapping record, and then perform the access. The operating system service thus performed is called the page fault handler.
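Because the page size is a power of two, the page number and page index can be obtained from a virtual address with a shift and a mask. A minimal sketch, assuming a 4-KB page size purely for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: a 4-KB page size is assumed here. */
#define PAGE_SIZE  4096u
#define PAGE_SHIFT 12u                                  /* log2(PAGE_SIZE)         */

int main(void) {
    uint32_t va = 0x00012ABCu;                          /* example virtual address */
    uint32_t pn = va >> PAGE_SHIFT;                     /* page number (PN)        */
    uint32_t pi = va & (PAGE_SIZE - 1u);                /* page index (PI)         */
    printf("VA=0x%08X  PN=0x%X  PI=0x%X\n", va, pn, pi);
    return 0;
}
```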
The core process of virtual memory management is a memory access algorithm. A one-level virtual address translation algorithm is illustrated in Fig. 11.8. At the start of the translation, the memory access algorithm receives a virtual address in a memory address register (MAR), looks up the mapping record, requests an operating system service to transfer the required page if necessary, and performs the main memory access. The mapping is recorded in a data structure called the page table, located in main memory at a designated location marked by the page table base register (PTBR). The page table index and the PTBR together form the physical address (PA_PTE) of the respective page table entry (PTE). Each PTE keeps track of the mapping of a page in the virtual space. It includes two fields: a hit/miss bit and a page frame number. If the hit/miss (H/M) bit is set (hit), the corresponding page is in main memory. In this case, the page frame hosting the requested page is pointed to by the page frame number (PFN). The final physical address (PA_D) of the requested data is then formed from the PFN and the PI. The data is returned and placed in the memory buffer register (MBR), and the processor is informed of the completed memory access. Otherwise (miss), a secondary memory access has to be performed; in this case, the page frame number is ignored and the page fault handler has to be invoked to access the secondary memory. The hardware component that performs the address translation algorithm is called the memory management unit (MMU).

The complexity of the algorithm depends on the mapping structure. A very simple mapping structure is used in this section to focus on the basic principles of the memory access algorithms. However, more complex two-level schemes are often used due to the size of the virtual address space. The page table may be quite large for realistic virtual address spaces and main memory sizes. As such, it becomes necessary to map portions of the page table through a second page table. In such designs, only the second-level page table is stored in a reserved region of main memory, while the first page table is mapped just like the data in the virtual spaces. There are also requirements for such designs in a multiprogramming system, where multiple processes are active at the same time. Each process has its own virtual space and therefore its own page table. As a result, these systems need to keep multiple page tables at the same time, and it usually takes too much main memory to accommodate all the active page tables. Again, the natural solution to this problem is to provide further levels of mapping.

FIGURE 11.8 Virtual memory translation.

Translation Lookaside Buffer

Hardware support for a virtual memory system generally includes a mechanism to translate virtual addresses into the real physical addresses used to access main memory. A translation lookaside buffer (TLB) is a cache structure that contains the frequently used page table entries for address translation. With a TLB, address translation can be performed in a single clock cycle when the TLB contains the required page table entries (a TLB hit). The full address translation algorithm is performed only when the required page table entries are missing from the TLB (a TLB miss).
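A minimal sketch of the one-level translation of Fig. 11.8, with a small TLB consulted before the page table walk. The sizes, the TLB indexing scheme, and the PTE layout are illustrative assumptions rather than the organization of any particular machine:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12u                     /* assumed 4-KB pages              */
#define PAGE_MASK   ((1u << PAGE_SHIFT) - 1u)
#define NUM_PAGES   256u                    /* tiny virtual space, for the demo */
#define TLB_ENTRIES 16u                     /* illustrative TLB size           */

typedef struct { bool hit; uint32_t pfn; } pte_t;        /* H/M bit + PFN      */
typedef struct { bool valid; uint32_t pn, pfn; } tlb_t;

static pte_t page_table[NUM_PAGES];         /* located via the PTBR in Fig. 11.8 */
static tlb_t tlb[TLB_ENTRIES];

/* Stand-in for the OS page fault handler: "loads" the page and updates the PTE. */
static void page_fault_handler(uint32_t pn) {
    page_table[pn].hit = true;
    page_table[pn].pfn = pn + 100u;         /* arbitrary frame assignment      */
}

/* Translate a virtual address to a physical address (schematic). */
static uint32_t translate(uint32_t va) {
    uint32_t pn = va >> PAGE_SHIFT, pi = va & PAGE_MASK;

    tlb_t *e = &tlb[pn % TLB_ENTRIES];                    /* TLB lookup        */
    if (e->valid && e->pn == pn)
        return (e->pfn << PAGE_SHIFT) | pi;               /* TLB hit           */

    pte_t pte = page_table[pn];                           /* PA_PTE = PTBR + PN */
    if (!pte.hit) {                                       /* H/M bit clear      */
        page_fault_handler(pn);                           /* bring the page in  */
        pte = page_table[pn];                             /* re-read updated PTE */
    }
    *e = (tlb_t){ .valid = true, .pn = pn, .pfn = pte.pfn };
    return (pte.pfn << PAGE_SHIFT) | pi;                  /* PA_D = PFN | PI    */
}

int main(void) {
    printf("0x%08X\n", translate(0x00012ABCu));           /* page fault, then fill */
    printf("0x%08X\n", translate(0x00012ABCu));           /* TLB hit               */
    return 0;
}
```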
Complexities arise when a system includes both virtual memory management and cache memory. The major issue is whether address translation is performed before accessing the cache memory. In virtual cache systems, the virtual address directly accesses the cache. In a physical cache system, the virtual address is translated into a physical address before the cache access. Figure 11.9 illustrates both the virtual and the physical cache translation approaches.

A virtual cache system typically overlaps the cache memory access and the access to the TLB. The overlap is possible when the virtual memory page size is larger than the cache capacity divided by the degree of cache associativity. Essentially, since the page index bits are identical in the virtual and physical addresses, no translation of the lower index bits of the virtual address is necessary. Thus, the cache can be accessed in parallel with the TLB, or the TLB can be accessed after the cache access for cache misses. Typically, with no TLB logic between the processor and the cache, cache access can be achieved at lower cost in virtual cache systems, and multi-access-per-cycle cache systems can avoid requiring a multiported TLB. However, the virtual cache alternative introduces virtual memory consistency problems. The same virtual address from two different processes refers to different physical memory locations. Solutions to this form of aliasing are to attach a process identifier to the virtual address or to flush the cache contents on context switches. Another potential alias problem is that different virtual addresses of the same process may be mapped to the same physical address. In general, there is no easy solution, since it involves a reverse translation problem.

FIGURE 11.9 Translation lookaside buffer (TLB) architectures: (a) virtual cache, (b) physical cache.

Physical cache designs are not always limited by the delay of the TLB and cache access. In general, there are two solutions that allow large physical cache designs. The first solution, employed by companies with past commitments to a page size, is to increase the set associativity of the cache. This allows the cache index portion of the address to be used immediately by the cache, in parallel with virtual address translation. However, large set associativity is very difficult to implement in a cost-effective manner. The second solution, employed by companies without such a past commitment, is to use a larger page size. The cache can be accessed in parallel with the TLB access, similar to the first solution. In this solution, fewer of the address index bits need to be translated through the TLB, potentially reducing the overall delay. With larger page sizes, virtual caches do not have an advantage over physical caches in terms of access time.

11.3.3 Input/Output Subsystem

The input/output (I/O) subsystem transfers data between the internal components (CPU and main memory) and the external devices (disks, terminals, printers, keyboards, scanners).

Peripheral Controllers

The CPU usually controls the I/O subsystem by reading from and writing into the I/O (control) registers. There are two popular approaches for allowing the CPU to access these I/O registers: I/O instructions and memory-mapped I/O. In an I/O instruction approach, special instructions are added to the instruction set to access I/O status flags, control registers, and data buffer registers. In a memory-mapped I/O approach, the control registers, the status flags, and the data buffer registers are mapped as physical memory locations.
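With memory-mapped I/O, device registers are reached with ordinary loads and stores through pointers to fixed physical addresses. A minimal bare-metal-style sketch; the register addresses and bit layout below are purely hypothetical:

```c
#include <stdint.h>

/* Hypothetical device register addresses and bit layout, for illustration only. */
#define UART_STATUS ((volatile uint32_t *)0x4000A000u)
#define UART_DATA   ((volatile uint32_t *)0x4000A004u)
#define TX_READY    (1u << 0)

/* Busy-wait until the (hypothetical) device is ready, then write one byte. */
void uart_putc(uint8_t c) {
    while ((*UART_STATUS & TX_READY) == 0)
        ;                            /* status flag polled with a normal load  */
    *UART_DATA = c;                  /* data buffer register written with a store */
}
```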
Due to the increasing availability of chip area and pins, microprocessors are increasingly including peripheral controllers on-chip. This trend is especially clear for embedded microprocessors.

Direct Memory Access Controller

A DMA controller is a peripheral controller that can directly drive the address lines of the system bus. The data is moved directly from the data buffer to the main memory, rather than from the data buffer to a CPU register and then from the CPU register to main memory.

11.3.4 System Interconnection

System interconnection comprises the facilities that allow the components within a computer system to communicate with each other. There are numerous logical organizations of these system interconnect facilities. Dedicated links, or point-to-point connections, enable dedicated communication between components. There are different system interconnection configurations based on the connectivity of the system components. A complete connection configuration, requiring N(N − 1)/2 links, is created when there is one link between every possible pair of components. A hypercube configuration assigns a unique binary n-tuple as the coordinate of each component and constructs a link between components whose coordinates differ in only one dimension, requiring N·log N links. A mesh connection arranges the system components into an n-dimensional array and has connections between immediate neighbors, requiring 2·N links.

Switching networks are groups of switches that determine the existence of communication links among components. A cross-bar network is considered the most general form of switching network and uses an N×M two-dimensional array of switches to provide an arbitrary connection between N components on one side and M components on the other side, using N·M switches and N + M links. Another switching network is the multistage network, which employs multiple stages of shuffle networks to provide a permutation connection pattern between N components on each side, using N·log N switches and N·log N links.

Shared buses are single links that connect all components to all other components and are the most popular connection structure. The sharing of buses among the components of a system requires several aspects of bus control. First, there is a distinction between bus masters, the units controlling bus transfers (CPU, DMA, IOP), and bus slaves, the other units (memory, programmed I/O interfaces). Bus interfacing and bus addressing are the means of connecting and disconnecting units on the bus. Bus arbitration is the process of granting the bus resource to one of the requesters. Arbitration typically uses a selection scheme similar to interrupts; however, the methods of establishing selection are more fixed. Fixed-priority arbitration gives every requester a fixed priority, while round-robin arbitration ensures that every requester is the most favored at some point in time. Bus timing refers to the method of communication among the system units and can be classified as either synchronous or asynchronous. Synchronous bus timing uses a shared clock that defines the times at which other bus signals change and stabilize. Clock sharing by all units allows the bus to be monitored at agreed time intervals and action taken accordingly. However, the synchronous system bus must operate at the speed of the slowest component. Asynchronous bus timing allows units to use different clocks, but the lack of a shared clock makes it necessary to use extra signals to determine the validity of bus signals.
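The link and switch counts quoted above for the different interconnection configurations can be tabulated with a small helper that simply evaluates the formulas given in the text (N = M = 16 is chosen arbitrarily):

```c
#include <math.h>
#include <stdio.h>

/* Evaluate the link/switch-count formulas from the text for N components
 * (and M on the far side of a cross-bar).  Illustrative only. */
int main(void) {
    double N = 16, M = 16;
    printf("complete connection : %.0f links\n", N * (N - 1) / 2);
    printf("hypercube           : %.0f links\n", N * log2(N));
    printf("mesh                : %.0f links\n", 2 * N);
    printf("cross-bar           : %.0f switches, %.0f links\n", N * M, N + M);
    printf("multistage          : %.0f switches, %.0f links\n",
           N * log2(N), N * log2(N));
    return 0;
}
```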
11.4 Instruction Set Architecture

There are several elements that characterize an instruction set architecture, including word size, instruction encoding, and architecture model.

Word Size

Programs often differ in the size of data they prefer to manipulate. Word processing programs operate on 8-bit or 16-bit data that correspond to characters in text documents. Many applications require 32-bit integer data to avoid frequent overflow in arithmetic calculations. Scientific computation often requires 64-bit floating-point data to achieve the desired accuracy. Operating systems and databases may require 64-bit integer data to represent a very large name space with integers. As a result, processors are usually designed to access multiple-byte data from memory systems. This is a well-known source of complexity in microprocessor design.

The endian convention specifies the numbering of bytes within a memory word. In the little endian convention, the least significant byte in a word is numbered byte 0. The number increases as the positions increase in significance. The DEC VAX and X86 architectures follow the little endian convention. In the big endian convention, the most significant byte in a word is numbered 0. The number increases as the positions decrease in significance. The IBM 360/370, HP PA-RISC, Sun SPARC, and Motorola 680X0 architectures follow the big endian convention. The difference usually manifests itself when users try to transfer binary files between machines using different endian conventions.

Instruction Encoding

Instruction encoding plays an important role in the code density and performance of microprocessors. Traditionally, the cost of memory capacity was the determining factor in designing either a fixed-length or a variable-length instruction set. Fixed-length instruction encoding assigns the same encoding size to all instructions. Fixed-length encoding is generally a characteristic of modern microprocessors and a product of the continuing advancements in memory capacity. Variable-length instruction set is the term used to describe a style of instruction encoding that uses different instruction lengths according to the addressing modes of the operands. Common addressing modes include register operands and various methods of indexing memory. Figure 11.10 illustrates two designs found in modern microprocessors for decoding variable-length instructions. The first alternative, in Fig. 11.10(a), involves an additional instruction decode stage in the original pipeline design. In this model, the first stage is used to determine instruction lengths and steer the instructions to the second stage, where the actual instruction decoding is performed. The second alternative, in Fig. 11.10(b), involves pre-decoding and marking instruction lengths in the instruction cache. This design methodology has been used effectively in decoding X86 variable-length instructions [5]. The primary advantage of this scheme is the reduction in the number of decode stages in the pipeline design. However, the method requires a larger instruction cache structure for holding the resolved instruction information.

FIGURE 11.10 Variable-sized instruction decoding: (a) staging, (b) pre-decoding.
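The pre-decoding alternative of Fig. 11.10(b) can be pictured as a pass over the fetched bytes that records where each instruction starts before the main decoders run. The sketch below uses a purely hypothetical length rule based on the first opcode byte; real X86 length determination also involves prefixes and ModRM bytes:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: instruction length determined by the first opcode byte alone. */
static unsigned insn_length(uint8_t opcode) {
    return (opcode & 0x80u) ? 4u : 2u;       /* made-up 2- or 4-byte formats   */
}

/* Mark instruction boundaries in a fetched line, as a pre-decode stage might,
 * so later decode stages can steer instructions directly.                    */
void predecode(const uint8_t *bytes, size_t n, uint8_t *start_bit) {
    for (size_t k = 0; k < n; k++)
        start_bit[k] = 0;
    size_t i = 0;
    while (i < n) {
        start_bit[i] = 1;                    /* boundary marker stored along   */
        i += insn_length(bytes[i]);          /* with the line in the I-cache   */
    }
}
```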
Architecture Model

Several instruction set architecture models have existed over the past three decades of computing. First, complex instruction set computers (CISC) characterized designs with variable instruction formats, numerous memory addressing modes, and large numbers of instruction types. The original CISC philosophy was to create instruction sets that resembled high-level programming languages in an effort to simplify compiler technology. In addition, the design constraint of small memory capacity also led to the development of CISC. The two primary architecture examples of the CISC model are the Digital VAX and Intel X86 architecture families.

Reduced instruction set computers (RISC) gained favor with the philosophy of uniform instruction lengths, load-store instruction sets, limited addressing modes, and a reduced number of operation types. RISC concepts allow the microarchitecture of machines to be more easily pipelined, reducing the processor clock cycle time and increasing the overall speed of a machine. The RISC concept resulted from improvements in programming languages, compiler technology, and memory size. The HP PA-RISC, Sun SPARC, IBM PowerPC, MIPS, and DEC Alpha machines are examples of RISC architectures.

An architecture model that allows multiple instructions to issue in a clock cycle is the very long instruction word (VLIW) model. VLIWs issue a fixed number of operations conveyed as a single long instruction and place the responsibility for creating the parallel instruction packet on the compiler. Early VLIW processors suffered from code expansion because unused operation slots in the long instructions still had to be encoded. Examples of VLIW technology are the Multiflow Trace and Cydrome Cydra machines. Explicitly parallel instruction computing (EPIC) is similar in concept to VLIW in that both use the compiler to explicitly group instructions for parallel execution. In fact, many of the ideas for EPIC architectures come from previous RISC and VLIW machines. In general, the EPIC concept addresses the excessive code expansion and scalability problems associated with VLIW models while retaining compiler-specified parallelism. Also, the trend toward compiler-controlled architecture mechanisms is generally considered part of the EPIC-style architecture domain. The Intel IA-64, Philips TriMedia, and Texas Instruments 'C6X are examples of EPIC machines.

11.5 Instruction-Level Parallelism

Modern processors are being designed with the ability to execute many parallel operations at the instruction level. Such processors are said to exploit instruction-level parallelism (ILP). Exploiting ILP is recognized as a fundamental architecture concept for improving microprocessor performance, and there is a wide range of architecture techniques that define how an architecture can exploit ILP.

11.5.1 Dynamic Instruction Execution

A major limitation of pipelining techniques is the use of in-order instruction execution. When an instruction in the pipeline stalls, no further instructions are allowed to proceed, to ensure proper execution of in-flight instructions. This problem is especially serious for multiple-issue machines, where each stall cycle potentially costs the work of multiple instructions. However, in many cases an instruction could execute properly if no data dependence exists between the stalled instruction and the instruction waiting to execute. Static scheduling is a compiler-oriented approach that schedules instructions to separate dependent instructions and minimize the number of hazards and pipeline stalls.
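The effect of static scheduling is easiest to picture at the source level, even though the compiler actually reorders machine instructions. In the sketch below, moving the independent multiply between the load and its use hides part of the load latency; the two functions compute the same result:

```c
/* Before scheduling: the add depends on the load, so an in-order pipeline
 * stalls until the loaded value is available.                              */
int before(const int *a, int x, int y) {
    int t = a[0];          /* load                                          */
    int u = t + 1;         /* immediately uses the loaded value: stall      */
    int v = x * y;         /* independent work                              */
    return u + v;
}

/* After static scheduling: independent work fills the load-use gap.        */
int after(const int *a, int x, int y) {
    int t = a[0];          /* load                                          */
    int v = x * y;         /* independent multiply hides load latency       */
    int u = t + 1;         /* use of the loaded value, one slot later       */
    return u + v;
}
```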
Dynamic scheduling is another approach that uses hardware to rearrange the instruction execution to reduce stalls. The concept of dynamic execution uses hardware to detect dependences in the in-order instruction stream and to rearrange the instruction sequence in the presence of detected dependences and stalls. Today, most modern superscalar microprocessors use dynamic out-of-order scheduling techniques to increase the number of instructions executed per cycle. Such microprocessors use basically the same dynamically scheduled pipeline concept: all instructions pass through an issue stage in order, are executed out of order, and are retired in order. There are several functional elements of this common sequence that have developed into computer architecture concepts.

The first functional concept is scoreboarding. Scoreboarding is a technique for allowing instructions to execute out of order when there are available resources and no data dependences. Scoreboarding originates from the CDC 6600 machine's issue logic, named the scoreboard. The overall goal of scoreboarding is to execute every instruction as early as possible.

A more advanced approach to dynamic execution is Tomasulo's approach, which was employed in the IBM 360/91 processor. Although there are many variations on this scheme, the key concept of avoiding write-after-read (WAR) and write-after-write (WAW) dependences during dynamic execution is attributed to Tomasulo. In Tomasulo's scheme, the functionality of the scoreboard is provided by the reservation stations. Reservation stations buffer the operands of instructions waiting to issue, capturing them as soon as they become available. The concept is to issue new instructions immediately when all source operands become available, instead of accessing such operands through the register file. As such, waiting instructions designate the reservation station entry that will provide their input operands. This removes WAW dependences caused by successive writes to the same register by forcing instructions to be related by data dependences instead of by register specifiers. In general, the renaming of register specifiers for pending operands to the reservation station entries is called register renaming. Overall, Tomasulo's scheme combines scoreboarding and register renaming. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units" [6] provides the complete details of Tomasulo's scheme.
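A highly simplified sketch of the renaming idea: each architectural destination register is mapped to a fresh tag standing in for a reservation-station entry, so later readers wait on the producing tag rather than on the register specifier, and reuses of the same register name no longer order execution. The structures here are illustrative and not a model of any real machine:

```c
#include <stdio.h>

#define ARCH_REGS  8
#define IN_REGFILE (-1)          /* value is available from the register file */

static int map[ARCH_REGS];       /* arch register -> tag of its latest producer */
static int next_tag;

/* "Issue" dst <- op(src1, src2): read the current source mappings, then
 * rename the destination to a fresh tag (a -1 tag means "read the register"). */
static void issue(const char *op, int dst, int src1, int src2) {
    int t1 = map[src1], t2 = map[src2];
    int tag = next_tag++;
    printf("%s r%d, r%d, r%d : sources wait on tags %d/%d, r%d renamed to tag %d\n",
           op, dst, src1, src2, t1, t2, dst, tag);
    map[dst] = tag;
}

int main(void) {
    for (int i = 0; i < ARCH_REGS; i++) map[i] = IN_REGFILE;
    issue("mul", 1, 2, 3);       /* r1 produced by tag 0                       */
    issue("add", 1, 1, 4);       /* reads tag 0; r1 re-renamed to tag 1, so    */
    issue("sub", 5, 1, 6);       /* the WAW reuse of r1 no longer orders execution */
    return 0;
}
```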
11.5.2 Predicated Execution

Branch instructions are recognized as a major impediment to exploiting ILP. Branches force the compiler and hardware to make frequent predictions of branch directions in an attempt to find sufficient parallelism. Misprediction of these branches can result in severe performance degradation through the introduction of wasted cycles into the instruction stream. Branch prediction strategies reduce this problem by allowing the compiler and hardware to continue processing instructions along the predicted control path, thus eliminating these wasted cycles.

Predicated execution support provides an effective means to eliminate branches from an instruction stream. Predicated execution refers to the conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate of the instruction. This architectural support allows the compiler to use an if-conversion algorithm to convert conditional branches into predicate-defining instructions, and instructions along alternative paths of each branch into predicated instructions [7]. Predicated instructions are fetched regardless of their predicate value. Instructions whose predicate value is true are executed normally. Conversely, instructions whose predicate is false are nullified, and thus are prevented from modifying the processor state. Predicated execution allows the compiler to trade instruction fetch efficiency for the capability to expose ILP to the hardware along multiple execution paths.

Predicated execution offers the opportunity to improve branch handling in microprocessors. Eliminating frequently mispredicted branches may lead to a substantial reduction in branch prediction misses. As a result, the performance penalties associated with the eliminated branches are removed. Eliminating branches also reduces the need to handle multiple branches per cycle in wide-issue processors. Finally, predicated execution provides an efficient interface for the compiler to expose multiple execution paths to the hardware. Without compiler support, the cost of maintaining multiple execution paths in hardware grows rapidly.

The essence of predicated execution is the ability to suppress the modification of the processor state based upon some execution condition. Full predication cleanly supports this through a combination of instruction set and microarchitecture extensions. These extensions can be classified as support for suppression of execution and expression of condition. The result of the condition that determines whether an instruction should modify state is stored in a set of 1-bit registers. These registers are collectively referred to as the predicate register file. The values in the predicate register file are associated with each instruction in the extended instruction set through the use of an additional source operand. This operand specifies which predicate register will determine whether the operation should modify processor state. If the value in the specified register is 1, or true, the instruction is executed normally; if the value is 0, or false, the instruction is suppressed.

Predicate register values may be set using predicate define instructions. The predicate define semantics used are those of the HPL PlayDoh architecture [8]. There is a predicate define instruction for each comparison opcode in the original instruction set. The major difference from conventional comparison instructions is that these predicate defines have up to two destination registers and that their destination registers are predicate registers. The general form of a predicate define is:

pred_<cmp> Pout1<type1>, Pout2<type2>, src1, src2 (Pin)

This instruction assigns values to Pout1 and Pout2 according to a comparison of src1 and src2 specified by <cmp>. The comparison <cmp> can be: equal (eq), not equal (ne), greater than (gt), etc. A predicate <type> is specified for each destination predicate. Predicate-defining instructions are themselves predicated, as specified by Pin. The predicate <type> determines the value written to the destination predicate register based upon the result of the comparison and the input predicate, Pin. For each combination of comparison result and Pin, one of three actions may be performed on the destination predicate: write 1, write 0, or leave it unchanged. There are six predicate types that are particularly useful: the unconditional (U), OR, and AND type predicates and their complements. Table 11.1 contains the truth table for these predicate definition types. Unconditional destination predicate registers are always defined, regardless of the value of Pin and the result of the comparison.
If the value of Pin is 1, the result of the comparison is placed in the predicate register (or its complement, for the complemented unconditional type). Otherwise, a 0 is written to the predicate register. Unconditional predicates are utilized for blocks that are executed based on a single condition. The OR-type predicates are useful when execution of a block can be enabled by multiple conditions, such as the logical AND (&&) and OR (||) constructs in C. An OR-type destination predicate register is set if Pin is 1 and the result of the comparison is 1 (0 for the complemented OR type); otherwise, the destination predicate register is left unchanged. Note that OR-type predicates must be explicitly initialized to 0 before they are defined and used.

TABLE 11.1 Predicate Definition Truth Table
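A small sketch of the U- and OR-type semantics just described (plain C functions over 1-bit values; this illustrates the truth-table behavior summarized in Table 11.1, not the PlayDoh encoding):

```c
#include <stdbool.h>
#include <stdio.h>

/* Unconditional (U) type: always written.  dest = cmp if Pin, else 0.        */
static bool pred_define_U(bool p_in, bool cmp)    { return p_in ? cmp : false; }

/* Complemented unconditional type: dest = !cmp if Pin, else 0.               */
static bool pred_define_Ubar(bool p_in, bool cmp) { return p_in ? !cmp : false; }

/* OR type: written (to 1) only when Pin and cmp are both true; otherwise the
 * old value is left unchanged, so it must be initialized to 0 beforehand.    */
static bool pred_define_OR(bool old, bool p_in, bool cmp) {
    return (p_in && cmp) ? true : old;
}

int main(void) {
    bool p = false;                       /* OR-type predicate, initialized to 0 */
    p = pred_define_OR(p, true, false);   /* condition false: left unchanged (0) */
    p = pred_define_OR(p, true, true);    /* condition true: set to 1            */
    printf("U=%d Ubar=%d OR=%d\n", pred_define_U(true, true),
           pred_define_Ubar(true, true), p);
    return 0;
}
```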
